Target detection method, device, medium and electronic device
By using the feature fusion and weighting modules in the feature extraction model, the problem of insufficient information utilization between feature pyramid levels is solved, thereby improving the accuracy and adaptability of target detection.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: XIAN ORDNANCE IND TECH IND DEV CO LTD
- Filing Date: 2025-02-11
- Publication Date: 2026-06-23
AI Technical Summary
Existing object detection algorithms transfer information separately between different levels of the feature pyramid during feature extraction, resulting in the ineffective utilization of features at different scales and making it difficult to improve detection accuracy.
By using the first feature extraction module, the first feature fusion module, and the first feature weighting module in the feature extraction model, the fusion and weighting between pyramid features are enhanced, the importance of low-level and high-level information is balanced, and the feature extraction capability is improved.
It enhances the fusion of features at different scales, improves detection accuracy and the model's feature extraction capabilities, and adapts to multi-scale target detection.
Figure CN119888198B_ABST
Description
Technical Field
[0001] This application relates to the field of computer vision, and more particularly to a target detection method, apparatus, medium, and electronic device.
Background Technology
[0002] Object detection is an important task in computer vision and has a wide range of applications, including intelligent security, autonomous driving, robot navigation, and medical diagnosis.
[0003] The primary goal of object detection is to accurately identify the category of objects and locate the position of specific targets in complex visual scenes. With technological advancements and the diversification of data sources, traditional single-scale features may not be effective in capturing the features of targets of different sizes and resolutions. To address this issue, modern object detection algorithms have introduced pyramid structures or Feature Pyramid Networks (FPNs) to organize feature images from different scales in a hierarchical manner. This strategy enables the algorithm to handle both small and large targets simultaneously, thereby improving detection accuracy and generalization ability.
[0004] However, existing object detection algorithms transfer information between different levels of the feature pyramid separately when extracting feature information, resulting in features at different scales not being effectively utilized, which makes it difficult to improve detection accuracy.
Summary of the Invention
[0005] This application provides a target detection method, apparatus, medium, and electronic device that can enhance the fusion of features at different levels of the feature pyramid, thereby improving detection accuracy.
[0006] In a first aspect, this application provides a target detection method, comprising:
[0007] Obtain the input image to be detected;
[0008] The input image is input into the feature extraction model to extract features from the input image. The feature extraction model includes a first feature extraction module, a first feature fusion module, and a first feature weighting module.
[0009] The first feature extraction module extracts pyramid features from the input image, wherein the pyramid features include multiple levels of first feature maps;
[0010] The first feature fusion module fuses the first feature maps of each level in the pyramid features to obtain the second feature map of each level after fusion;
[0011] The first feature weighting module weights the second feature map at each level to obtain the detection pyramid features;
[0012] Target detection is performed on the features of the detection pyramid to obtain the detection results.
[0013] According to the target detection method in this embodiment, a feature extraction model is used to extract features from the input image. This feature extraction model includes a first feature extraction module, a first feature fusion module, and a first feature weighting module. The first feature extraction module extracts pyramid features from the input image. The first feature fusion module fuses feature maps from different levels of the pyramid features, enhancing the fusion between features at different scales. This balances the importance of information at lower and higher levels, avoiding the loss of local or global information and improving detection accuracy. Furthermore, the first feature weighting module weights the features at each level of the pyramid features, thereby enhancing the focus on and utilization of important features during feature extraction, accelerating model convergence, and improving the model's feature extraction capability.
[0014] In an exemplary embodiment, fusing the first feature maps of each level in the pyramid features to obtain a fused second feature map for each level includes:
[0015] The first feature map of the target layer in the pyramid features is fused with the first feature map of the adjacent upper layer of the target layer to obtain the second feature map of the target layer;
[0016] The target layer is the layer in the pyramid feature excluding the top layer; the second feature map of the top layer of the pyramid feature is the same as the first feature map.
[0017] In an exemplary embodiment, the feature extraction model further includes a second feature fusion module.
[0018] The first feature weighting module weights the second feature map at each level to obtain the detection pyramid features, including:
[0019] The second feature fusion module fuses the second feature map of the target layer in the pyramid features with the second feature maps of each upper layer above the target layer to obtain the third feature map of the target layer.
[0020] Wherein, the target layer is the layer in the pyramid feature excluding the top layer; the third feature map of the top layer of the pyramid feature is the same as the second feature map;
[0021] The first feature weighting module weights the third feature maps of each level to obtain the detection pyramid features.
[0022] In an exemplary embodiment, the feature extraction model further includes a third feature fusion module, wherein the pyramid features are arranged from top to bottom as top layer, middle layer, and bottom layer.
[0023] The first feature weighting module weights the second feature map at each level to obtain the detection pyramid features, including:
[0024] The third feature fusion module fuses the third feature map of the top layer of the pyramid features with the third feature map of the bottom layer to obtain the fourth feature map of the bottom layer; and
[0025] The third feature map of the top layer of the pyramid features is fused with the third feature map of the middle layer to obtain the fourth feature map of the middle layer.
[0026] The fourth feature map at the top layer of the pyramid features is the same as the third feature map.
[0027] The first feature weighting module weights the fourth feature map of each level to obtain the detection pyramid features.
[0028] In an exemplary embodiment, the first feature weighting module weights the second feature map at each level to obtain detection pyramid features, including:
[0029] The first feature weighting module performs max pooling on the second feature map of each level to obtain a first pooled feature map, and performs average pooling on the second feature map of each level to obtain a second pooled feature map.
[0030] The first feature weighting module combines the first pooling feature map and the second pooling feature map of each level to obtain the feature map to be screened at each level;
[0031] The first feature weighting module determines the weight of the feature map to be screened through an activation function, and performs weighting on the feature map to be screened based on the weight to obtain the detection pyramid features.
[0032] In an exemplary embodiment, fusing the first feature maps of each level in the pyramid features to obtain the fused second feature map of each level includes:
[0033] The first feature map of the upper layer of the target layer in the pyramid feature is upsampled to obtain the first feature map to be fused in the upper layer. The first feature map to be fused has the same size as the first feature map of the target layer.
[0034] The first feature map to be fused is fused with the first feature map of the target layer to obtain the second feature map of the target layer;
[0035] The target layer is the layer other than the topmost layer of the pyramid feature.
[0036] In an exemplary embodiment, the feature extraction model further includes a second feature weighting module, which is used to weight the fourth feature map of the top layer and the fourth feature map of the bottom layer, and to fuse the weighted feature maps to obtain the detection pyramid features.
[0037] In a second aspect, this application provides a target detection device, comprising:
[0038] The image acquisition module is used to acquire the input image to be detected;
[0039] An image input module is used to input the input image into a feature extraction model to extract features from the input image. The feature extraction model includes a first feature extraction module, a first feature fusion module, and a first feature weighting module.
[0040] The first feature extraction module extracts pyramid features from the input image, wherein the pyramid features include multiple levels of first feature maps;
[0041] The first feature fusion module fuses the first feature maps of each level in the pyramid features to obtain the second feature map of each level after fusion;
[0042] The first feature weighting module weights the second feature map at each level to obtain the detection pyramid features;
[0043] The detection output module is used to perform target detection on the detection pyramid features and obtain the detection results.
[0044] In a third aspect, this application provides an electronic device including a memory and one or more processors. The memory stores one or more computer programs, each including instructions that, when executed by the processor, cause the electronic device to perform the target detection method as described in the first aspect.
[0045] In a fourth aspect, this application provides a computer-readable storage medium storing instructions that, when executed on an electronic device, cause the electronic device to perform the target detection method as described in the first aspect.
[0046] In a fifth aspect, this application provides a computer program product that, when run on an electronic device, causes the electronic device to perform the target detection method as described in the first aspect.
[0047] It can be understood that, for the beneficial effects achieved by the target detection device, electronic device, computer-readable storage medium, and computer program product provided above, reference may be made to the beneficial effects described for the first aspect; they are not repeated here.
Attached Figure Description
[0048] Figure 1 is a schematic flowchart of the target detection method provided in the embodiments of this application;
[0049] Figure 2 is a schematic diagram of the target detection device provided in the embodiments of this application;
[0050] Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0051] To facilitate a clear description of the technical solutions in the embodiments of this application, the terms "first" and "second" are used in the embodiments of this application to distinguish identical or similar items with substantially the same function and effect. For example, "first chip" and "second chip" are only used to distinguish different chips and do not limit their order. Those skilled in the art will understand that the terms "first" and "second" do not limit the quantity or execution order, and the terms "first" and "second" do not necessarily imply that they are different. It should be noted that in the embodiments of this application, the words "exemplary" or "for example" are used to indicate that they are examples, illustrations, or descriptions. Any embodiment or design scheme described as "exemplary" or "for example" in this application should not be construed as being better or more advantageous than other embodiments or design schemes. Specifically, the use of the words "exemplary" or "for example" is intended to present the relevant concepts in a specific manner. In the embodiments of this application, "at least one" means one or more, and "more than one" means two or more.
[0052] It should be noted that "at the time of..." in the embodiments of this application can be either at the instant when a certain situation occurs, or for a period of time after the occurrence of a certain situation. The embodiments of this application do not make specific limitations on this.
[0053] The implementation of this embodiment will now be described in detail with reference to the accompanying drawings.
[0054] This embodiment provides a target detection method. For example, this target detection method can be applied to various electronic devices such as computers (PCs), tablets, virtual reality / augmented reality devices, wearable devices, industrial computers, and vehicle systems; it can also be applied to servers, cloud computing, server clusters, etc. This embodiment does not impose any special limitations on it.
[0055] Figure 1 shows a flowchart of the target detection method provided in an embodiment of this application.
[0056] As shown in Figure 1, the target detection method may include the following steps:
[0057] Step 101: Obtain the input image to be detected.
[0058] The input image can be an image obtained by taking a picture of the object to be detected, or an image obtained by preprocessing the picture. The object to be detected can be a scene, object, human body, animal, or face, etc. Preprocessing can include cropping, grayscale processing, slicing, etc., and this embodiment does not make any special limitations on the above.
[0059] Step 102: Input the input image into the feature extraction model and perform feature extraction on the input image. The feature extraction model includes a first feature extraction module, a first feature fusion module, and a first feature weighting module.
[0060] For example, the input image can be represented as image I. Image I can be sliced to obtain image S after slicing. Then, image S is input into the feature extraction model for feature extraction processing.
[0061] The slicing operation specifically involves dividing the image into sections according to a specific method, reducing the height and width of the input image, and simultaneously using multiple convolutional kernels to increase the number of channels in the input image. After the slicing operation, the spatial information of the input image is preserved, while the number of channels increases, forming a new feature map. For example, an input image of 3*256*256 can generate a 32*128*128 feature map after the slicing operation.
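The shape change in the example above can be reproduced with a minimal sketch. The text does not specify the slicing method, so this assumes a space-to-depth rearrangement (every second pixel at four offsets) followed by a 1x1 convolution; the function name `slice_image` and the random, untrained kernels are hypothetical.

```python
import numpy as np

def slice_image(image: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Space-to-depth slicing followed by a 1x1 convolution (assumed design).

    image:   (C, H, W) array with even H and W.
    weights: (C_out, 4 * C) array holding one 1x1 kernel per output channel.
    """
    # Take every second pixel at four phase offsets; each slice is (C, H/2, W/2),
    # so no pixel of the input is discarded.
    slices = [image[:, i::2, j::2] for i in (0, 1) for j in (0, 1)]
    stacked = np.concatenate(slices, axis=0)  # (4C, H/2, W/2)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", weights, stacked)

rng = np.random.default_rng(0)
image_i = rng.random((3, 256, 256))   # stand-in for image I
kernels = rng.random((32, 12))        # hypothetical, untrained kernels
image_s = slice_image(image_i, kernels)
print(image_s.shape)                  # (32, 128, 128)
```

Height and width are halved while the channel count grows from 3 to 32, matching the 3*256*256 to 32*128*128 example.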
[0062] The feature extraction model in this embodiment may include a first feature extraction module, a first feature fusion module, and a first feature weighting module. These modules are used to extract features from the input image to obtain the detection pyramid features of the input image. Specifically:
[0063] Step 103: The first feature extraction module uses a convolutional neural network to extract pyramid features from the input image, and the pyramid features include multiple levels of first feature maps.
[0064] Step 104: The first feature fusion module fuses the first feature maps of each level in the pyramid features to obtain the second feature map of each level after fusion.
[0065] Step 105: The first feature weighting module weights the second feature map at each level to obtain the detection pyramid features.
[0066] Step 106: Perform target detection on the detection pyramid features to obtain the detection results.
[0067] The first feature extraction module, the first feature fusion module, and the first feature weighting module are all implemented using convolutional layers. These convolutional layers process the input image sequentially to obtain the detection pyramid features of the input image.
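The ordering of Steps 103 to 106 can be summarized as a simple composition of stages. This is only a structural sketch: `detect`, `stage`, and the placeholder outputs are hypothetical names, and the stand-in stages merely record the order in which they are called.

```python
from typing import Any, Callable, List

Pyramid = List[Any]  # one feature map per pyramid level

def detect(image: Any,
           extract: Callable[[Any], Pyramid],
           fuse: Callable[[Pyramid], Pyramid],
           weight: Callable[[Pyramid], Pyramid],
           head: Callable[[Pyramid], Any]) -> Any:
    pyramid = extract(image)            # Step 103: first feature maps
    fused = fuse(pyramid)               # Step 104: second feature maps
    detection_pyramid = weight(fused)   # Step 105: detection pyramid features
    return head(detection_pyramid)      # Step 106: detection results

# Toy stand-ins that only record the call order (not real modules).
calls: List[str] = []

def stage(name: str, output: Any) -> Callable[[Any], Any]:
    def run(_inputs: Any) -> Any:
        calls.append(name)
        return output
    return run

result = detect(
    "image",
    extract=stage("extract", ["C1", "C2", "C3"]),
    fuse=stage("fuse", ["F1", "F2", "F3"]),
    weight=stage("weight", ["W1", "W2", "W3"]),
    head=stage("head", "detections"),
)
print(calls)  # ['extract', 'fuse', 'weight', 'head']
```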
[0068] The first feature extraction module uses a convolutional neural network (CNN) for feature extraction, specifically a feature pyramid network. By inputting the input image into the feature pyramid network, pyramid features of the input image can be extracted. Pyramid features refer to multiple images of progressively smaller sizes, with each layer being a scaled-down version of the previous layer.
[0069] For example, this feature pyramid network can consist of a three-layer structure, each layer including two convolutional layers, each with a 3x3 kernel and padding of 1, where a batch normalization layer and a ReLU activation function can be added after each convolutional layer, followed by a max pooling layer with a pooling window size of 2x2 and a stride of 2. The input image is first sliced to generate image S, and feature extraction is then performed on image S. Each layer outputs a new feature map whose size is half that of its input. After three layers of processing, three feature maps C1, C2, and C3 of different scales are obtained, which are the first feature maps. The first feature maps differ in scale from level to level: the bottom layer has the largest scale and the top layer has the smallest. The bottom feature maps contain more local information, while the top feature maps contain more global information.
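The halving at each layer follows from the standard output-size formula for convolution and pooling, out = floor((in + 2 * padding - kernel) / stride) + 1. The sketch below applies it to the three-layer example; the starting size of 128 is an assumption taken from the 32*128*128 slicing example, and the helper names are illustrative.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

def layer_out(size: int) -> int:
    """One pyramid layer: two 3x3 convolutions (padding 1), then 2x2 max pooling."""
    size = conv_out(size, kernel=3, padding=1)   # first convolution keeps the size
    size = conv_out(size, kernel=3, padding=1)   # second convolution keeps the size
    return conv_out(size, kernel=2, stride=2)    # pooling halves the size

size = 128  # assumed spatial size of image S
sizes = []
for _ in range(3):
    size = layer_out(size)
    sizes.append(size)
print(sizes)  # [64, 32, 16]
```

Only the pooling layer changes the spatial size; the 3x3 convolutions with padding 1 preserve it.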
[0070] The feature pyramid network can also include a structure with more layers, such as 5 layers, 6 layers, etc. Correspondingly, the pyramid features extracted by the feature pyramid network can also include more levels of first feature maps, such as 5 levels, 6 levels, etc.
[0071] After the feature pyramid network extracts pyramid features at multiple levels, these pyramid features can be used as input to the first feature fusion module. The first feature fusion module fuses the first feature maps of each level to obtain the second feature map of each level after fusion.
[0072] The first feature fusion module merges different first feature maps by adding or multiplying their corresponding elements. Specifically, the first feature fusion module merges the first feature map of the target layer in the pyramid feature with the first feature map of the adjacent upper layer of the target layer to obtain the second feature map of the target layer. Here, the target layer is any layer in the pyramid feature except the top layer, and the second feature map of the top layer of the pyramid feature is the same as its first feature map.
[0073] Taking a pyramid feature with three levels as an example, the three levels are called the top, middle, and bottom layers from top to bottom. The top layer is the topmost layer, and the target layers are the middle and bottom layers.
[0074] When the target layer is the middle layer, its adjacent upper layer is the top layer; when the target layer is the bottom layer, its adjacent upper layer is the middle layer. The first feature fusion module can fuse the first feature map of the top layer and the first feature map of the middle layer, and the fused feature map becomes the second feature map of the middle layer. The first feature fusion module can also fuse the first feature map of the middle layer and the first feature map of the bottom layer, and the fused feature map becomes the second feature map of the bottom layer. The first feature map of the top layer is passed through directly as the second feature map of the top layer. After fusion, the second feature maps of each layer form a new pyramid feature, which is then passed to the next module, namely the first feature weighting module, for processing.
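The adjacent-layer fusion described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: fusion is element-wise addition, upsampling is nearest-neighbor with an integer factor, all levels share the same channel count, and the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_adjacent(first_maps):
    """first_maps is ordered bottom -> top; returns the second feature maps."""
    second_maps = []
    for level, fmap in enumerate(first_maps[:-1]):
        upper = first_maps[level + 1]              # adjacent upper layer
        factor = fmap.shape[1] // upper.shape[1]   # bring it to the target size
        second_maps.append(fmap + upsample_nearest(upper, factor))
    second_maps.append(first_maps[-1])             # top layer passes through
    return second_maps

bottom = np.full((1, 4, 4), 1.0)
middle = np.full((1, 2, 2), 10.0)
top = np.full((1, 1, 1), 100.0)
second = fuse_adjacent([bottom, middle, top])
print(second[0][0, 0, 0], second[1][0, 0, 0], second[2][0, 0, 0])  # 11.0 110.0 100.0
```

The bottom layer receives only the middle layer's contribution and the middle layer only the top layer's, which is what distinguishes this stage from fusion with every upper layer.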
[0075] In this embodiment, feature maps of adjacent scales are fused, and the model can learn feature representations of different scales, thus making it more suitable for detecting objects of different scales and improving the accuracy of multi-scale target detection.
[0076] For example, the feature extraction model further includes a second feature fusion module, which fuses the second feature map of the target layer in the pyramid feature with the second feature maps of each upper layer above the target layer to obtain the third feature map of the target layer; wherein, the target layer is the layer in the pyramid feature excluding the top layer; the third feature map of the top layer of the pyramid feature is the same as the second feature map; the first feature weighting module weights the third feature maps of each layer to obtain the detection pyramid feature.
[0077] Specifically, when the target layer is the bottom layer, the second feature fusion module can fuse the second feature maps of the top layer, the middle layer, and the bottom layer, and the fused feature map becomes the third feature map of the bottom layer. When the target layer is the middle layer, the second feature fusion module fuses the second feature maps of the top layer and the middle layer, and the fused feature map becomes the third feature map of the middle layer. The second feature map of the top layer is directly used as the third feature map of the top layer, and the third feature maps of each layer constitute a new pyramid feature map, which is then passed to the next module, namely the first feature weighting module, for processing.
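A sketch of this second fusion stage, in which each target layer is fused with the second feature maps of every layer above it. As an assumption for illustration, fusion is element-wise addition after nearest-neighbor upsampling, and `fuse_with_all_upper` is a hypothetical name.

```python
import numpy as np

def upsample_to(x: np.ndarray, height: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, H, W) map to a target height (square maps)."""
    factor = height // x.shape[1]
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_with_all_upper(second_maps):
    """second_maps is ordered bottom -> top; returns the third feature maps."""
    third_maps = []
    for level, fmap in enumerate(second_maps):
        fused = fmap.copy()
        # Add every layer above the target layer; the loop is empty for the
        # top layer, so its third map equals its second map.
        for upper in second_maps[level + 1:]:
            fused += upsample_to(upper, fmap.shape[1])
        third_maps.append(fused)
    return third_maps

second_maps = [np.full((1, 4, 4), 1.0), np.full((1, 2, 2), 10.0), np.full((1, 1, 1), 100.0)]
third = fuse_with_all_upper(second_maps)
print(third[0][0, 0, 0], third[1][0, 0, 0], third[2][0, 0, 0])  # 111.0 110.0 100.0
```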
[0078] In this embodiment, the feature map at the bottom layer can be fused with the feature maps of all the layers above it, thereby learning the global information extracted by the top layer, enhancing the global information at the bottom layer, making the information contained in the features richer, and enhancing the representation ability of the features.
[0079] In an exemplary embodiment, the feature extraction model further includes a third feature fusion module, wherein the pyramid features are arranged from top to bottom as top layer, middle layer, and bottom layer. The third feature fusion module fuses the third feature map of the top layer with the third feature map of the bottom layer to obtain the fourth feature map of the bottom layer; and fuses the third feature map of the top layer with the third feature map of the middle layer to obtain the fourth feature map of the middle layer; the fourth feature map of the top layer of the pyramid features is the same as the third feature map; the first feature weighting module weights the fourth feature map of each layer to obtain the detection pyramid features.
[0080] For example, the pyramid features, from bottom to top, include feature map P1, feature map P2, and feature map P3. For the high-level feature map P3, the third feature fusion module first uses a transposed convolution with a stride of 2 and a kernel of 3*3 to expand the high-level feature map, increasing the size of feature map P3. Then, it uses bilinear interpolation to upsample the feature map P3 so that the sampled feature map P3 has the same size as the bottom feature map P1. Finally, the two are fused together.
[0081] In this implementation, the top and bottom layers are fused, and feature maps from non-adjacent layers are merged. This increases the scale difference between different layers, thereby learning to extract features under large scale differences. This makes the model more adaptable to feature extraction at different scales, which is beneficial for improving the accuracy of multi-scale object detection. The fused fourth feature map constitutes a new pyramid feature and is input into the first feature weighting module for weighting.
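The size arithmetic of the transposed-convolution step applied to P3 can be checked with the standard output-size formula, out = (in - 1) * stride - 2 * padding + kernel + output_padding. The text gives only the stride (2) and kernel (3*3); the padding values and the 64/16 map sizes below are assumptions for illustration.

```python
def transposed_conv_out(size: int, kernel: int, stride: int,
                        padding: int = 0, output_padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

p1_size, p3_size = 64, 16  # assumed sizes of the bottom and top feature maps

# Without padding, a stride-2, 3x3 transposed convolution gives 2 * in + 1.
no_padding = transposed_conv_out(p3_size, kernel=3, stride=2)
# With padding 1 and output padding 1 (an assumed choice) the size exactly doubles.
doubled = transposed_conv_out(p3_size, kernel=3, stride=2, padding=1, output_padding=1)
# Scale factor left for the bilinear interpolation to reach the size of P1.
remaining_scale = p1_size / doubled

print(no_padding, doubled, remaining_scale)  # 33 32 2.0
```

Under these assumptions the transposed convolution covers one factor of 2 and the bilinear interpolation covers the remaining factor of 2 needed to match P1.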
[0082] In an exemplary implementation, when fusing feature maps from different levels, the upper-level feature map can be upsampled first, making its size the same as the lower-level feature map, and then corresponding elements are added and fused to obtain the fused feature map. In general, the first feature map of the upper layer of the target layer in the pyramid feature is upsampled to obtain the first feature map to be fused, which has the same size as the first feature map of the target layer; the first feature map to be fused is then fused with the first feature map of the target layer to obtain the second feature map of the target layer.
[0083] Taking a three-level pyramid feature as an example, this embodiment provides three feature fusion modules, each of which uses the above method when performing fusion. For two or three feature maps from different levels to be fused, each upper-level feature map is upsampled so that it is the same size as the lower-level feature map before fusion. For example, when the target layer is the bottom layer and its first feature map is fused with that of the middle layer, the first feature map of the middle layer can be upsampled so that the resulting first feature map to be fused is the same size as the first feature map of the bottom layer. The first feature map to be fused is then fused with the first feature map of the bottom layer, and the fused feature map is used as the second feature map of the bottom layer. When the second feature map of the bottom layer is fused with those of the middle and top layers, the second feature maps of the middle layer and of the top layer can each be upsampled to the size of the second feature map of the bottom layer. The corresponding elements of the three same-size feature maps are then added to obtain the fused feature map, which is used as the third feature map of the bottom layer. When the third feature maps of the bottom layer and the top layer are fused, the third feature map of the top layer is first upsampled to the size of the third feature map of the bottom layer. The upsampled feature map is then fused with the third feature map of the bottom layer, and the fused feature map is used as the fourth feature map of the bottom layer.
[0084] The sampling factor can be different for each upsampling operation. Upsampling can be implemented using algorithms such as nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation, and this implementation is not limited to these.
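As an illustration of one of the listed options, the sketch below implements bilinear upsampling of a (C, H, W) feature map to an arbitrary target size. It assumes corner-aligned sampling (the corner pixels of input and output coincide), which is one common convention; the function name `upsample_bilinear` is hypothetical.

```python
import numpy as np

def upsample_bilinear(x: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear resize of a (C, H, W) map with corner-aligned sampling."""
    _, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)          # source row coordinate per output row
    xs = np.linspace(0, w - 1, out_w)          # source column coordinate per output column
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)             # clamp neighbours at the border
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]              # vertical interpolation weights
    wx = (xs - x0)[None, None, :]              # horizontal interpolation weights
    upper = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    lower = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return upper * (1 - wy) + lower * wy

fmap = np.array([[[0.0, 1.0],
                  [2.0, 3.0]]])                # shape (1, 2, 2)
up = upsample_bilinear(fmap, 3, 3)
print(up[0])
# [[0.  0.5 1. ]
#  [1.  1.5 2. ]
#  [2.  2.5 3. ]]
```

Because the target size is passed explicitly, the same helper can serve layers that need different sampling factors.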
[0085] After feature fusion, the resulting pyramid features can be input into the first feature weighting module. This module increases the weight of important features and decreases the weight of less important features, effectively filtering features and improving model training efficiency.
[0086] Specifically, the first feature weighting module employs a channel attention mechanism. By inputting the pyramid features into this module, the weights for each channel are obtained, and the weighted detection pyramid features are output. Alternatively, the first feature weighting module can first input the top, middle, and bottom feature maps from the pyramid features into the channel attention mechanism module for processing, obtaining the weights for each channel. These weights are then fused with the three feature maps as they were before the attention processing, and finally, a 1x1 convolution operation is performed to obtain the top, middle, and bottom feature maps after feature filtering.
[0087] In an exemplary implementation, the feature extraction model includes a first feature extraction module, a first feature fusion module, and a first feature weighting module. The first feature weighting module, connected after the first feature fusion module, is used to weight the second feature maps at each level to obtain detection pyramid features. This includes: the first feature weighting module performing max pooling on the second feature maps at each level to obtain a first pooled feature map, and performing average pooling on the second feature maps at each level to obtain a second pooled feature map; the first feature weighting module combining the first pooled feature map and the second pooled feature map at each level to obtain a feature map to be screened at each level; and the first feature weighting module determining the weights of the feature maps to be screened using an activation function, and weighting the feature maps to be screened based on these weights to obtain detection pyramid features.
[0088] The first feature weighting module first performs max pooling and average pooling on the second feature maps of each level in the input pyramid feature set. Then, it combines the two feature maps from each level that have undergone different pooling processes. Subsequently, it uses the sigmoid activation function to determine the weights for each channel. Max pooling aims to extract the most relevant data from each channel, while average pooling aims to uniformly extract all data from the feature maps. The combination of these pooling methods helps to extract the most representative information from each channel while minimizing information loss.
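A minimal sketch of this pooling-and-sigmoid weighting for a single level. Several details are assumptions rather than statements from the text: pooling is global over the spatial dimensions, the two pooled results are combined by addition, no learned layers are included, and the resulting channel weights are applied to the input feature map. The name `channel_weighting` is hypothetical.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def channel_weighting(fmap: np.ndarray):
    """Weight each channel of a (C, H, W) map; returns (weighted map, weights)."""
    max_pooled = fmap.max(axis=(1, 2))     # strongest response per channel, shape (C,)
    avg_pooled = fmap.mean(axis=(1, 2))    # average response per channel, shape (C,)
    combined = max_pooled + avg_pooled     # assumed combination: element-wise sum
    weights = sigmoid(combined)            # one weight in (0, 1) per channel
    return fmap * weights[:, None, None], weights

# Channel 0 is silent, channel 1 responds uniformly with value 2.
fmap = np.stack([np.zeros((2, 2)), np.full((2, 2), 2.0)])
weighted, weights = channel_weighting(fmap)
print(weights[0])                 # 0.5
print(weights[1] > weights[0])    # True
```

The more active channel receives the larger weight, which is the filtering effect the module is intended to provide.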
[0089] For example, the feature extraction model in this embodiment may sequentially include a first feature extraction module, a first feature fusion module, a second feature fusion module, a third feature fusion module, and a first feature weighting module. The first feature weighting module is connected after the third feature fusion module and performs weighted processing on the pyramid features output by the third feature fusion module to obtain the final detection pyramid features.
[0090] In an exemplary embodiment, the feature extraction model may alternatively include, in sequence, a first feature extraction module, a first feature fusion module, a second feature fusion module, a first feature weighting module, a third feature fusion module, and a second feature weighting module.
[0091] The first feature weighting module receives the pyramid features output by the second feature fusion module, performs weighting processing on them, and then inputs the processed pyramid features into the third feature fusion module for fusion. The third feature fusion module can output the fused pyramid features as input to the second feature weighting module for further weighting processing.
[0092] The second feature weighting module uses a channel attention mechanism to weight the high-level and low-level feature maps in the pyramid feature set separately, and then fuses the weighted feature maps to enhance the model's feature representation. Specifically, the second feature weighting module can weight the fourth feature map of the top level, then weight the fourth feature map of the bottom level, and then fuse the weighted top-level and bottom-level features as the final detection pyramid feature.
[0093] The object detection module uses the detection pyramid features to detect objects and outputs the detection results, which indicate the objects to be detected within the input image. This module can be a YOLOv8 model, specifically the YOLO head. The YOLOv8 model is first trained. During training, the detection pyramid features are extracted from sample images using the steps described above. The YOLOv8 model then makes predictions from the detection pyramid features of the sample images, identifying the regions most likely to contain the objects to be detected, and these regions are compared with the actual bounding boxes of the objects. The loss function and its gradient are calculated, and the gradient is backpropagated to update the parameters of the network model. This process is repeated to continuously optimize the network model and reduce the loss value until a network model with the minimum loss and best performance is obtained. The network divides the input image into an n*n grid. Each grid cell has three pre-defined anchor boxes of different sizes. If the center of an object falls within a grid cell, that cell is responsible for that object. Each grid cell predicts three bounding boxes, each with five parameters: x-coordinate, y-coordinate, width, height, and center point confidence. Training iteratively calculates the loss value and backpropagates it while continuously adjusting the anchor boxes; the trained network model is then used to make predictions from the detection pyramid features of the input image to obtain the final detection result.
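The grid assignment rule (the cell containing an object's center is responsible for that object) can be sketched as follows. The function name, image size, and grid size are illustrative assumptions, and the sketch covers only the cell lookup and the prediction count, not the detection head itself.

```python
def responsible_cell(cx: float, cy: float, image_w: int, image_h: int, n: int):
    """Return (row, col) of the cell in an n*n grid that contains the point (cx, cy)."""
    col = min(int(cx / image_w * n), n - 1)   # clamp centers lying on the right edge
    row = min(int(cy / image_h * n), n - 1)   # clamp centers lying on the bottom edge
    return row, col

n = 8                        # assumed grid size
boxes_per_cell = 3           # three bounding boxes per grid cell
params_per_box = 5           # x, y, width, height, confidence

cell = responsible_cell(cx=100, cy=200, image_w=256, image_h=256, n=n)
total_boxes = n * n * boxes_per_cell
total_values = total_boxes * params_per_box
print(cell, total_boxes, total_values)  # (6, 3) 192 960
```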
[0094] Understandably, the aforementioned object detection module can be used as part of the feature extraction model, thereby enabling the feature extraction model to directly output detection results.
[0095] Furthermore, this embodiment also provides a target detection device, which can be used to perform the above-described target detection method. As shown in Figure 2, the target detection device 200 may specifically include: an image acquisition module 201, used to acquire an input image to be detected; an image input module 202, used to input the input image into a feature extraction model to extract features from the input image, the feature extraction model including a first feature extraction module 203, a first feature fusion module 204, and a first feature weighting module 205; the first feature extraction module 203 extracts pyramid features from the input image, the pyramid features including multiple levels of first feature maps; the first feature fusion module 204 fuses the first feature maps of each level in the pyramid features to obtain a second feature map of each level after fusion; the first feature weighting module 205 weights the second feature map of each level to obtain detection pyramid features; and a detection output module 206, used to perform target detection on the detection pyramid features to obtain a detection result.
[0096] In one exemplary embodiment, the first feature fusion module 204 is specifically used to fuse the first feature map of the target layer in the pyramid feature with the first feature map of the adjacent upper layer of the target layer to obtain the second feature map of the target layer; wherein, the target layer is the layer in the pyramid feature other than the top layer; the second feature map of the top layer of the pyramid feature is the same as the first feature map.
[0097] In one exemplary embodiment, the feature extraction model further includes a second feature fusion module, which fuses the second feature map of the target layer in the pyramid feature with the second feature maps of each upper layer above the target layer to obtain the third feature map of the target layer; wherein, the target layer is the layer in the pyramid feature excluding the top layer; the third feature map of the top layer of the pyramid feature is the same as the second feature map; the first feature weighting module weights the third feature maps of each layer to obtain the detection pyramid feature.
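A minimal sketch of this second fusion step is given below for illustration. The function names are hypothetical, and two details are assumptions not stated in the text: the maps of the upper layers are brought to the target layer's size by nearest-neighbor upsampling, and fusion is element-wise addition over maps with the same number of channels.

```python
import numpy as np


def resize_to(feature_map: np.ndarray, height: int, width: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, h, w) map to (C, height, width)."""
    fh, fw = height // feature_map.shape[1], width // feature_map.shape[2]
    return feature_map.repeat(fh, axis=1).repeat(fw, axis=2)


def second_fusion(second_maps: list) -> list:
    """Compute third feature maps from second feature maps.

    second_maps is ordered from the top layer (smallest) to the bottom layer (largest).
    """
    third_maps = [second_maps[0]]  # the top layer is passed through unchanged
    for i in range(1, len(second_maps)):
        target = second_maps[i]
        fused = target.copy()
        # Fuse the target layer with every layer above it, not only the adjacent one.
        for upper in second_maps[:i]:
            fused = fused + resize_to(upper, target.shape[1], target.shape[2])
        third_maps.append(fused)
    return third_maps
```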
[0098] In one exemplary embodiment, the feature extraction model further includes a third feature fusion module. The pyramid features are arranged from top to bottom as top layer, middle layer, and bottom layer. The third feature fusion module fuses the third feature map of the top layer with the third feature map of the bottom layer to obtain the fourth feature map of the bottom layer; and fuses the third feature map of the top layer with the third feature map of the middle layer to obtain the fourth feature map of the middle layer. The fourth feature map of the top layer of the pyramid features is the same as the third feature map. The first feature weighting module weights the fourth feature map of each layer to obtain the detection pyramid features.
[0099] In one exemplary embodiment, the first feature weighting module 205 is specifically configured to: perform max pooling on the second feature map of each level to obtain a first pooled feature map, and perform average pooling on the second feature map of each level to obtain a second pooled feature map; combine the first pooled feature map and the second pooled feature map of each level to obtain a feature map to be screened at each level; and determine the weight of the feature map to be screened by an activation function, and perform weighting on the feature map to be screened based on the weight to obtain the detection pyramid features.
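The pooling and weighting steps above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of this application: pooling is taken to be global pooling per channel, the two pooled results are combined by addition, the activation function is a sigmoid, and, following common channel-attention practice, the resulting per-channel weights are applied to the feature map of that level. The function name is hypothetical.

```python
import numpy as np


def channel_weighting(feature_map: np.ndarray) -> np.ndarray:
    """Weight a (C, H, W) feature map channel by channel."""
    max_pooled = feature_map.max(axis=(1, 2))    # first pooled feature, shape (C,)
    avg_pooled = feature_map.mean(axis=(1, 2))   # second pooled feature, shape (C,)
    combined = max_pooled + avg_pooled           # assumed combination: addition
    weights = 1.0 / (1.0 + np.exp(-combined))    # assumed activation: sigmoid
    return feature_map * weights[:, None, None]  # broadcast weights over H and W
```

Applying `channel_weighting` to the map of every level would yield one weighted map per level.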
[0100] In one exemplary embodiment, the first feature fusion module 204 is specifically used to: upsample the first feature map of the upper layer of the target layer in the pyramid feature to obtain the first feature map to be fused of the upper layer, wherein the first feature map to be fused has the same size as the first feature map of the target layer; fuse the first feature map to be fused with the first feature map of the target layer to obtain the second feature map of the target layer; wherein the target layer is the layer other than the uppermost layer of the pyramid feature.
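The upsample-then-fuse step above can be sketched as follows. The function names are hypothetical; nearest-neighbor upsampling by an integer factor and element-wise addition are assumptions, and both maps are assumed to have the same number of channels.

```python
import numpy as np


def upsample_nearest(feature_map: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return feature_map.repeat(factor, axis=1).repeat(factor, axis=2)


def fuse_with_upper(target: np.ndarray, upper: np.ndarray) -> np.ndarray:
    """Fuse a target-layer map with the smaller map of the layer above it."""
    factor = target.shape[1] // upper.shape[1]
    to_be_fused = upsample_nearest(upper, factor)
    if to_be_fused.shape != target.shape:
        raise ValueError("upsampled map must match the target layer's size")
    # Element-wise addition is an assumed fusion operator.
    return target + to_be_fused
```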
[0101] In one exemplary embodiment, a second feature weighting module is further included, which is used to weight the fourth feature map of the top layer and the fourth feature map of the bottom layer, and to fuse the weighted feature maps to obtain the detection pyramid features.
[0102] The specific details of each module or unit in the above-mentioned target detection device have been described in detail in the corresponding target detection method, so they will not be repeated here.
[0103] This application also provides an electronic device. Figure 3 shows a schematic diagram of the structure of an electronic device suitable for implementing embodiments of the present disclosure. The electronic device 600 shown in Figure 3 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments disclosed herein.
[0104] As shown in Figure 3, the electronic device 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for system operation. The CPU 601, ROM 602, and RAM 603 are interconnected via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
[0105] The following components are connected to the I/O interface 605: an input section 606 including a keyboard, mouse, etc.; an output section 607 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 608 including a hard disk, etc.; and a communication section 609 including a network interface card such as a LAN card, modem, etc. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, etc., is mounted on the drive 610 as needed so that computer programs read from it can be installed into the storage section 608 as needed.
[0106] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, it performs the functions defined in the embodiments of this application.
[0107] For example, when the computer program is executed by the central processing unit (CPU) 601, it can perform the following: acquire an input image to be detected; input the input image into a feature extraction model to extract features from the input image, the feature extraction model including a first feature extraction module, a first feature fusion module, and a first feature weighting module; the first feature extraction module extracts pyramid features from the input image, the pyramid features including multiple levels of first feature maps; the first feature fusion module fuses the first feature maps of each level in the pyramid features to obtain a fused second feature map of each level; the first feature weighting module weights the second feature map of each level to obtain detection pyramid features; and performs target detection on the detection pyramid features to obtain a detection result.
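The sequence of steps above can be summarized as a composition of modules. The sketch below is illustrative only: the class and function names are hypothetical, and each module is represented by a plain callable rather than a trained network.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

FeatureMaps = List[Any]


@dataclass
class FeatureExtractionModel:
    extract: Callable[[Any], FeatureMaps]         # first feature extraction module
    fuse: Callable[[FeatureMaps], FeatureMaps]    # first feature fusion module
    weight: Callable[[FeatureMaps], FeatureMaps]  # first feature weighting module

    def __call__(self, image: Any) -> FeatureMaps:
        first_maps = self.extract(image)     # pyramid features (first feature maps)
        second_maps = self.fuse(first_maps)  # fused second feature maps
        return self.weight(second_maps)      # detection pyramid features


def detect_targets(image: Any, model: FeatureExtractionModel,
                   detect: Callable[[FeatureMaps], Any]) -> Any:
    """Run feature extraction, then target detection on the detection pyramid features."""
    return detect(model(image))
```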
[0108] It should be noted that the computer-readable medium disclosed herein may be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.
[0109] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0110] The units described in the embodiments of this disclosure can be implemented in software or hardware, and the described units can also be located in a processor. The names of these units do not necessarily limit the unit itself.
[0111] In another aspect, this application also provides a computer-readable medium, which may be included in the electronic device described in the above embodiments; or it may exist independently and not assembled into the electronic device. The computer-readable medium carries one or more programs, which include instructions that, when executed by the electronic device, cause the electronic device to perform the methods described in the above embodiments.
[0112] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.
[0113] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A target detection method, characterized in that it comprises: obtaining an input image to be detected; inputting the input image into a feature extraction model to extract features from the input image, the feature extraction model including a first feature extraction module, a first feature fusion module, and a first feature weighting module; the first feature extraction module extracting pyramid features from the input image, wherein the pyramid features include multiple levels of first feature maps; the first feature fusion module fusing the first feature maps of each level in the pyramid features to obtain a fused second feature map of each level; the first feature weighting module weighting the second feature map of each level to obtain detection pyramid features; and performing target detection on the detection pyramid features to obtain a detection result; wherein fusing the first feature maps of each level in the pyramid features to obtain the fused second feature map of each level includes: fusing the first feature map of a target layer in the pyramid features with the first feature map of the adjacent upper layer of the target layer to obtain the second feature map of the target layer, wherein the target layer is a layer in the pyramid features other than the top layer, and the second feature map of the top layer of the pyramid features is the same as its first feature map; the feature extraction model further includes a second feature fusion module, which fuses the second feature map of the target layer in the pyramid features with the second feature maps of each upper layer above the target layer to obtain a third feature map of the target layer, wherein the target layer is a layer in the pyramid features other than the top layer, and the third feature map of the top layer of the pyramid features is the same as its second feature map; the first feature weighting module weights the third feature map of each level to obtain the detection pyramid features; the feature extraction model further includes a third feature fusion module, and the pyramid features are arranged from top to bottom as a top layer, a middle layer, and a bottom layer; the third feature fusion module fuses the third feature map of the top layer of the pyramid features with the third feature map of the bottom layer to obtain a fourth feature map of the bottom layer, and fuses the third feature map of the top layer of the pyramid features with the third feature map of the middle layer to obtain a fourth feature map of the middle layer; the fourth feature map of the top layer of the pyramid features is the same as its third feature map; and the first feature weighting module weights the fourth feature map of each level to obtain the detection pyramid features.
2. The target detection method according to claim 1, characterized in that the first feature weighting module weighting the second feature map of each level to obtain the detection pyramid features includes: the first feature weighting module performing max pooling on the second feature map of each level to obtain a first pooled feature map, and performing average pooling on the second feature map of each level to obtain a second pooled feature map; the first feature weighting module combining the first pooled feature map and the second pooled feature map of each level to obtain a feature map to be screened at each level; and the first feature weighting module determining the weight of the feature map to be screened through an activation function, and weighting the feature map to be screened based on the weight to obtain the detection pyramid features.
3. The target detection method according to claim 1, characterized in that fusing the first feature maps of each level in the pyramid features to obtain the fused second feature map of each level includes: upsampling the first feature map of the upper layer of the target layer in the pyramid features to obtain a first feature map to be fused of the upper layer, wherein the first feature map to be fused has the same size as the first feature map of the target layer; and fusing the first feature map to be fused with the first feature map of the target layer to obtain the second feature map of the target layer; wherein the target layer is a layer other than the topmost layer of the pyramid features.
4. The target detection method according to claim 1, characterized in that the feature extraction model further includes a second feature weighting module, which is used to weight the fourth feature map of the top layer and the fourth feature map of the bottom layer, and to fuse the weighted feature maps to obtain the detection pyramid features.
5. A target detection device, characterized in that it comprises: an image acquisition module, used to acquire an input image to be detected; an image input module, used to input the input image into a feature extraction model to extract features from the input image, the feature extraction model including a first feature extraction module, a first feature fusion module, and a first feature weighting module; the first feature extraction module extracts pyramid features from the input image, wherein the pyramid features include multiple levels of first feature maps; the first feature fusion module fuses the first feature maps of each level in the pyramid features to obtain a fused second feature map of each level; the first feature weighting module weights the second feature map of each level to obtain detection pyramid features; and a detection output module, used to perform target detection on the detection pyramid features to obtain a detection result; wherein the first feature fusion module is specifically used to fuse the first feature map of a target layer in the pyramid features with the first feature map of the adjacent upper layer of the target layer to obtain the second feature map of the target layer, wherein the target layer is a layer in the pyramid features other than the top layer, and the second feature map of the top layer of the pyramid features is the same as its first feature map; the feature extraction model further includes a second feature fusion module, which fuses the second feature map of the target layer in the pyramid features with the second feature maps of each upper layer above the target layer to obtain a third feature map of the target layer, wherein the target layer is a layer in the pyramid features other than the top layer, and the third feature map of the top layer of the pyramid features is the same as its second feature map; the first feature weighting module weights the third feature map of each level to obtain the detection pyramid features; the feature extraction model further includes a third feature fusion module, and the pyramid features are arranged from top to bottom as a top layer, a middle layer, and a bottom layer; the third feature fusion module fuses the third feature map of the top layer of the pyramid features with the third feature map of the bottom layer to obtain a fourth feature map of the bottom layer, and fuses the third feature map of the top layer of the pyramid features with the third feature map of the middle layer to obtain a fourth feature map of the middle layer; the fourth feature map of the top layer of the pyramid features is the same as its third feature map; and the first feature weighting module weights the fourth feature map of each level to obtain the detection pyramid features.
6. A computer-readable storage medium storing a computer program, which, when executed by a processor, causes the processor to perform the target detection method as described in any one of claims 1 to 4.
7. An electronic device, characterized in that the device includes a processor and a memory, the memory storing one or more computer programs, the one or more computer programs including instructions that, when executed by the electronic device, cause the electronic device to perform the target detection method according to any one of claims 1 to 4.