Target detection method and device based on target detection model

By using a target detection model-based approach, feature extraction units and similarity feature maps are used to determine the presence of targets, solving the problem of high data collection and annotation costs in existing technologies and achieving efficient and accurate target detection.

CN114462497B · Active · Publication Date: 2026-06-23 · ZHEJIANG DAHUA TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG DAHUA TECH CO LTD
Filing Date
2021-12-30
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing object detection algorithms require significant human resources for data collection and labeling, and the amount of data directly affects model performance, which is both time-consuming and labor-intensive.

Method used

A target detection model-based approach is adopted, in which a feature extraction unit obtains multi-layer feature maps of the target image and the image to be detected. Similarity feature maps are used to determine whether the image to be detected contains the target to be processed. Multiple pieces of detection information are fused to confirm the existence of the target, reducing the dependence on classification and detection units. The model is trained using an existing dataset.

Benefits of technology

It saves on manual data collection and storage costs, supports the detection of multiple target types, eliminates the need for additional dataset collection, reduces model deployment and maintenance costs, and improves the accuracy of detection results.

✦ Generated by Eureka AI based on patent content.

Figure CN114462497B_ABST

Abstract

The application discloses a target detection method and device based on a target detection model. The method comprises the following steps: inputting a target image into a feature extraction unit to obtain multi-layer feature maps of the target image; inputting an image to be detected into the feature extraction unit to obtain multi-layer feature maps of the image to be detected; and determining whether the image to be detected contains a target to be processed based on the similarity feature map between each layer of feature maps of the image to be detected and the corresponding layer of feature maps of the target image. The application can save the cost of manual data collection, labeling, and data storage.

Description

Technical Field

[0001] This application relates to the field of image detection technology, and in particular to a target detection method and apparatus based on a target detection model.

Background Technology

[0002] Object detection technology identifies objects of interest in images or videos. Unlike object classification, object detection addresses both classification and localization. As a fundamental problem in computer vision, object detection forms the basis for tasks such as instance segmentation, image annotation, and object tracking. Traditional object detection techniques primarily rely on manual feature extraction and generally involve three steps: selecting regions of interest, extracting features from regions that may contain objects, and classifying the extracted features.

[0003] Existing object detection algorithms are based on big data. For each type of object to be detected, a large dataset of that object type must first be collected for subsequent model training. The amount of data directly affects the performance of existing object detection models. Data collection and labeling is a time-consuming and labor-intensive process that consumes a lot of human resources.

Summary of the Invention

[0004] This application provides a target detection method and apparatus based on a target detection model, which can save on manual data collection, annotation, and data storage costs.

[0005] To achieve the above objectives, this application provides a target detection method based on a target detection model, wherein the target detection model includes a feature extraction unit with multi-layer output, and the method includes:

[0006] Acquire a target image and an image to be detected, wherein the target image contains the target to be processed;

[0007] The target image is input into the feature extraction unit to obtain a multi-layer feature map of the target image; the image to be detected is input into the feature extraction unit to obtain a multi-layer feature map of the image to be detected.

[0008] Based on the similarity feature maps of each layer of the image to be detected and the corresponding layer of the target image, it is determined whether the image to be detected contains the target to be processed.

[0009] Specifically, determining whether the target to be processed is contained in the image to be detected based on the similarity feature maps of each layer of the feature map of the image to be detected and the feature maps of the corresponding layer of the target image includes:

[0010] Based on the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image, multiple similarity feature maps are obtained;

[0011] Based on each of the similarity feature maps, the detection information of the target to be processed in the image to be detected is determined;

[0012] The obtained detection information is fused to determine whether the target to be processed is contained in the image to be detected.

[0013] The detection information for each similarity feature map includes the detection feature map for each similarity feature map;

[0014] The step of fusing multiple detection information to determine whether the target to be processed is contained in the image to be detected includes:

[0015] The detection feature maps of the multiple similarity feature maps are weighted and fused to obtain the final detection information of the image to be detected;

[0016] Based on the final detection information, it is determined whether the target to be processed is contained in the image to be detected.

[0017] The target detection model includes a classification unit and a detection unit, and the detection information of each similarity feature map further includes a classification feature map of that similarity feature map. The step of determining the detection information of the target to be processed in the image to be detected based on each similarity feature map includes:

[0018] Each similarity feature map is input into the classification unit to obtain a classification feature map for each similarity feature map; each similarity feature map is input into the detection unit to obtain a detection feature map for each similarity feature map;

[0019] The process of fusing the detection information from the multiple similarity feature maps includes:

[0020] The classification feature maps of the multiple similarity feature maps are weighted and fused to obtain the final classification feature map of the image to be detected; the detection feature maps of the multiple similarity feature maps are weighted and fused to obtain the final detection feature map of the image to be detected.
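The weighted fusion described above can be sketched as follows. This is a minimal illustration only: the function name `fuse_weighted`, the map shapes, the weight values, and the normalization of the weights to sum to 1 are all assumptions, since the text does not specify how the weights are chosen or whether they are normalized.

```python
import numpy as np

def fuse_weighted(maps, weights):
    """Fuse same-shaped per-layer maps into one map by a weighted sum."""
    stacked = np.stack(maps, axis=0).astype(float)   # (N, ...), one entry per layer
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # assumption: weights normalized to sum to 1
    return np.tensordot(w, stacked, axes=1)          # contract over the layer axis

# Hypothetical per-layer outputs for N = 3 similarity feature maps.
cls_maps = [np.full((1, 5, 5), v) for v in (0.2, 0.6, 0.8)]
det_maps = [np.full((4, 5, 5), v) for v in (1.0, 2.0, 3.0)]
layer_weights = [1.0, 1.0, 2.0]                      # illustrative values only

final_cls = fuse_weighted(cls_maps, layer_weights)   # shape (1, 5, 5)
final_det = fuse_weighted(det_maps, layer_weights)   # shape (4, 5, 5)
```

The same helper is applied to the classification feature maps and to the detection feature maps, which mirrors the two parallel fusion steps in the paragraph above.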

[0021] The method further includes:

[0022] If a target to be processed exists in the image to be detected, the location information of the target to be processed is output based on the final detection feature map, and the confidence information of the target to be processed is output based on the final classification feature map.
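One possible way to read out the location and confidence information is sketched below. The layout is hypothetical: it assumes the final classification feature map is an (H, W) grid of scores, the final detection feature map is a (4, H, W) grid of box parameters, and a fixed score threshold decides whether the target is present. None of these details is fixed by the text.

```python
import numpy as np

def decode(final_cls, final_det, score_threshold=0.5):
    """Return (found, box, confidence) from the fused maps.

    final_cls: (H, W) scores; final_det: (4, H, W) box parameters.
    Both layouts and the threshold are assumptions for illustration.
    """
    idx = np.unravel_index(np.argmax(final_cls), final_cls.shape)
    confidence = float(final_cls[idx])
    if confidence < score_threshold:
        return False, None, confidence               # no target in the image
    box = tuple(float(v) for v in final_det[:, idx[0], idx[1]])
    return True, box, confidence

# Hypothetical fused maps with one confident location at row 1, column 2.
cls_map = np.zeros((3, 3))
cls_map[1, 2] = 0.9
det_map = np.zeros((4, 3, 3))
det_map[:, 1, 2] = [10.0, 20.0, 30.0, 40.0]

found, box, confidence = decode(cls_map, det_map)
missing, no_box, low_conf = decode(np.zeros((3, 3)), det_map)
```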

[0023] After the step of determining whether the image to be detected contains the target to be processed, the method further includes:

[0024] Based on the determination result of whether the target to be processed is contained in the image to be detected, the parameters of the target detection model are optimized;

[0025] The image to be detected is subjected to size transformation and / or position transformation;

[0026] The transformed image to be detected is used as the image to be detected. The process of inputting the target image into the feature extraction unit to obtain the multi-layer feature map of the target image is then repeated. The image to be detected is then input into the feature extraction unit to obtain the multi-layer feature map of the image to be detected. The parameters of the target detection model are then optimized again using the transformed image to be detected and the target image until the target detection model converges.
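A position transformation of the image to be detected can be sketched as a simple translation with zero padding, with the annotated box shifted by the same offset. The function name `translate`, the (x, y, w, h) box convention, and the zero padding are assumptions; the text does not say how the size or position transformation is implemented, and a size transformation would need a resampling step that is not shown here.

```python
import numpy as np

def translate(image, box, dx, dy):
    """Shift an (H, W) image by (dx, dy) pixels with zero padding.

    box is a hypothetical (x, y, w, h) annotation that is shifted with the image.
    """
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    xs0, xs1 = max(0, -dx), min(w, w - dx)           # source columns that stay in frame
    ys0, ys1 = max(0, -dy), min(h, h - dy)           # source rows that stay in frame
    if xs0 < xs1 and ys0 < ys1:
        out[ys0 + dy:ys1 + dy, xs0 + dx:xs1 + dx] = image[ys0:ys1, xs0:xs1]
    x, y, bw, bh = box
    return out, (x + dx, y + dy, bw, bh)

image = np.zeros((4, 4))
image[1, 1] = 5.0
moved, moved_box = translate(image, (1, 1, 1, 1), dx=2, dy=1)
```

Each transformed image, together with the unchanged target image, would then serve as one more training pair for the optimization loop described above.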

[0027] The optimization of the parameters of the target detection model based on the determination result of whether the image to be detected contains the target to be processed includes:

[0028] Based on the determination result of whether the target to be processed is contained in the image to be detected, the classification loss and regression loss of the target detection model are calculated;

[0029] Calculate the total loss based on the classification loss and the regression loss;

[0030] The parameters of the target detection model are optimized based on the total loss.
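The loss computation can be illustrated with the sketch below. The text does not name the specific loss functions or how they are combined, so binary cross-entropy for classification, smooth L1 for regression, and a weighted sum with a hypothetical `reg_weight` are assumptions chosen only because they are common in detection models.

```python
import numpy as np

def binary_cross_entropy(p, y, eps=1e-7):
    """Mean binary cross-entropy between predicted scores p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def smooth_l1(pred, target):
    """Mean smooth L1 distance between predicted and reference box parameters."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def total_loss(cls_loss, reg_loss, reg_weight=1.0):
    """Total loss as a weighted sum (reg_weight is a hypothetical hyperparameter)."""
    return cls_loss + reg_weight * reg_loss

scores = np.array([0.9, 0.2])                        # predicted target confidence
labels = np.array([1.0, 0.0])                        # 1 = target present
pred_box = np.array([10.0, 20.0, 30.0, 42.0])
true_box = np.array([10.0, 20.0, 30.0, 40.0])

cls_loss = binary_cross_entropy(scores, labels)
reg_loss = smooth_l1(pred_box, true_box)
loss = total_loss(cls_loss, reg_loss)
```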

[0031] The feature extraction unit includes two feature extraction branches with identical structures, and the feature extraction branches have multiple outputs.

[0032] The step of inputting the target image into the feature extraction unit to obtain a multi-layer feature map of the target image includes:

[0033] The target image is input into a feature extraction branch of the feature extraction unit to obtain a multi-layer feature map of the target image;

[0034] The step of inputting the image to be detected into the feature extraction unit to obtain a multi-layer feature map of the image to be detected includes:

[0035] The image to be detected is input into another feature extraction branch of the feature extraction unit to obtain a multi-layer feature map of the image to be detected.

[0036] The target detection model includes convolutional units, and multiple similarity feature maps are obtained based on the feature maps of each layer of the image to be detected and the feature maps of the corresponding layer of the target image, including:

[0037] Using the convolutional unit, the feature maps of the corresponding layers of the image to be detected are convolved with the feature maps of each layer of the target image as the convolutional kernel to obtain multiple similarity feature maps.

[0038] Before the step of obtaining multiple similarity feature maps based on the feature maps of each layer of the image to be detected and the feature maps of the corresponding layer of the target image, the method further includes:

[0039] The resolution is normalized for the multi-layer feature map of the target image, and the resolution is normalized for the multi-layer feature map of the image to be detected.

[0040] The method obtains multiple similarity feature maps based on the feature maps of each layer of the image to be detected and the feature maps of the corresponding layer of the target image, including:

[0041] Multiple similarity feature maps are obtained based on the resolution-normalized feature maps of each layer of the target image and the resolution-normalized feature maps of the corresponding layers of the image to be detected.

[0042] After the resolution normalization processing is performed on the multi-layer feature map of the target image and on the multi-layer feature map of the image to be detected, the method further includes:

[0043] The multi-layer feature map of the target image, which has undergone resolution normalization, is subjected to dimensionality normalization. The multi-layer feature map of the image to be detected, which has undergone resolution normalization, is also subjected to dimensionality normalization.

[0044] The method obtains multiple similarity feature maps based on the feature maps of each layer of the image to be detected and the feature maps of the corresponding layer of the target image, including:

[0045] Multiple similarity feature maps are obtained based on the feature maps of each layer after dimension normalization of the target image and the feature maps of the corresponding layers after dimension normalization of the image to be detected.

[0046] To achieve the above objectives, this application also provides an electronic device including a processor; the processor is configured to execute instructions to implement the above methods.

[0047] To achieve the above objectives, this application also provides a computer-readable storage medium for storing instruction / program data that can be executed to implement the above methods.

[0048] In the target detection method based on the target detection model in this application, the feature extraction unit extracts features from the target image and the image to be detected to obtain their feature maps. Then, the similarity feature map between the feature map of the image to be detected and the feature map of the target image is calculated. That is, by calculating the similarity feature map, this application directly compares each region in the feature map of the image to be detected with the target image to confirm whether a target to be processed exists in each region of the image to be detected. Thus, the target detection model only needs to learn how to determine the similarity between each region in the feature map of the image to be detected and the feature map of the target image. It does not require the classification unit and detection unit in the target detection model to learn a large number of features of the target to be processed. In this case, the target detection model can be trained using existing datasets (public datasets, or datasets that have already been collected and labeled at an earlier stage), without the need to collect a large amount of data for the target to be processed. This saves on the costs of manual data collection, labeling, and data storage. Furthermore, the target detection model of this application can support the detection of various target types. When a new target type is added, there is no need to update the network again, thus reducing the deployment and maintenance costs of the model. In addition, this application obtains multi-layer feature maps of the target image and the image to be detected through a feature extraction unit. Based on the similarity feature map between each layer of the image to be detected and the corresponding layer of the target image, it determines whether the target to be processed is contained in the image to be detected. Thus, the target detection result is obtained based on the similarity feature maps of all levels. By fusing the detection information obtained from semantic information at different levels, the accuracy of the target detection result can be improved.

Description of the Drawings

[0049] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0050] Figure 1 is a flowchart illustrating one implementation of the target detection method based on the target detection model of this application;

[0051] Figure 2 is a schematic diagram of the detection process of the target detection method based on the target detection model in this application;

[0052] Figure 3 is a flowchart illustrating another implementation of the target detection method based on the target detection model of this application;

[0053] Figure 4 is a schematic diagram of the feature comparison network in the target detection method based on the target detection model in this application;

[0054] Figure 5 is a schematic diagram of the target detection model in the target detection method based on the target detection model in this application;

[0055] Figure 6 is a schematic diagram of the structure of one embodiment of the electronic device of this application;

[0056] Figure 7 is a schematic diagram of one embodiment of the computer-readable storage medium of this application.

Detailed Implementation

[0057] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application. In addition, unless otherwise specified (e.g., "or additionally" or "or in alternatives"), the term "or" as used herein refers to a non-exclusive "or" (i.e., "and / or"). Furthermore, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.

[0058] Specifically, as shown in Figure 1, the target detection method based on the target detection model of this application includes the following steps. It should be noted that the step numbers are used only for ease of description and are not intended to limit the execution order of the steps. The execution order of each step in this embodiment can be arbitrarily changed without departing from the technical concept of this application.

[0059] S101: Acquire the target image and the image to be detected.

[0060] The target image and the image to be detected are acquired so that their multi-layer feature maps can be obtained. Based on these multi-layer feature maps, multiple similarity feature maps between the image to be detected and the target image can then be obtained. Based on these similarity feature maps, it can be determined whether the image to be detected contains the target to be processed.

[0061] As shown in Figure 2, the target image contains the target to be processed. Preferably, the proportion of the target image occupied by the target is greater than a preset threshold, which helps ensure the accuracy of the final detection information (e.g., the location information of the target to be processed). Optionally, the specific value of the preset threshold can be set according to actual conditions and is not limited here; for example, it can be 95% or 90%.

[0062] The types of targets to be processed are not limited; for example, they can be household items, appliances, people, dogs, pigs, or cats.

[0063] In addition, there are no restrictions on how the target image and the image to be detected are acquired.

[0064] For example, a target image is obtained by photographing the target. As another example, during the training of the target detection model using the target detection model-based method of this application, the actual bounding box region of the target in the image to be detected can be used as the target image. As yet another example, during target tracking using the target detection model-based method of this application, the actual bounding box region of the target in a certain frame of the tracking video sequence can be used as the target image, and any other frame in the tracking video sequence can be used as the image to be detected.
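Taking the bounding box region of the target as the target image can be sketched as a simple array crop. The function name `crop_target` and the (x, y, w, h) box convention in pixel coordinates are assumptions made for illustration.

```python
import numpy as np

def crop_target(image, box):
    """Return the bounding box region of `image` as a separate target image.

    box is a hypothetical (x, y, w, h) annotation in pixel coordinates.
    """
    x, y, w, h = box
    return image[y:y + h, x:x + w].copy()

frame = np.arange(36).reshape(6, 6)                  # stand-in for one video frame
target_image = crop_target(frame, (1, 2, 3, 2))      # shape (2, 3)
```

Because the crop is taken tightly around the target, the target fills nearly the whole target image, which is in line with the preference stated above that the target occupy a large proportion of the target image.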

[0065] S102: Input the target image into a feature extraction unit with multi-layer output to obtain a multi-layer feature map of the target image.

[0066] S103: Input the image to be detected into the feature extraction unit to obtain a multi-layer feature map of the image to be detected.

[0067] The target image can be input into a feature extraction unit with multi-layer output to obtain a multi-layer feature map of the target image. The image to be detected can be input into the feature extraction unit to obtain a multi-layer feature map of the image to be detected. Then, based on the multi-layer feature map of the target image and the multi-layer feature map of the image to be detected, multiple similarity feature maps of the image to be detected and the target image can be obtained.

[0068] In one implementation, the feature extraction unit includes only one feature extraction branch with multi-layer output. This feature extraction branch is first used to extract features from one of the two images (the image to be detected or the target image) to obtain the multi-layer feature map of that image. The same feature extraction branch is then used to extract features from the other image to obtain its multi-layer feature map. In this way, the multi-layer feature maps of both the target image and the image to be detected are obtained.

[0069] In another implementation, the feature extraction unit is a Siamese network, which includes two feature extraction branches with identical structures and multi-layer outputs. These two feature extraction branches are used to extract features from the image to be detected and the target image, respectively, to obtain multi-layer feature maps of the image to be detected and the target image.

[0070] Preferably, the two feature extraction branches in the Siamese network can have the same parameters, meaning they share parameters. This reduces the number of parameters to be optimized during training and ensures that the two branches extract the same features from the same image, thereby improving the accuracy of the detection results determined based on the similarity feature map between the target image and the image to be detected. In other alternative embodiments, the two feature extraction branches in the Siamese network can have different parameters.

[0071] The structure of the feature extraction branch within the feature extraction unit is unrestricted. For example, it can be a ResNet, VGG, DenseNet, CNN, or similar structure.

[0072] It is understandable that the aforementioned feature extraction branch can have multiple output layers, thus obtaining features from different layers of the target image and the image to be detected. Each layer focuses on different semantic information; shallower features focus more on information such as image edges, while higher-layer features are more abstract and contain more semantic information. For example, the feature extraction branch may include multiple sequentially connected convolutional layers. If at least two convolutional layers in the feature extraction branch can output feature maps, then a multi-layer feature map of the image can be obtained through this feature extraction branch.
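A feature extraction branch with multi-layer output and shared parameters can be sketched as below. This is a toy illustration under stated assumptions: each "layer" is a single-channel 3*3 convolution without padding followed by ReLU, the kernels are random, and the names `conv2d_valid` and `extract_multilayer` are hypothetical. A real branch would be a ResNet, VGG, or similar network as described above.

```python
import numpy as np

def conv2d_valid(x, k):
    """Single-channel 2D sliding-window filter without padding (stride 1)."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def extract_multilayer(image, kernels):
    """Run sequential conv + ReLU layers and keep the output of every layer."""
    feats, x = [], image
    for k in kernels:
        x = np.maximum(conv2d_valid(x, k), 0.0)
        feats.append(x)
    return feats

rng = np.random.default_rng(0)
kernels = [rng.standard_normal((3, 3)) for _ in range(3)]   # shared by both branches

target_img = rng.random((15, 15))
search_img = rng.random((31, 31))                    # image to be detected
target_feats = extract_multilayer(target_img, kernels)
search_feats = extract_multilayer(search_img, kernels)
```

Passing the same `kernels` list to both calls corresponds to the shared-parameter Siamese arrangement, and both calls return the same number of layers, so the feature maps can be paired layer by layer.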

[0073] Furthermore, the feature map of the image to be detected can be larger than the feature map of the target image, so that the size of the target to be processed in the image to be detected can subsequently be determined based on the feature map of the target image. For example, the feature map of the image to be detected is 31*31 in size, and the feature map of the target image is 7*7.
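As a worked example of these sizes, suppose the 31*31 and 7*7 figures are read as feature map sizes and the target feature map is slid over the feature map of the image to be detected with stride 1 and no padding. Both the reading and the sliding scheme are assumptions, as the text does not fix them. The side length of the resulting similarity feature map would then be:

```latex
H_{\text{sim}} = H_{\text{det}} - H_{\text{tgt}} + 1, \qquad 31 - 7 + 1 = 25
```

Under these assumptions each of the 25*25 positions corresponds to one candidate placement of the target within the image to be detected.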

[0074] Furthermore, to facilitate a one-to-one correspondence between the multi-layer feature maps of the image to be detected and the multi-layer feature maps of the target image, the total number of layers in the multi-layer feature map of the image to be detected and the total number of layers in the multi-layer feature map of the target image can both be N.

[0075] S104: Based on the similarity feature maps of each layer feature map of the image to be detected and the corresponding layer feature map of the target image, determine whether the image to be detected contains the target to be processed.

[0076] After obtaining the multi-layer feature maps of the image to be detected and the target image based on the above steps, it is possible to determine whether the image to be detected contains the target to be processed based on the similarity feature maps of each layer of the feature map of the image to be detected and the corresponding layer of the feature map of the target image.

[0077] In this embodiment, the feature extraction unit extracts features from the target image and the image to be detected to obtain their feature maps. Then, the similarity feature map between the feature map of the image to be detected and the feature map of the target image is calculated. That is, by calculating the similarity feature map, this application directly compares each region in the feature map of the image to be detected with the target image to confirm whether there is a target to be processed in the image to be detected. Thus, the target detection model of this application only needs to learn how to determine the similarity between each region in the feature map of the image to be detected and the feature map of the target image. It is not necessary for the classification unit and detection unit in the target detection model to learn a large number of features of the target to be processed. In this case, the target detection model can be trained using existing datasets (public datasets, or datasets that have been collected and labeled at an earlier stage), without the need to collect a large amount of data for the target to be processed. This approach saves on manual data collection, annotation, and data storage costs. Furthermore, when adding new target types to an existing target detection model, this application eliminates the need to collect a large amount of training material for the new target types; only a few dozen or even a few target images of the new target types are required. Therefore, this application's target detection model can support the detection of various target types. When a new target type is added, there is no need to update the network again, thus reducing the deployment and maintenance costs of the model. In addition, this application obtains multi-layer feature maps of the target image and the image to be detected through a feature extraction unit. Based on the similarity feature map between each layer of the image to be detected and the corresponding layer of the target image, it determines whether the image to be detected contains the target to be processed. Thus, the target detection result is obtained based on the similarity feature maps of all levels. This fusion of detection information obtained from semantic information at different levels improves the accuracy of the target detection results.

[0078] Specifically, as shown in Figure 3, this application provides another embodiment of a target detection method based on a target detection model, which includes the following steps. It should be noted that the step numbers are used only for ease of description and are not intended to limit the execution order of the steps. The execution order of the steps in this embodiment can be arbitrarily changed without departing from the technical concept of this application.

[0079] S201: Acquire the target image and the image to be detected.

[0080] S202: Input the target image into a feature extraction unit with multi-layer output to obtain a multi-layer feature map of the target image.

[0081] S203: Input the image to be detected into the feature extraction unit to obtain a multi-layer feature map of the image to be detected.

[0082] S204: Based on the feature maps of each layer of the image to be detected and the feature maps of the corresponding layer of the target image, multiple similarity feature maps are obtained.

[0083] After the multi-layer feature maps of the image to be detected and the target image are obtained through the above steps, the similarity feature map between the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image can be calculated, as shown in Figure 4. In this way, multiple similarity feature maps between the image to be detected and the target image are obtained, so that the final detection information of the target to be processed in the image to be detected can be confirmed based on the multiple similarity feature maps.

[0084] It is understandable that, in step S204, one input of the similarity feature map calculation module (e.g., a convolutional unit in the feature comparison network shown in Figure 4) is the layer-a feature map of the image to be detected, and the other input is the layer-a feature map of the target image. That is, in step S204, the similarity feature map between the layer-a feature map of the image to be detected and the layer-a feature map of the target image is calculated. Thus, through S204, the similarity feature maps for a from 1 to N can be obtained, yielding multiple similarity feature maps between the image to be detected and the target image.

[0085] In one implementation, the feature map of each layer of the target image can be used as the convolution kernel to perform convolution processing on the feature map of the corresponding layer of the image to be detected (this can be ordinary convolution processing or depthwise convolution processing). This results in multiple similarity feature maps between the image to be detected and the target image. This method is flexible and effective, enabling the feature comparison process to focus on image information of different feature layers, thereby improving the performance of target detection.
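The use of the target feature map as a convolution kernel can be sketched as follows, for both the depthwise and the ordinary variant. The function names and the (C, H, W) layout are assumptions, and "convolution" is implemented here as sliding-window correlation without kernel flipping, stride 1, and no padding, which is the usual convention in neural network frameworks but is not stated in the text.

```python
import numpy as np

def xcorr_depthwise(search_feat, target_feat):
    """Per-channel correlation; returns (C, Hs - Ht + 1, Ws - Wt + 1)."""
    c, hs, ws = search_feat.shape
    _, ht, wt = target_feat.shape
    out = np.empty((c, hs - ht + 1, ws - wt + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search_feat[:, i:i + ht, j:j + wt]
            out[:, i, j] = np.sum(window * target_feat, axis=(1, 2))
    return out

def xcorr_ordinary(search_feat, target_feat):
    """Ordinary correlation: channel responses summed into one (H, W) map."""
    return xcorr_depthwise(search_feat, target_feat).sum(axis=0)

# Toy feature maps: the target pattern is planted at row 1, column 3.
search = np.zeros((2, 5, 5))
target = np.ones((2, 2, 2))
search[:, 1:3, 3:5] = 1.0

sim = xcorr_ordinary(search, target)                 # shape (4, 4)
sim_dw = xcorr_depthwise(search, target)             # shape (2, 4, 4)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(sim), sim.shape))
```

The depthwise variant keeps one response map per channel, which leaves the later classification and detection units free to weigh channels differently, while the ordinary variant collapses all channels into a single response map.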

[0086] Specifically, the feature maps of each layer of the target image and of the corresponding layer of the image to be detected can be input into a feature comparison network. The network then uses the corresponding convolutional units to calculate a similarity feature map between the feature map of each layer of the target image and that of the corresponding layer of the image to be detected. For example, as shown in Figure 4, the multi-layer feature maps of the target image, the multi-layer feature maps of the image to be detected, and the multiple convolutional units in the feature comparison network correspond one-to-one. Thus, in step S204, each layer feature map of the target image and the corresponding layer feature map of the image to be detected can be input into the corresponding convolutional unit in the feature comparison network, so that each convolutional unit calculates the similarity feature map between the pair of feature maps it receives.

[0087] Preferably, all convolutional units in the feature comparison network have the same parameters. Of course, in other alternative embodiments, the parameters of the convolutional units in the feature comparison network may differ from one another.

[0088] In another implementation, cosine similarity, Euclidean distance, KL divergence and other calculation methods can be used to calculate the similarity between each region in the feature map of the image to be detected and the feature map of the corresponding layer of the target image, thus obtaining multiple similarity feature maps between the image to be detected and the target image.
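As an illustration of the cosine similarity option, the sketch below compares every window of the feature map of the image to be detected with the target feature map. The function name, the (C, H, W) layout, the stride of 1, and the small `eps` term that guards against division by zero are assumptions.

```python
import numpy as np

def cosine_similarity_map(search_feat, target_feat, eps=1e-8):
    """Cosine similarity between each window of search_feat and target_feat."""
    _, hs, ws = search_feat.shape
    _, ht, wt = target_feat.shape
    t = target_feat.ravel()
    t_norm = np.linalg.norm(t)
    out = np.empty((hs - ht + 1, ws - wt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            r = search_feat[:, i:i + ht, j:j + wt].ravel()
            out[i, j] = (r @ t) / (np.linalg.norm(r) * t_norm + eps)
    return out

rng = np.random.default_rng(0)
target = rng.random((2, 2, 2)) + 0.1
search = rng.random((2, 5, 5))
search[:, 2:4, 1:3] = 3.0 * target                   # a scaled copy of the target

sim = cosine_similarity_map(search, target)          # shape (4, 4)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(sim), sim.shape))
```

Unlike a raw correlation score, the cosine score is insensitive to the overall magnitude of the features, so the scaled copy planted in the example still scores close to 1 at its location.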

[0089] Before step S204, the resolution of the multi-layer feature maps of the target image can be normalized to ensure that the resolutions of the multi-layer feature maps are the same, thus facilitating the calculation of the similarity feature map in step S204. Specifically, the resolution of the multi-layer feature maps of the target image can be normalized by upsampling or downsampling them. For example, if the target image includes three feature maps with resolutions of 15*15, 7*7, and 3*3 respectively, the 3*3 feature map can be upsampled to 7*7, or the 15*15 feature map can be downsampled to 7*7, resulting in all three feature maps of the target image having a resolution of 7*7. Correspondingly, before step S204, the resolution of the multi-layer feature maps of the image to be detected can also be normalized to ensure that the resolutions of the multi-layer feature maps are the same, thus facilitating the calculation of the similarity feature map in step S204.
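The resolution normalization in the 15*15, 7*7, and 3*3 example can be sketched with nearest-neighbour sampling, which is one simple form of interpolation. The function name `resize_nearest`, the (C, H, W) layout, and the choice of nearest-neighbour sampling are assumptions; the text leaves the upsampling and downsampling method open.

```python
import numpy as np

def resize_nearest(feat, size):
    """Resample a (C, H, W) feature map to (C, size, size) by nearest neighbour."""
    _, h, w = feat.shape
    rows = np.minimum((np.arange(size) + 0.5) * h / size, h - 1).astype(int)
    cols = np.minimum((np.arange(size) + 0.5) * w / size, w - 1).astype(int)
    return feat[:, rows][:, :, cols]

# Three hypothetical target-image feature maps with different resolutions.
layers = [np.zeros((4, 15, 15)), np.zeros((4, 7, 7)), np.zeros((4, 3, 3))]
normalized = [resize_nearest(f, 7) for f in layers]  # every map becomes (4, 7, 7)

small = np.arange(9.0).reshape(1, 3, 3)
upsampled = resize_nearest(small, 7)                 # 3*3 upsampled to 7*7
```

The same helper handles both directions: the 15*15 map is downsampled and the 3*3 map is upsampled to the common 7*7 resolution.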

[0090] It is understood that the specific implementation of the above upsampling is not limited; for example, it can be implemented through convolution, or it can be implemented through interpolation. Similarly, the specific implementation of the above downsampling is also not limited; for example, it can be implemented through convolution, or it can also be implemented through interpolation.
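
To make the 15*15 / 7*7 / 3*3 example concrete, the following sketch normalizes three feature maps to 7*7 using nearest-neighbor interpolation. Nearest-neighbor sampling is only one of the options the text permits (convolution or other interpolation schemes would work too), and the helper name `resize_nearest` and the synthetic map values are assumptions.

```python
def resize_nearest(fmap, out_h, out_w):
    """Resize a 2-D feature map by nearest-neighbor sampling (up or down)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [
            fmap[min(in_h - 1, i * in_h // out_h)][min(in_w - 1, j * in_w // out_w)]
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Synthetic single-channel maps with resolutions 15x15, 7x7 and 3x3.
maps = [
    [[float(r * size + c) for c in range(size)] for r in range(size)]
    for size in (15, 7, 3)
]

# Downsample the 15x15 map and upsample the 3x3 map so that all are 7x7.
normalized = [resize_nearest(m, 7, 7) for m in maps]
```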

[0091] If the multi-layer feature maps of the target image and the multi-layer feature maps of the image to be detected have been normalized before step S204, in step S204, the similarity feature maps of each layer of the feature map of the image to be detected and the corresponding layer of the feature map of the target image can be calculated based on the resolution-normalized multi-layer feature maps of the target image and the resolution-normalized multi-layer feature maps of the image to be detected, thereby obtaining multiple similarity feature maps between the image to be detected and the target image. For example, based on the three-layer feature map of the target image obtained in step S202 and the three-layer feature map of the image to be detected obtained in step S203, resolution normalization processing can be performed on the three-layer feature map of the target image and the three-layer feature map of the image to be detected. In step S204, the similarity feature map between the first-layer feature map of the target image after resolution normalization processing and the first-layer feature map of the image to be detected after resolution normalization processing can be calculated. The similarity feature map between the second-layer feature map of the target image after resolution normalization processing and the second-layer feature map of the image to be detected after resolution normalization processing can be calculated. The similarity feature map between the third-layer feature map of the target image after resolution normalization processing and the third-layer feature map of the image to be detected after resolution normalization processing can be calculated. In this way, three similarity feature maps of the target image and the image to be detected can be obtained.

[0092] Furthermore, prior to step S204, dimension normalization can be performed on the multi-layer feature maps of the target image so that their feature dimensions are completely consistent, which facilitates the subsequent calculation of similarity feature maps. Specifically, the dimensions of the multi-layer feature maps of the target image can be normalized by performing convolution, normalization, and / or activation processing on them. For example, assuming that the dimensions of the three-layer feature maps of the target image are 7*7*8, 7*7*16, and 7*7*32 after resolution normalization, convolution, normalization, and activation processing can be performed sequentially on these resolution-normalized feature maps so that each of the three feature maps has dimensions of 7*7*16.

[0093] Accordingly, before step S204, dimension normalization can also be performed on the multi-layer feature maps of the image to be detected so that their feature dimensions are completely consistent, which facilitates the subsequent calculation of similarity feature maps. Specifically, the dimensions of the multi-layer feature maps of the image to be detected can be normalized by performing convolution processing, normalization processing, and / or activation processing on them.

[0094] The normalization method described above can be Batch Normalization (BN). The activation method described above can be softmax or ReLU.
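
A minimal sketch of the convolution, normalization, and activation sequence is given below for the 7*7*8, 7*7*16, and 7*7*32 example. It assumes a 1x1 convolution with random (untrained) weights, a single-image stand-in for Batch Normalization that normalizes each channel over its spatial positions, and ReLU as the activation; all function names are hypothetical.

```python
import math
import random


def conv1x1(fmap, weights):
    """fmap: H x W x C_in nested lists; weights: C_out x C_in. Returns H x W x C_out."""
    return [
        [[sum(w * x for w, x in zip(w_row, px)) for w_row in weights] for px in row]
        for row in fmap
    ]


def batch_norm(fmap, eps=1e-5):
    """Normalize each channel over all spatial positions (simplified BN, no learned scale/shift)."""
    h, w, c = len(fmap), len(fmap[0]), len(fmap[0][0])
    n = h * w
    means = [sum(fmap[i][j][k] for i in range(h) for j in range(w)) / n for k in range(c)]
    variances = [
        sum((fmap[i][j][k] - means[k]) ** 2 for i in range(h) for j in range(w)) / n
        for k in range(c)
    ]
    return [
        [[(px[k] - means[k]) / math.sqrt(variances[k] + eps) for k in range(c)] for px in row]
        for row in fmap
    ]


def relu(fmap):
    return [[[max(0.0, v) for v in px] for px in row] for row in fmap]


def normalize_channels(fmap, out_channels, rng):
    """Map any channel count to `out_channels` via conv -> BN -> ReLU."""
    c_in = len(fmap[0][0])
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(c_in)] for _ in range(out_channels)]
    return relu(batch_norm(conv1x1(fmap, weights)))


rng = random.Random(0)
layers = [
    [[[rng.random() for _ in range(c)] for _ in range(7)] for _ in range(7)]
    for c in (8, 16, 32)
]
unified = [normalize_channels(m, 16, rng) for m in layers]
shapes = [(len(m), len(m[0]), len(m[0][0])) for m in unified]
```

After this step every layer has the shape 7*7*16, so the same similarity computation can be applied to each layer.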

[0095] If resolution normalization and dimension normalization processing have been performed on the multi-layer feature maps of the target image and the multi-layer feature maps of the image to be detected before step S204, then in step S204, based on the three-layer feature maps of the target image after dimension normalization and the three-layer feature maps of the image to be detected after dimension normalization, the similarity feature maps of each layer feature map of the image to be detected and the corresponding layer feature maps of the target image can be calculated, thereby obtaining multiple similarity feature maps between the image to be detected and the target image. For example, based on step S202, a three-layer feature map of the target image is obtained, and then resolution normalization and dimension normalization are performed on the three-layer feature map of the target image. Based on step S203, a three-layer feature map of the image to be detected is obtained, and resolution normalization and dimension normalization are performed on the three-layer feature map of the image to be detected. In step S204, the similarity feature map between the first-layer feature map of the target image after dimension normalization and the first-layer feature map of the image to be detected after dimension normalization can be calculated. The similarity feature map between the second-layer feature map of the target image after dimension normalization and the second-layer feature map of the image to be detected after dimension normalization can be calculated. The similarity feature map between the third-layer feature map of the target image after dimension normalization and the third-layer feature map of the image to be detected after dimension normalization can be calculated. Thus, three similarity feature maps of the target image and the image to be detected can be obtained.

[0096] S205: Determine the detection information of the target to be processed in the image to be detected based on each similarity feature map.

[0097] After obtaining multiple similarity feature maps between the image to be detected and the target image based on the above steps, the detection information of the target to be processed in the image to be detected can be confirmed based on each similarity feature map.

[0098] In one implementation, the similarity feature map can be input into the detection unit, which processes it to obtain a detection feature map of size (6*k*W_R*W_R). The detection feature map is used to determine whether there is a target to be processed within the k anchor boxes of each output position, as well as the location information of the target to be processed (i.e., the detection box coordinates of the target to be processed).

[0099] In this implementation, each output location in the detection feature map output by the detection unit contains 6*k data points: k data points represent the confidence that the target to be processed is present within each of the k anchor boxes of that output location; another k data points represent the confidence that the target to be processed is not present within each of those k anchor boxes; and the remaining 4*k data points represent the coordinate offsets of the k anchor boxes. Based on the detection feature map, it can therefore be confirmed whether the target to be processed is present within the k anchor boxes of each output location, and the location information of the target to be processed can be determined. Specifically, if the maximum confidence that the target to be processed is present among the k anchor boxes of an output location exceeds a threshold, the anchor box with the highest confidence is used as the detection box of the target to be processed; this yields the coordinate information of the target and confirms that the target is present within that anchor box. In another alternative embodiment, the anchor box with the highest confidence of the target to be processed in the classification feature map can be used as the detection box of the target to be processed.
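
The decoding rule for a single output location can be sketched as follows. The ordering of the 6*k values (presence confidences first, then absence confidences, then the offsets grouped per anchor box), the threshold value, and the function name are assumptions; the text only fixes how many values of each kind exist.

```python
def decode_location(values, k, threshold=0.5):
    """Decode the 6*k values stored at one output location.

    Assumed layout (hypothetical):
    [k presence confidences | k absence confidences | 4*k coordinate offsets].
    Returns the best anchor box if its presence confidence exceeds the threshold.
    """
    if len(values) != 6 * k:
        raise ValueError("expected 6*k values")
    present = values[:k]
    offsets = [values[2 * k + 4 * a: 2 * k + 4 * a + 4] for a in range(k)]
    best = max(range(k), key=lambda a: present[a])
    if present[best] > threshold:
        return {"anchor": best, "confidence": present[best], "offsets": offsets[best]}
    return None


# Hypothetical output location with k = 2 anchor boxes.
k = 2
values = [0.2, 0.9, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0, 0.1, -0.2, 0.3, 0.0]
hit = decode_location(values, k)
miss = decode_location(values, k, threshold=0.95)
```

Here `hit` selects the second anchor box with its four offsets, while `miss` is `None` because no presence confidence exceeds the stricter threshold.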

[0100] In this implementation, the detection information of the target to be processed may include the classification feature map output by the classification unit. Of course, in other alternative embodiments, the detection information of the target to be processed may include the location information of the target to be processed in the image to be detected and the confidence level that the target to be processed exists in the image to be detected.

[0101] In another implementation, the similarity feature map can be input into the classification unit, which processes it to obtain a classification feature map of size (2*k*W_R*W_R). The classification feature map output by the classification unit can be used to confirm whether there is a target to be processed within the k anchor boxes at each output location. The similarity feature map can also be input into the detection unit for processing, resulting in a detection feature map of size (4*k*W_R*W_R). The detection feature map output by the detection unit can be used to confirm the coordinate information of the anchor box containing the target to be processed (e.g., the coordinates of the top left and bottom right corners of the anchor box).

[0102] In this implementation, the detection information of the target to be processed may include the classification feature map output by the classification unit and the detection feature map output by the detection unit. Of course, in other alternative embodiments, the detection information of the target to be processed may include the location information of the target in the image to be detected and the confidence level of the presence of the target in the image to be detected.

[0103] Specifically, in this implementation, the similarity feature map can be input into the classification unit and the detection unit of the feature matching network; the classification unit processes the similarity feature map to obtain a classification feature map of size (2*k*W_R*W_R), and the detection unit processes the similarity feature map to obtain a detection feature map of size (4*k*W_R*W_R). For example, as shown in Figure 4, the convolutional units, classification units, and detection units in the feature matching network correspond one-to-one. Thus, in step S205, the similarity feature map output by each convolutional unit can be input to the corresponding classification unit and detection unit, so that each classification unit and each detection unit in the feature matching network processes the similarity feature map output by its corresponding convolutional unit.

[0104] Preferably, the parameters of all classification units in the feature matching network are consistent, and / or the parameters of all detection units in the feature matching network are consistent. Of course, in other alternative embodiments, the parameters of all classification units in the feature matching network may be different, and the parameters of all detection units in the feature matching network may also be different.

[0105] The structure of the classification unit is unrestricted and can be set according to the actual situation. The classification unit may include a convolutional layer, a normalization layer, and / or an activation layer. Specifically, the classification unit may include a convolutional layer, a normalization layer, and an activation layer connected in sequence.

[0106] The structure of the detection unit is unrestricted and can be configured according to actual conditions. The detection unit may include a convolutional layer, a normalization layer, and / or an activation layer. Specifically, the detection unit may include a convolutional layer, a normalization layer, and an activation layer connected in sequence.

[0107] The 'k' mentioned above represents the number of anchor boxes at each output position, which can be set according to the range of values for the anchor box's scale and ratio. For example, if the scale is [4,6,8,10,12] and the ratio is [0.3,0.6,2,3], then the value of k is 5*4=20.
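
The arithmetic in this example can be checked with a short sketch that enumerates every scale and ratio combination. The variable names are illustrative, and the derived channel counts simply apply the 2*k, 4*k, and 6*k factors mentioned in the implementations above.

```python
from itertools import product

scales = [4, 6, 8, 10, 12]
ratios = [0.3, 0.6, 2, 3]

# One anchor box per (scale, ratio) combination at each output position.
anchors = list(product(scales, ratios))
k = len(anchors)

# Per-location value counts implied by the text for this k.
classification_values = 2 * k  # present / not-present confidence per anchor box
detection_values = 4 * k       # four coordinate values per anchor box
combined_values = 6 * k        # single-unit variant that outputs both
```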

[0108] S206: Fuse the obtained multiple detection information to determine whether the image to be detected contains the target to be processed.

[0109] After obtaining the detection information of the similarity feature map based on step S205, the detection information of multiple similarity feature maps can be fused to obtain the final detection information of the target to be processed in the image to be detected, thereby determining whether the image to be detected contains the target to be processed.

[0110] In one implementation, when the detection information includes feature maps, the feature maps of multiple similarity feature maps can be fused to obtain the final detection information. Based on the final detection information, the confidence information of the target to be processed in the image to be detected (e.g., the confidence that the target to be processed exists in the image to be detected) and the location information of the target to be processed can be confirmed. Furthermore, fusing the detection information at the feature level in this way combines detection features of different semantic levels, which yields more accurate detection results.

[0111] For example, assuming the detection information includes the classification feature map output by the classification unit and the detection feature map output by the detection unit, the classification feature maps output by the classification units corresponding to multiple similarity feature maps can be weighted and fused to obtain the final classification feature map, and the detection feature maps output by the detection units corresponding to multiple similarity feature maps can be weighted and fused to obtain the final detection feature map. Thus, based on the final detection feature map, the location information of the target detection box in the image to be detected can be known, and based on the final classification feature map, the confidence information of the target to be processed can be known (e.g., the confidence that the target to be processed is in the target detection box of the image to be detected).
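
Weighted fusion of per-layer feature maps can be sketched as an element-wise weighted average. The specific weights, the normalization by the sum of the weights, and the function name `weighted_fuse` are assumptions; the text does not state how the weights are chosen.

```python
def weighted_fuse(maps, weights):
    """Element-wise weighted average of equally sized 2-D maps."""
    if not maps or len(maps) != len(weights):
        raise ValueError("need one weight per map")
    total = float(sum(weights))
    h, w = len(maps[0]), len(maps[0][0])
    return [
        [sum(wt * m[i][j] for wt, m in zip(weights, maps)) / total for j in range(w)]
        for i in range(h)
    ]


# Hypothetical classification maps from a shallow, a middle and a deep layer.
layer_maps = [[[0.2, 0.8]], [[0.4, 0.6]], [[0.6, 0.4]]]
layer_weights = [1.0, 1.0, 2.0]  # assumed: the deep layer counts twice
final_map = weighted_fuse(layer_maps, layer_weights)
```

The same helper can be applied once to the classification feature maps and once to the detection feature maps to obtain the two final maps.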

[0112] In another implementation, when the detection information includes the location information of the target to be processed in the image to be detected and the confidence level of the presence of the target to be processed in the image to be detected, the target location information (i.e., target detection box information) corresponding to multiple similarity feature maps can be weighted and fused to obtain the final detection box information of the target to be processed; alternatively, the classification confidence levels (including the confidence level of the target to be processed and the confidence level of the absence of the target to be processed) corresponding to multiple similarity feature maps can be weighted and fused to obtain the final classification confidence level of the image to be detected.

[0113] In this embodiment, the feature extraction unit is used to extract features from the target image and the image to be detected to obtain their feature maps. Then, a similarity feature map is calculated between the feature map of the image to be detected and the feature map of the target image. That is, by calculating the similarity feature map, this application directly compares each region in the feature map of the image to be detected with the target image to confirm whether there is a target in each region of the image to be detected. Thus, the target detection model of this application only needs to learn how to determine the similarity between each region in the feature map of the image to be detected and the feature map of the target image; the classification unit and detection unit in the target detection model do not need to learn a large number of target features. The target detection model can therefore be trained using existing datasets (public datasets, or datasets that were collected and labeled earlier), without collecting a large dataset of the targets to be detected, which saves the cost of manual data collection, labeling, and data storage. Moreover, when a new target type is added to an existing target detection model, there is no need to collect a large amount of training material for it; only a few dozen, or even just a few, target images of the new type are required. As a result, the target detection model of this application can support the detection of various target types, and when a new target type is added there is no need to update the network again, which reduces the deployment and maintenance costs of the model.
In addition, this application obtains multi-layer feature maps of the target image and the image to be detected through a feature extraction unit, and calculates the similarity feature map between each layer of the feature map of the image to be detected and the corresponding layer of the feature map of the target image. This allows the detection information of each layer (e.g., shallow, medium, or deep) to be obtained. Finally, the detection information of all layers is fused to obtain the final detection information of the target in the image to be detected. By fusing the detection information obtained from semantic information of different layers, the accuracy of the target detection results can be improved.

[0114] Optionally, the appearance of a target may differ significantly between viewing angles; for example, as shown in Figure 2, the top view and the front view of a spray bottle differ significantly. Therefore, the target detection method based on the target detection model described in this application can perform target detection on the image to be detected based on multiple target images taken from different angles. Specifically, during target detection, multiple similarity feature maps are calculated between the image to be detected and each of the multiple target images from different angles. Then, a detection result is determined based on each similarity feature map, and the detection results of all similarity feature maps are fused to obtain the final detection result of the target in the image to be detected.

[0115] It can be understood that this application implements the above object detection method through the object detection model shown in Figure 5. To reduce the number of object images from different angles required during object detection, the object detection model can be trained to adapt to changes in object size.

[0116] Specifically, during the iterative training of the object detection model, the size and position of the target image can be fixed, while the image to be detected is scaled and / or positionally transformed. This enables the object detection model to learn the ability to adapt to target size transformations, while also reducing the number of target images from different angles during inference and reducing the inference time of the object detection model.

[0117] For example, the implementation steps of the specific training method of the above scheme may include:

[0118] Step 1: Acquire the target image and the image to be detected;

[0119] Step 2: Input the target image into the feature extraction unit to obtain a multi-layer feature map of the target image; input the image to be detected into the feature extraction unit to obtain a multi-layer feature map of the image to be detected;

[0120] Step 3: Calculate the similarity feature map between the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image, to obtain multiple similarity feature maps between the image to be detected and the target image;

[0121] Step 4: Confirm the detection information of the target to be processed in the image to be detected based on each of the similarity feature maps;

[0122] Step 5: Fuse the detection information of the multiple similarity feature maps to obtain the final detection information of the target to be processed in the image to be detected;

[0123] Step 6: Optimize the parameters of the target detection model based on the final detection information;

[0124] Step 7: Perform size transformation and / or position transformation on the image to be detected;

[0125] Step 8: Using the transformed image to be detected as the image to be detected, return to step 2 to optimize the parameters of the target detection model again using the transformed image to be detected and the target image, until the target detection model converges.

[0126] Optionally, the scaling transformation described above is not limited, and can be, for example, random scaling or fixed-size scaling. The position transformation described above is also not limited, and can be, for example, random rotation, horizontal flip transformation or vertical flip transformation.
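
The transformations named here can be sketched on a small image stored as a nested list. The nearest-neighbor scaling, the candidate scale factors, and all function names are assumptions chosen for brevity; rotation is omitted, and in practice the annotated detection boxes would need to be transformed consistently with the image.

```python
import random


def hflip(img):
    """Horizontal flip: reverse every row."""
    return [row[::-1] for row in img]


def vflip(img):
    """Vertical flip: reverse the order of the rows."""
    return img[::-1]


def scale_nearest(img, factor):
    """Scale a 2-D image by `factor` using nearest-neighbor sampling."""
    in_h, in_w = len(img), len(img[0])
    out_h = max(1, round(in_h * factor))
    out_w = max(1, round(in_w * factor))
    return [
        [img[min(in_h - 1, int(i / factor))][min(in_w - 1, int(j / factor))]
         for j in range(out_w)]
        for i in range(out_h)
    ]


def random_transform(img, rng):
    """Apply one randomly chosen size or position transformation."""
    ops = [hflip, vflip, lambda im: scale_nearest(im, rng.choice([0.5, 2.0]))]
    return rng.choice(ops)(img)


image = [[1, 2], [3, 4]]
augmented = random_transform(image, random.Random(0))
```

During training, only the image to be detected would be passed through `random_transform`, while the target image keeps its size and position, as described in paragraph [0116].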

[0127] In addition, during the training of the object detection model, the classification loss and regression loss of the object detection model can be calculated. Then, the total loss of the object detection model can be calculated using the classification loss and regression loss. Finally, the parameters of the object detection model can be optimized using the total loss to achieve the purpose of training the object detection model.

[0128] Alternatively, as shown in the following formula, the classification loss and regression loss can be weighted to obtain the total loss $\text{total}_{loss}$ of the object detection model:

[0129] $$\text{total}_{loss} = \alpha \cdot \text{cls}_{loss} + \beta \cdot \text{reg}_{loss}$$

[0130] Here, $\text{cls}_{loss}$ is the classification loss, $\alpha$ is the weighting coefficient of the classification loss, $\text{reg}_{loss}$ is the regression loss, and $\beta$ is the weighting coefficient of the regression loss. The weighting coefficients of the classification loss and regression loss can be set according to the actual situation and are not limited here.

[0131] The classification loss can be binary cross-entropy loss (softmax loss). The regression loss can be MAE(L1) loss, smooth L1 loss, MSE(L2) loss, etc.
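
As a worked illustration of the weighted total loss, the sketch below combines a binary cross-entropy classification loss with a smooth L1 regression loss. The choice of these two losses from the listed options, the transition point `delta`, the coefficient values, and the sample numbers are all assumptions.

```python
import math


def binary_cross_entropy(p, y, eps=1e-12):
    """Cross-entropy for a predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def smooth_l1(pred, target, delta=1.0):
    """Mean smooth L1 loss: quadratic below `delta`, linear above it."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d / delta if d < delta else d - 0.5 * delta
    return total / len(pred)


def total_loss(cls_loss, reg_loss, alpha=1.0, beta=1.0):
    """total_loss = alpha * cls_loss + beta * reg_loss."""
    return alpha * cls_loss + beta * reg_loss


# Hypothetical predictions for one anchor box.
cls = binary_cross_entropy(0.5, 1)        # confidence 0.5, target present
reg = smooth_l1([0.5, 2.0], [0.0, 0.0])   # two coordinate offsets
loss = total_loss(cls, reg, alpha=2.0, beta=0.5)
```

The resulting scalar is what would be backpropagated to update the model parameters.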

[0132] In addition, the total loss can be used to backpropagate and update the parameters of the target detection model.

[0133] Please refer to Figure 6, which is a schematic diagram of one embodiment of the electronic device 20 of this application. The electronic device 20 of this application includes a processor 22, which is used to execute instructions to implement the methods provided by any of the above embodiments of this application and any non-conflicting combinations thereof.

[0134] Electronic device 20 may be a camera device or a server, etc., and is not limited here.

[0135] Processor 22 can also be referred to as CPU (Central Processing Unit). Processor 22 may be an integrated circuit chip with signal processing capabilities. Processor 22 can also be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component. A general-purpose processor can be a microprocessor, or processor 22 can be any conventional processor, etc.

[0136] The electronic device 20 may further include a memory 21 for storing instructions and data required for the processor 22 to run.

[0137] Please refer to Figure 7, which is a schematic diagram of the structure of a computer-readable storage medium in an embodiment of this application. The computer-readable storage medium 30 in this embodiment stores instruction / program data 31. When executed, this instruction / program data 31 implements the methods provided in any embodiment of the above-described method of this application, as well as any non-conflicting combination thereof. The instruction / program data 31 can be formed into a program file and stored in the storage medium 30 in the form of a software product, so that a computer device (which may be a personal computer, server, or network device, etc.) or processor can execute all or part of the steps of the methods in various embodiments of this application. The aforementioned storage medium 30 includes various media capable of storing program code, such as a USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, or devices such as computers, servers, mobile phones, and tablets.

[0138] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.

[0139] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0140] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0141] The above are merely embodiments of this application and do not limit the scope of this patent application. Any equivalent structural or procedural changes made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of this application.

Claims

1. A target detection method based on a target detection model, characterized in that the target detection model includes a feature extraction unit with multi-layer output, and the method includes: acquiring a target image and an image to be detected, wherein the target image contains the target to be processed; inputting the target image into the feature extraction unit to obtain a multi-layer feature map of the target image; inputting the image to be detected into the feature extraction unit to obtain a multi-layer feature map of the image to be detected; using the feature map of each layer of the target image as the convolution kernel, performing convolution processing on the feature map of the corresponding layer of the image to be detected to obtain a similarity feature map between the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image; and determining, based on the similarity feature maps of each layer of the feature map of the image to be detected and the feature maps of the corresponding layer of the target image, whether the image to be detected contains the target to be processed; wherein the step of determining whether the target to be processed is contained in the image to be detected based on the similarity feature maps of each layer of the feature map of the image to be detected and the feature maps of the corresponding layer of the target image includes: obtaining multiple similarity feature maps based on the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image; performing weighted fusion on the classification feature map of each similarity feature map to obtain the final classification feature map of the image to be detected, and performing weighted fusion on the detection feature map of each similarity feature map to obtain the final detection feature map of the image to be detected; and determining, based on the final classification feature map and the final detection feature map, whether the target to be processed is contained in the image to be detected.

2. The method according to claim 1, characterized in that the target detection model includes a classification unit and a detection unit, and the steps for obtaining the classification feature map and the detection feature map include: inputting each similarity feature map into the classification unit to obtain a classification feature map for each similarity feature map; and inputting each similarity feature map into the detection unit to obtain a detection feature map for each similarity feature map.

3. The method according to claim 1, characterized in that, The method further includes: If the target to be processed exists in the image to be detected, the location information of the target to be processed is output based on the final detection feature map, and the confidence information of the target to be processed is output based on the final classification feature map.

4. The method according to claim 1, characterized in that, The step of determining whether the image to be detected contains the target to be processed includes: Based on the determination result of whether the target to be processed is contained in the image to be detected, the parameters of the target detection model are optimized; The image to be detected is subjected to size transformation and / or position transformation; The transformed image to be detected is used as the image to be detected. The process of inputting the target image into the feature extraction unit to obtain the multi-layer feature map of the target image is then repeated. The image to be detected is then input into the feature extraction unit to obtain the multi-layer feature map of the image to be detected. The parameters of the target detection model are then optimized again using the transformed image to be detected and the target image until the target detection model converges.

5. The method according to claim 4, characterized in that, The optimization of the parameters of the target detection model based on the determination result of whether the target to be processed is contained in the image to be detected includes: Based on the determination result of whether the target to be processed is contained in the image to be detected, the classification loss and regression loss of the target detection model are calculated; Calculate the total loss based on the classification loss and the regression loss; The parameters of the target detection model are optimized based on the total loss.

6. The method according to claim 1, characterized in that, The feature extraction unit includes two feature extraction branches with identical structures, and the feature extraction branches have multiple outputs; The step of inputting the target image into the feature extraction unit to obtain a multi-layer feature map of the target image includes: The target image is input into a feature extraction branch of the feature extraction unit to obtain a multi-layer feature map of the target image; The step of inputting the image to be detected into the feature extraction unit to obtain a multi-layer feature map of the image to be detected includes: The image to be detected is input into another feature extraction branch of the feature extraction unit to obtain a multi-layer feature map of the image to be detected.

7. The method according to claim 1, characterized in that the target detection model includes a convolutional unit; and obtaining multiple similarity feature maps based on the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image includes: using the convolutional unit, convolving the feature map of each layer of the image to be detected with the feature map of the corresponding layer of the target image as the convolution kernel, to obtain the multiple similarity feature maps.
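
The operation in claim 7 can be sketched for single-channel feature maps as below. It assumes that "convolution" means the sliding-window cross-correlation (no kernel flip) that deep learning frameworks usually implement, with no padding and stride 1; the function names `cross_correlate` and `similarity_maps` are hypothetical.

```python
def cross_correlate(search_map, kernel):
    """Slide the target feature map (kernel) over the search feature map.

    Both inputs are 2D lists; the output has one response per valid position.
    """
    kh, kw = len(kernel), len(kernel[0])
    sh, sw = len(search_map), len(search_map[0])
    out = []
    for i in range(sh - kh + 1):
        row = []
        for j in range(sw - kw + 1):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += search_map[i + u][j + v] * kernel[u][v]
            row.append(acc)
        out.append(row)
    return out

def similarity_maps(search_layers, target_layers):
    """One similarity map per layer, pairing corresponding layers."""
    return [cross_correlate(s, t) for s, t in zip(search_layers, target_layers)]

# The target pattern sits in the bottom-right corner of the search map.
target_layer = [[1, 2], [3, 4]]
search_layer = [[0, 0, 0], [0, 1, 2], [0, 3, 4]]
sim = similarity_maps([search_layer], [target_layer])[0]
```

The strongest response in `sim` falls at the position where the search map matches the target pattern, which is the property the later detection step relies on.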

8. The method according to claim 1, characterized in that obtaining multiple similarity feature maps based on the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image involves: performing resolution normalization on the multi-layer feature map of the target image and on the multi-layer feature map of the image to be detected; and the step of obtaining multiple similarity feature maps based on the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image includes: obtaining the multiple similarity feature maps based on the resolution-normalized feature map of each layer of the target image and the resolution-normalized feature map of the corresponding layer of the image to be detected.
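
One possible reading of the resolution normalization in claim 8 is resizing every layer's feature map to a common height and width. The sketch below assumes nearest-neighbour resampling, which the claim does not specify; `resize_nearest` and `normalize_resolution` are hypothetical names.

```python
def resize_nearest(fmap, out_h, out_w):
    """Resize a 2D feature map to (out_h, out_w) by nearest-neighbour sampling."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]

def normalize_resolution(layers, out_h, out_w):
    """Bring every layer of a multi-layer feature map to the same resolution."""
    return [resize_nearest(f, out_h, out_w) for f in layers]

layers = [
    [[1, 2], [3, 4]],                       # coarse 2x2 layer
    [[r * 4 + c for c in range(4)] for r in range(4)],  # finer 4x4 layer
]
normalized = normalize_resolution(layers, 4, 4)
```

After this step all layers share one resolution, so the per-layer similarity maps computed from them also share one resolution.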

9. The method according to claim 8, characterized in that performing resolution normalization on the multi-layer feature map of the target image and on the multi-layer feature map of the image to be detected includes: performing dimension normalization on the resolution-normalized multi-layer feature map of the target image, and performing dimension normalization on the resolution-normalized multi-layer feature map of the image to be detected; and the step of obtaining multiple similarity feature maps based on the feature map of each layer of the image to be detected and the feature map of the corresponding layer of the target image includes: obtaining the multiple similarity feature maps based on the dimension-normalized feature map of each layer of the target image and the dimension-normalized feature map of the corresponding layer of the image to be detected.

10. An electronic device, characterized in that the electronic device includes a processor configured to execute instructions to implement the steps of the method according to any one of claims 1 to 9.

11. A computer-readable storage medium having a program and/or instructions stored thereon, characterized in that the program and/or instructions, when executed, implement the steps of the method according to any one of claims 1 to 9.