A target detection method and related device

By employing a multi-level feature extraction and enhanced target detection method, combined with a cross-attention approach, the problem of insufficient target detection accuracy in existing technologies is solved, achieving higher accuracy in acquiring object position information.

CN117173626B · Active · Publication Date: 2026-06-19 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2023-07-27
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, neural network models obtain object location information based solely on feature fusion results during object detection, resulting in low output accuracy and an inability to accurately complete object detection.

Method used

By using a target model to extract and enhance multi-level features from the target image, and combining techniques such as cross-attention mechanism and reparameterized convolution, the feature fusion results are enhanced to obtain object position information, taking into account more comprehensive factors.

Benefits of technology

It improves the accuracy of target detection, enabling more accurate acquisition of the object's position information in the image and meeting users' detection needs.

Generated by Eureka AI based on patent content.

Figure CN117173626B_ABST

Abstract

This application discloses a target detection method and related equipment. The method considers a comprehensive range of factors during target detection, thus enabling accurate detection. The method includes: First, acquiring a target image containing the object to be detected and inputting the target image into a target model. Next, the target model extracts features from the target image to obtain a first feature, and further extracts features from the first feature to obtain a second feature. Then, the target model performs a first fusion of the first and second features to obtain a first fusion result. Subsequently, the target model enhances the first and second features based on the first fusion result to obtain enhanced first and second features. Finally, the target model uses the enhanced first and second features for detection to obtain the object's position information in the target image.

Description

Technical Field

[0001] This application relates to artificial intelligence (AI) technology, and more particularly to a target detection method and related equipment.

Background Technology

[0002] Object detection, as a fundamental computer vision task, is increasingly needed in various scenarios. To meet users' object detection needs in diverse applications, neural network models from the AI field can be used to perform object detection tasks and provide the detection results to users for viewing and use, thus improving the user experience.

[0003] In related technologies, when it is necessary to locate an object in a scene, a target image representing the scene is first acquired and input into a neural network model. The neural network model then extracts features from the target image, obtaining features at different levels. Next, the neural network model fuses these features to obtain a feature fusion result. Finally, the neural network model performs detection based on the feature fusion result, thereby obtaining the object's location information in the target image, which is equivalent to obtaining the object's location information within the scene.

[0004] In the above process, the neural network model obtains the object's position information directly from the feature fusion result, and therefore takes only a single factor into account. This leads to low accuracy of the position information output by the model, so target detection cannot be completed accurately.

Summary of the Invention

[0005] This application provides a target detection method and related equipment. A comprehensive range of factors is considered during target detection, so the final object position information is sufficiently accurate and target detection can be completed accurately.

[0006] A first aspect of this application provides a target detection method, which can be implemented through a target model, and the method includes:

[0007] When object detection is required for a certain scene, the scene can be photographed first to obtain a target image that represents the scene. The scene presented in the target image contains the object to be detected.

[0008] After obtaining the target image, it can be input into the target model. The target model can first extract features from the target image to obtain the first feature, and then further extract features from the first feature to obtain the second feature. It should be noted that the target model can extract features from multiple levels of the target image, and the first and second features can be features from two adjacent levels within those multiple levels. For example, the first feature might be the second-to-last level feature, and the second feature might be the last level feature.
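The two-step extraction described above can be illustrated with a minimal Python/NumPy sketch. The 2x2 average pooling used here is only a hypothetical stand-in for a backbone stage, and the function name and tensor shapes are assumptions made for this illustration; the text does not specify how each level is computed.

```python
import numpy as np

def downsample_stage(x: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for one feature-extraction stage.

    x has shape (C, H, W) with even H and W. Each 2x2 spatial block is
    averaged, so the result has shape (C, H/2, W/2).
    """
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

# Toy "target image" with 3 channels and 8x8 pixels.
image = np.arange(3 * 8 * 8, dtype=np.float64).reshape(3, 8, 8)

first_feature = downsample_stage(image)           # extracted from the image
second_feature = downsample_stage(first_feature)  # extracted from the first feature
```

The second feature is computed from the first feature rather than from the image, which is what makes the two features adjacent levels of the same hierarchy.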

[0009] After obtaining the first and second features, the target model can perform a first fusion on the first and second features to obtain a first fusion result. Using this first fusion result, the target model can enhance the first and second features, resulting in enhanced first and second features. With the enhanced first and second features, the target model can then perform detection, thereby obtaining and outputting the location information of these objects in the target image, which is equivalent to obtaining the location of these objects in the scene.

[0010] As can be seen from the above method, the target model obtains the position information of the object in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the position information of the object output by the target model has sufficient accuracy to accurately complete the target detection.

[0011] In one possible implementation, enhancing the first feature and the second feature based on the first fusion result to obtain the enhanced first feature and the enhanced second feature includes: injecting the first fusion result into the first feature to obtain the enhanced first feature, and determining the second feature as the enhanced second feature; or, injecting the first fusion result into the first feature to obtain the enhanced first feature, and injecting the first fusion result into the second feature to obtain the enhanced second feature; or, determining the first feature as the enhanced first feature, and injecting the first fusion result into the second feature to obtain the enhanced second feature. In the aforementioned implementation, the target model can complete the enhancement in several ways: (1) Suppose the target model includes an enhancement function only for the first feature. The target model can then inject the first fusion result into the first feature to obtain the enhanced first feature. Because it has no enhancement function for the second feature, the target model leaves the second feature unprocessed and directly determines it as the enhanced second feature. (2) Suppose the target model includes enhancement functions for both the first feature and the second feature. The target model can then inject the first fusion result into the first feature to obtain the enhanced first feature and, likewise, inject the first fusion result into the second feature to obtain the enhanced second feature. (3) Suppose the target model includes an enhancement function only for the second feature. The target model can then inject the first fusion result into the second feature to obtain the enhanced second feature. Because it has no enhancement function for the first feature, the target model leaves the first feature unprocessed and directly determines it as the enhanced first feature.
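The three alternatives above can be summarised as a small dispatch routine. The sketch below is only illustrative: `EnhanceMode`, `enhance`, and `toy_inject` are hypothetical names, and the additive `toy_inject` merely stands in for whatever injection operation the target model actually uses.

```python
from enum import Enum
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")

class EnhanceMode(Enum):
    FIRST_ONLY = 1   # case (1): only the first feature is enhanced
    BOTH = 2         # case (2): both features are enhanced
    SECOND_ONLY = 3  # case (3): only the second feature is enhanced

def enhance(first: T, second: T, fusion: T,
            inject: Callable[[T, T], T], mode: EnhanceMode) -> Tuple[T, T]:
    """Return (enhanced_first, enhanced_second) for the chosen mode.

    A feature that is not enhanced is passed through unchanged.
    """
    enhanced_first = first if mode is EnhanceMode.SECOND_ONLY else inject(fusion, first)
    enhanced_second = second if mode is EnhanceMode.FIRST_ONLY else inject(fusion, second)
    return enhanced_first, enhanced_second

def toy_inject(fusion: float, feature: float) -> float:
    """Placeholder injection (assumption): simply add the fusion result."""
    return feature + fusion

results = {mode: enhance(1.0, 2.0, 10.0, toy_inject, mode) for mode in EnhanceMode}
```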

[0012] In one possible implementation, injecting the first fusion result into the first feature to obtain the enhanced first feature includes: processing the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature. In the aforementioned implementation, the target model can perform pointwise convolution on the first feature to obtain the sixth feature. In parallel, the target model can perform pointwise convolution and activation function-based processing on the first fusion result to obtain the seventh feature, and can perform pointwise convolution alone (without the activation function) on the first fusion result to obtain the eighth feature. Then, the target model can multiply the sixth and seventh features to obtain the ninth feature, and add the eighth and ninth features to obtain the tenth feature. Finally, the target model performs reparameterized convolution on the tenth feature to obtain the eleventh feature, which is the enhanced first feature.
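The sequence from the sixth to the eleventh feature can be sketched in Python with NumPy as follows. This is a minimal illustration under several assumptions that the paragraph does not fix: the activation function is taken to be a sigmoid applied after the pointwise convolution, and the reparameterized convolution is modelled as a 1x1 branch plus an identity branch folded into a single 1x1 kernel. All function and parameter names are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rep_conv_branches(x, weight):
    """Multi-branch form (assumed): a 1x1 branch plus an identity branch."""
    return pointwise_conv(x, weight) + x

def rep_conv_merged(x, weight):
    """Equivalent single-branch form: both branches folded into one kernel."""
    return pointwise_conv(x, weight + np.eye(weight.shape[0]))

def inject_into_first(first, fusion, w_local, w_gate, w_global, w_rep):
    sixth = pointwise_conv(first, w_local)              # from the first feature
    seventh = sigmoid(pointwise_conv(fusion, w_gate))   # conv + activation
    eighth = pointwise_conv(fusion, w_global)           # conv only
    ninth = sixth * seventh                             # element-wise product
    tenth = eighth + ninth                              # element-wise sum
    eleventh = rep_conv_merged(tenth, w_rep)            # reparameterized conv
    return eleventh                                     # enhanced first feature

rng = np.random.default_rng(0)
first = rng.normal(size=(4, 8, 8))    # first feature, 4 channels
fusion = rng.normal(size=(6, 8, 8))   # first fusion result, same spatial size
w_local = rng.normal(size=(4, 4))
w_gate = rng.normal(size=(4, 6))
w_global = rng.normal(size=(4, 6))
w_rep = rng.normal(size=(4, 4))
enhanced_first = inject_into_first(first, fusion, w_local, w_gate, w_global, w_rep)
```

Under these assumptions the seventh feature acts as a gate on the local information in the sixth feature, while the eighth feature adds the global information directly.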

[0013] In one possible implementation, injecting the first fusion result into the second feature to obtain the enhanced second feature includes: processing the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature. In the aforementioned implementation, the target model can perform pointwise convolution on the second feature to obtain the twelfth feature. In parallel, the target model can perform pointwise convolution, activation function-based processing, and linear interpolation on the first fusion result to obtain the thirteenth feature, and can perform only pointwise convolution and linear interpolation (without the activation function) on the first fusion result to obtain the fourteenth feature. Then, the target model can multiply the twelfth and thirteenth features to obtain the fifteenth feature, and add the fourteenth and fifteenth features to obtain the sixteenth feature. Finally, the target model performs reparameterized convolution on the sixteenth feature to obtain the seventeenth feature, which is the enhanced second feature.
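The only structural difference from the first-feature case is the linear interpolation, which brings the branches derived from the first fusion result to the spatial size of the second feature. The NumPy sketch below illustrates this under stated assumptions: bilinear interpolation with corner alignment, a sigmoid activation, and a single 1x1 kernel standing in for the reparameterized convolution. None of these choices is specified in the paragraph, and all names are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear_resize(x, out_hw):
    """Bilinear resize of a (C, H, W) map, sampling corner to corner."""
    c, h, w = x.shape
    out_h, out_w = out_hw
    ys = np.linspace(0.0, h - 1, out_h)
    xs = np.linspace(0.0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    ty = (ys - y0)[None, :, None]
    tx = (xs - x0)[None, None, :]
    rows = x[:, y0, :] * (1.0 - ty) + x[:, y1, :] * ty      # interpolate along H
    return rows[:, :, x0] * (1.0 - tx) + rows[:, :, x1] * tx  # then along W

def inject_into_second(second, fusion, w_local, w_gate, w_global, w_rep):
    hw = second.shape[1:]
    twelfth = pointwise_conv(second, w_local)
    thirteenth = linear_resize(sigmoid(pointwise_conv(fusion, w_gate)), hw)
    fourteenth = linear_resize(pointwise_conv(fusion, w_global), hw)
    fifteenth = twelfth * thirteenth
    sixteenth = fourteenth + fifteenth
    # Stand-in for the reparameterized convolution: one merged 1x1 kernel.
    seventeenth = pointwise_conv(sixteenth, w_rep)
    return seventeenth  # enhanced second feature

eye = np.eye(2)
enhanced_second = inject_into_second(
    np.ones((2, 2, 2)),   # second feature (coarser resolution)
    np.ones((2, 4, 4)),   # first fusion result (finer resolution)
    eye, np.zeros((2, 2)), eye, eye,
)
```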

[0014] In one possible implementation, the method further includes: preprocessing the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and performing cross-attention mechanism-based processing on the first fusion result and the first feature to obtain an enhanced first feature includes: performing cross-attention mechanism-based processing on the first fusion result and the preprocessed first feature to obtain the enhanced first feature. In the aforementioned implementation, the target model can align the second feature to the first feature to obtain the eighteenth feature, and perform pointwise convolution on the first feature to obtain the nineteenth feature. Next, the target model can concatenate the eighteenth and nineteenth features to obtain the twentieth feature. Then, the target model can perform pointwise convolution on the twentieth feature to obtain the twenty-first feature, which is the preprocessed first feature. After obtaining the preprocessed first feature, the target model can perform pointwise convolution on the preprocessed first feature to obtain the sixth feature. In parallel, the target model can perform pointwise convolution and activation function-based processing on the first fusion result to obtain the seventh feature, and can perform pointwise convolution alone (without the activation function) on the first fusion result to obtain the eighth feature. Then, the target model can multiply the sixth and seventh features to obtain the ninth feature, and add the eighth and ninth features to obtain the tenth feature. Finally, the target model performs reparameterized convolution on the tenth feature to obtain the eleventh feature, which is the enhanced first feature.
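The preprocessing steps (eighteenth to twenty-first feature) can be sketched as follows. Nearest-neighbour resizing is assumed for the alignment, which the paragraph does not specify, and the function names and shapes are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def align_to(x, target_hw):
    """Nearest-neighbour resize of a (C, H, W) map (assumed alignment)."""
    _, h, w = x.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return x[:, rows][:, :, cols]

def preprocess_first(first, second, w_first, w_merge):
    eighteenth = align_to(second, first.shape[1:])     # second aligned to first
    nineteenth = pointwise_conv(first, w_first)        # conv on the first feature
    twentieth = np.concatenate([eighteenth, nineteenth], axis=0)  # channel concat
    twenty_first = pointwise_conv(twentieth, w_merge)  # preprocessed first feature
    return twenty_first

first = np.ones((4, 4, 4))          # first feature: 4 channels, 4x4
second = 2.0 * np.ones((8, 2, 2))   # second feature: 8 channels, 2x2
preprocessed_first = preprocess_first(first, second, np.eye(4), np.ones((1, 12)))
```

The preprocessed first feature would then take the place of the first feature as input to the injection step.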

[0015] In one possible implementation, the method further includes: preprocessing the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and performing cross-attention mechanism-based processing on the first fusion result and the second feature to obtain an enhanced second feature includes: performing cross-attention mechanism-based processing on the first fusion result and the preprocessed second feature to obtain the enhanced second feature. In the aforementioned implementation, the target model can align the first feature to the second feature to obtain a twenty-second feature, and perform pointwise convolution on the second feature to obtain a twenty-third feature. Next, the target model can concatenate the twenty-second and twenty-third features to obtain a twenty-fourth feature. Then, the target model can perform pointwise convolution on the twenty-fourth feature to obtain a twenty-fifth feature, which is the preprocessed second feature. After obtaining the preprocessed second feature, the target model can perform pointwise convolution on the preprocessed second feature to obtain the twelfth feature. In parallel, the target model can perform pointwise convolution, activation function-based processing, and linear interpolation on the first fusion result to obtain the thirteenth feature, and can perform only pointwise convolution and linear interpolation (without the activation function) on the first fusion result to obtain the fourteenth feature. Then, the target model can multiply the twelfth and thirteenth features to obtain the fifteenth feature, and add the fourteenth and fifteenth features to obtain the sixteenth feature. Finally, the target model performs reparameterized convolution on the sixteenth feature to obtain the seventeenth feature, which is the enhanced second feature.

[0016] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution. In the aforementioned implementation, after obtaining the first feature and the second feature, the target model can first align the second feature to the first feature to obtain the third feature. Next, the target model can concatenate the first feature and the third feature to obtain the fourth feature. Then, the target model can convolve the fourth feature to obtain the fifth feature, which is the first fusion result.
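As an illustration only, the three steps of the first fusion described above can be sketched in Python with NumPy. The nearest-neighbour resize used for alignment, the 1x1 kernel used for the convolution, and all names and shapes are assumptions made for this sketch; the paragraph does not fix them.

```python
import numpy as np

def align_to(x, target_hw):
    """Nearest-neighbour resize of a (C, H, W) map (assumed alignment)."""
    _, h, w = x.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return x[:, rows][:, :, cols]

def first_fusion(first, second, weight):
    """Align, concatenate, convolve; returns the first fusion result."""
    third = align_to(second, first.shape[1:])          # second aligned to first
    fourth = np.concatenate([first, third], axis=0)    # channel-wise concatenation
    fifth = np.einsum("oc,chw->ohw", weight, fourth)   # 1x1 convolution (assumed)
    return fifth

rng = np.random.default_rng(0)
first = rng.normal(size=(4, 4, 4))    # finer level: 4 channels, 4x4
second = rng.normal(size=(8, 2, 2))   # coarser level: 8 channels, 2x2
weight = rng.normal(size=(6, 12))     # maps 4 + 8 channels to 6
fusion_result = first_fusion(first, second, weight)
```

Because the second feature is aligned to the first, the fusion result in this sketch has the spatial size of the first feature.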

[0017] In one possible implementation, the method further includes: performing a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; obtaining the position information of the object in the target image based on the enhanced first feature and the enhanced second feature includes: enhancing the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtaining the position information of the object in the target image based on the second enhanced first feature and the second enhanced second feature. In the aforementioned implementation, after obtaining the enhanced first feature and the enhanced second feature, the target model can perform a second fusion on them to obtain a second fusion result. The target model can then use the second fusion result to enhance the enhanced first feature and the enhanced second feature, thereby obtaining a second enhanced first feature and a second enhanced second feature. With these, the target model can perform detection, thereby obtaining and outputting the position information of the objects in the target image, which is equivalent to obtaining the position of these objects in the scene. The second enhanced first feature is obtained based on the first feature, the first fusion result, and the second fusion result, and the second enhanced second feature is obtained based on the second feature, the first fusion result, and the second fusion result. The first feature and the second feature represent different local information of the target image, the first fusion result represents the low-dimensional global information of the target image, and the second fusion result represents the high-dimensional global information of the target image. The target model therefore considers more comprehensive factors during target detection, so the position information of the object that it outputs can have higher accuracy, and target detection can be completed more accurately.

[0018] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, self-attention-based processing, feedforward network-based processing, or addition. In the aforementioned implementation, after obtaining the enhanced first feature and the enhanced second feature, the target model can first align the enhanced first feature to the enhanced second feature to obtain the twenty-sixth feature. Next, the target model can concatenate the enhanced second feature and the twenty-sixth feature to obtain the twenty-seventh feature. Then, the target model can perform self-attention-based processing, feedforward network-based processing, and addition on the twenty-seventh feature to obtain the twenty-eighth feature, which is the second fusion result.
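A minimal NumPy sketch of the second fusion follows. Several details are assumptions rather than statements of the method: alignment uses nearest-neighbour resizing, each spatial position is treated as one token, the self-attention has a single head, the feedforward network uses a ReLU, and "addition" is interpreted as a residual connection after each of the two sub-steps. All names are hypothetical.

```python
import numpy as np

def align_to(x, target_hw):
    """Nearest-neighbour resize of a (C, H, W) map (assumed alignment)."""
    _, h, w = x.shape
    th, tw = target_hw
    rows = np.arange(th) * h // th
    cols = np.arange(tw) * w // tw
    return x[:, rows][:, :, cols]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def second_fusion(first_enh, second_enh, wq, wk, wv, w1, w2):
    twenty_sixth = align_to(first_enh, second_enh.shape[1:])
    twenty_seventh = np.concatenate([second_enh, twenty_sixth], axis=0)
    c, h, w = twenty_seventh.shape
    tokens = twenty_seventh.reshape(c, h * w).T           # (H*W, C) tokens
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[1]))         # self-attention weights
    tokens = tokens + attn @ v                            # addition after attention
    tokens = tokens + np.maximum(tokens @ w1, 0.0) @ w2   # addition after FFN
    return tokens.T.reshape(c, h, w)                      # twenty-eighth feature

rng = np.random.default_rng(0)
first_enh = rng.normal(size=(4, 4, 4))
second_enh = rng.normal(size=(8, 2, 2))
wq, wk = rng.normal(size=(12, 5)), rng.normal(size=(12, 5))
wv = rng.normal(size=(12, 12))
w1, w2 = rng.normal(size=(12, 16)), rng.normal(size=(16, 12))
second_fusion_result = second_fusion(first_enh, second_enh, wq, wk, wv, w1, w2)
```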

[0019] A second aspect of this application provides a model training method, comprising: acquiring a training image containing an object to be detected; processing the training image using a model to be trained to obtain position information of the object in the training image, wherein the model to be trained is used to: extract features from the training image to obtain a first feature, and extract features from the first feature to obtain a second feature; perform a first fusion on the first feature and the second feature to obtain a first fusion result; enhance the first feature and the second feature based on the first fusion result to obtain enhanced first feature and enhanced second feature; obtain position information based on the enhanced first feature and enhanced second feature; and train the model to be trained based on the position information and the actual position information of the object in the training image to obtain a target model.
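The final training step, comparing the predicted position information with the actual position information and updating the model, can be illustrated with a toy example. The paragraph does not name a loss function or optimizer, so the mean squared error, plain gradient descent, and the linear "detection head" below are all assumptions chosen for brevity, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled features for 32 training images, 8 values each.
features = rng.normal(size=(32, 8))
# Synthetic "actual position information": one (x, y, w, h) box per image.
actual_boxes = features @ rng.normal(size=(8, 4))

head = np.zeros((8, 4))  # toy stand-in for the parameters of the model to be trained

def mse(pred, actual):
    """Mean squared error between predicted and actual positions."""
    return float(((pred - actual) ** 2).mean())

initial_loss = mse(features @ head, actual_boxes)
for _ in range(200):
    pred = features @ head                                     # predicted positions
    grad = 2.0 * features.T @ (pred - actual_boxes) / pred.size  # d(MSE)/d(head)
    head -= 0.1 * grad                                         # gradient-descent update
final_loss = mse(features @ head, actual_boxes)
```

In a full model the gradient would flow through the extraction, fusion, and enhancement stages as well; only the comparison between predicted and actual positions is shown here.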

[0020] The target model obtained by the above method has object detection capabilities. When object detection is required, a target image containing the object to be detected can be acquired and input into the target model. Next, the target model can extract features from the target image to obtain a first feature, and further extract features from the first feature to obtain a second feature. Then, the target model can perform a first fusion of the first and second features to obtain a first fusion result. Subsequently, the target model can enhance the first and second features based on the first fusion result to obtain enhanced first and second features. Finally, the target model can use the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the object detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0021] In one possible implementation, the model to be trained is used to: inject a first fusion result into a first feature to obtain an enhanced first feature, and determine a second feature as the enhanced second feature; or, inject a first fusion result into a first feature to obtain an enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature; or, determine a first feature as the enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature.

[0022] In one possible implementation, the model to be trained is used to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0023] In one possible implementation, the model to be trained is used to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0024] In one possible implementation, the model to be trained is further used to preprocess the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain an enhanced first feature.

[0025] In one possible implementation, the model to be trained is further used to preprocess the second feature based on the first feature to obtain the preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0026] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution.

[0027] In one possible implementation, the model to be trained is further used to perform a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; the model to be trained is used to: enhance the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtain the position information of the object in the training image based on the second enhanced first feature and the second enhanced second feature.

[0028] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, self-attention-based processing, feedforward network-based processing, or addition.

[0029] A third aspect of this application provides a target detection apparatus comprising a target model. The apparatus includes: an acquisition module for acquiring a target image containing an object to be detected; an extraction module for extracting features from the target image to obtain a first feature, and further extracting features from the first feature to obtain a second feature; a fusion module for performing a first fusion on the first feature and the second feature to obtain a first fusion result; an enhancement module for enhancing the first feature and the second feature based on the first fusion result to obtain enhanced first features and enhanced second features; and a detection module for acquiring the position information of the object in the target image based on the enhanced first feature and enhanced second feature.

[0030] As can be seen from the above apparatus, when object detection is required, a target image containing the object to be detected can be acquired and input into the target model. Next, the target model can extract features from the target image to obtain a first feature, and further extract features from the first feature to obtain a second feature. Then, the target model can perform a first fusion of the first and second features to obtain a first fusion result. Subsequently, the target model can enhance the first and second features based on the first fusion result to obtain enhanced first and second features. Finally, the target model can use the enhanced first and second features for detection to obtain the object's position information in the target image. Thus, object detection is completed. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0031] In one possible implementation, the enhancement module is configured to: inject a first fusion result into a first feature to obtain an enhanced first feature, and determine a second feature as the enhanced second feature; or, inject a first fusion result into a first feature to obtain an enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature; or, determine a first feature as the enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature.

[0032] In one possible implementation, the enhancement module is configured to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0033] In one possible implementation, the enhancement module is configured to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0034] In one possible implementation, the device further includes: a first preprocessing module for preprocessing the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and an enhancement module for processing the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain an enhanced first feature.

[0035] In one possible implementation, the device further includes: a second preprocessing module for preprocessing the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and an enhancement module for processing the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain an enhanced second feature.

[0036] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution.

[0037] In one possible implementation, the device further includes: a second fusion module for performing a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; and a detection module for: enhancing the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtaining the position information of the object in the target image based on the second enhanced first feature and the second enhanced second feature.

[0038] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, self-attention-based processing, feedforward network-based processing, or addition.

[0039] A fourth aspect of this application provides a model training apparatus, comprising: an acquisition module for acquiring a training image containing an object to be detected; a processing module for processing the training image using a model to be trained to obtain position information of the object in the training image, wherein the model to be trained is configured to: extract features from the training image to obtain a first feature, and extract features from the first feature to obtain a second feature; perform a first fusion on the first feature and the second feature to obtain a first fusion result; enhance the first feature and the second feature based on the first fusion result to obtain enhanced first features and enhanced second features; and acquire position information based on the enhanced first features and enhanced second features; and a training module for training the model to be trained based on the position information and the actual position information of the object in the training image to obtain a target model.

[0040] The target model trained by the aforementioned device possesses target detection capabilities. When target detection is required, a target image containing the object to be detected can be acquired and input into the target model. Next, the target model can extract features from the target image to obtain a first feature, and further extract features from the first feature to obtain a second feature. Then, the target model can perform a first fusion of the first and second features to obtain a first fusion result. Subsequently, based on the first fusion result, the target model can enhance the first and second features to obtain enhanced first and second features. Finally, the target model can use the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the target detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0041] In one possible implementation, the model to be trained is used to: inject a first fusion result into a first feature to obtain an enhanced first feature, and determine a second feature as the enhanced second feature; or, inject a first fusion result into a first feature to obtain an enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature; or, determine a first feature as the enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature.
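
The three alternatives listed above differ only in which feature receives the first fusion result. A minimal sketch follows, assuming that injection is a plain element-wise addition and that the features and the fusion result already have the same length; `InjectMode`, `inject`, and `enhance` are hypothetical names, not terms from this application.

```python
from enum import Enum
from typing import List, Tuple

Vector = List[float]


class InjectMode(Enum):
    FIRST_ONLY = "first_only"    # inject into the first feature only
    BOTH = "both"                # inject into both features
    SECOND_ONLY = "second_only"  # inject into the second feature only


def inject(feature: Vector, fusion: Vector) -> Vector:
    """Toy injection: element-wise addition (an assumption)."""
    return [f + g for f, g in zip(feature, fusion)]


def enhance(f1: Vector, f2: Vector, fusion: Vector,
            mode: InjectMode) -> Tuple[Vector, Vector]:
    """Return (enhanced first feature, enhanced second feature).

    A feature that is not injected is passed through unchanged, i.e. it is
    "determined as" the enhanced feature.
    """
    into_first = mode in (InjectMode.FIRST_ONLY, InjectMode.BOTH)
    into_second = mode in (InjectMode.SECOND_ONLY, InjectMode.BOTH)
    e1 = inject(f1, fusion) if into_first else list(f1)
    e2 = inject(f2, fusion) if into_second else list(f2)
    return e1, e2
```

For example, `enhance([1, 2], [3, 4], [10, 10], InjectMode.FIRST_ONLY)` changes only the first feature and returns the second feature as is.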

[0042] In one possible implementation, the model to be trained is used to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0043] In one possible implementation, the model to be trained is used to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.
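
One way to read "processing the first fusion result and a feature based on a cross-attention mechanism" is sketched below. The sketch makes several assumptions that the text does not state: the feature supplies the queries, the fusion result supplies both the keys and the values, there are no learned projection matrices, and the attended context is added back to the feature as a residual. The function names are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per token, one column per channel


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(feature: Matrix, fusion: Matrix) -> Matrix:
    """Enhance each feature token with information from the fusion tokens.

    Queries come from `feature`; keys and values come from `fusion`
    (an assumption). Scaled dot-product attention, then a residual add.
    """
    d = len(feature[0])
    enhanced = []
    for q in feature:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in fusion]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, fusion))
                   for j in range(d)]
        enhanced.append([qi + ci for qi, ci in zip(q, context)])
    return enhanced
```

The output has the same shape as the input feature, so it can be used wherever the original feature was used, which fits the description of the result as an "enhanced" feature.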

[0044] In one possible implementation, the model to be trained is further used to preprocess the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain an enhanced first feature.

[0045] In one possible implementation, the model to be trained is further used to preprocess the second feature based on the first feature to obtain the preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0046] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution.

[0047] In one possible implementation, the model to be trained is further used to perform a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; the model to be trained is used to: enhance the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtain the position information of the object in the training image based on the second enhanced first feature and the second enhanced second feature.

[0048] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, processing based on a self-attention mechanism, processing based on a feedforward network, or addition.

[0049] A fifth aspect of this application provides a target detection apparatus, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the target detection apparatus performs the method described in the first aspect or any possible implementation thereof.

[0050] A sixth aspect of this application provides a model training apparatus, which includes a memory and a processor; the memory stores code, and the processor is configured to execute the code. When the code is executed, the model training apparatus performs the method described in the second aspect or any possible implementation thereof.

[0051] A seventh aspect of this application provides a circuit system including a processing circuit configured to perform the method described in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect.

[0052] An eighth aspect of this application provides a chip system including a processor for calling a computer program or computer instructions stored in a memory, such that the processor performs the method as described in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect.

[0053] In one possible implementation, the processor is coupled to the memory via an interface.

[0054] In one possible implementation, the chip system also includes a memory that stores computer programs or computer instructions.

[0055] A ninth aspect of this application provides a computer storage medium storing a computer program that, when executed by a computer, causes the computer to perform the method as described in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect.

[0056] A tenth aspect of this application provides a computer program product storing instructions that, when executed by a computer, cause the computer to perform the method as described in the first aspect, any possible implementation of the first aspect, the second aspect, or any possible implementation of the second aspect.

[0057] In the embodiments of this application, when object detection is required, a target image containing the object to be detected can be acquired and input into a target model. The target model then extracts features from the target image to obtain a first feature, and further extracts features from the first feature to obtain a second feature. The target model then performs a first fusion of the first and second features to obtain a first fusion result. Subsequently, based on the first fusion result, the target model enhances the first and second features to obtain enhanced first and second features. Finally, the target model uses the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the object detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

Description of the Drawings

[0058] Figure 1 is a structural diagram of the main framework of artificial intelligence;

[0059] Figure 2a is a schematic structural diagram of the target detection system provided in the embodiments of this application;

[0060] Figure 2b is another schematic structural diagram of the target detection system provided in the embodiments of this application;

[0061] Figure 2c is a schematic diagram of the devices related to target detection provided in the embodiments of this application;

[0062] Figure 3 is a schematic diagram of the architecture of system 100 provided in the embodiments of this application;

[0063] Figure 4 is a schematic structural diagram of the target model provided in the embodiments of this application;

[0064] Figure 5 is a flowchart of the target detection method provided in the embodiments of this application;

[0065] Figure 6 is another schematic structural diagram of the target model provided in the embodiments of this application;

[0066] Figure 7 is a schematic diagram of the low-dimensional alignment module and low-dimensional fusion module provided in the embodiments of this application;

[0067] Figure 8 is a schematic structural diagram of the injection module provided in the embodiments of this application;

[0068] Figure 9 is another schematic structural diagram of the injection module provided in the embodiments of this application;

[0069] Figure 10 is another schematic structural diagram of the target model provided in the embodiments of this application;

[0070] Figure 11 is a schematic structural diagram of the enhanced injection module provided in the embodiments of this application;

[0071] Figure 12 is a schematic diagram of cross-layer information fusion processing provided in the embodiments of this application;

[0072] Figure 13 is another schematic diagram of the enhanced injection module provided in the embodiments of this application;

[0073] Figure 14 is another schematic diagram of cross-layer information fusion processing provided in the embodiments of this application;

[0074] Figure 15 is another schematic structural diagram of the target model provided in the embodiments of this application;

[0075] Figure 16 is a flowchart of the target detection method provided in the embodiments of this application;

[0076] Figure 17 is another schematic structural diagram of the target model provided in the embodiments of this application;

[0077] Figure 18 is a schematic diagram of the high-dimensional alignment module and high-dimensional fusion module provided in the embodiments of this application;

[0078] Figure 19 is a schematic diagram of the comparison results provided in the embodiments of this application;

[0079] Figure 20 is a schematic flowchart of the model training method provided in the embodiments of this application;

[0080] Figure 21 is a schematic structural diagram of the target detection apparatus provided in the embodiments of this application;

[0081] Figure 22 is a schematic structural diagram of the model training apparatus provided in the embodiments of this application;

[0082] Figure 23 is a schematic structural diagram of the execution device provided in the embodiments of this application;

[0083] Figure 24 is a schematic structural diagram of the training device provided in the embodiments of this application;

[0084] Figure 25 is a schematic structural diagram of the chip provided in the embodiments of this application.

Detailed Description of the Embodiments

[0085] This application provides a target detection method and related equipment. The factors considered during the target detection process are comprehensive, so the final object position information has sufficient accuracy and can accurately complete the target detection.

[0086] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same attributes in the embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a series of elements is not necessarily limited to those elements, but may include other elements not explicitly listed or inherent to those processes, methods, products, or apparatuses.

[0087] Object detection, as a fundamental computer vision task, is increasingly needed in various scenarios. To meet users' object detection needs in diverse applications (such as autonomous driving, intelligent security, robot navigation, and medical diagnosis), neural network models from the AI field can be used to perform object detection tasks, and the detection results can then be provided to users for viewing and use, thus improving the user experience.

[0088] In related technologies, when it is necessary to locate an object in a scene, a target image representing the scene is first acquired and input into a neural network model. The neural network model may include a feature extraction module, a feature fusion module, and a detection module. Each layer of the feature extraction module extracts features from the target image and outputs the features obtained at each layer, i.e., features at different levels. Next, the feature fusion module fuses the features at different levels to obtain a feature fusion result. Then, the detection module performs detection based on the feature fusion result, thereby obtaining and outputting the object's position information in the target image, which is equivalent to obtaining the object's position information in the scene.

[0089] In the above process, the neural network model directly obtains the object's position information based on the feature fusion result, which considers a relatively singular factor. This leads to low accuracy of the object's position information output by the model, making it unable to accurately complete target detection.

[0090] To address the aforementioned problems, this application provides a target detection method that can be implemented in conjunction with artificial intelligence (AI) technology. AI technology is a discipline that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence. AI technology achieves optimal results by perceiving the environment, acquiring knowledge, and using that knowledge. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new type of intelligent machine that can react in a way similar to human intelligence. Using artificial intelligence for data processing is a common application of AI.

[0091] First, the overall workflow of the artificial intelligence system is described. Please refer to Figure 1, which is a structural diagram of the main framework of artificial intelligence. The following explains the AI framework along two dimensions: the "Intelligent Information Chain" (horizontal axis) and the "IT Value Chain" (vertical axis). The "Intelligent Information Chain" reflects a series of processes from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a condensation process of "data, information, knowledge, wisdom." The "IT Value Chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (the provision and processing of technical implementations) up to the industrial ecosystem of the system.

[0092] (1) Infrastructure

[0093] Infrastructure provides computing power to support artificial intelligence systems, enabling communication with the external world and providing support through a basic platform. This communication occurs through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); and the basic platform includes distributed computing frameworks and related platform guarantees and support, which may include cloud storage and computing, interconnected networks, etc. For example, sensors communicate with the outside world to acquire data, and this data is provided to intelligent chips in the distributed computing system provided by the basic platform for computation.

[0094] (2) Data

[0095] The data at the layer above the infrastructure represents the data sources in the field of artificial intelligence. The data involves graphics, images, voice, text, and IoT data from traditional devices, including business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.

[0096] (3) Data processing

[0097] Data processing typically includes methods such as data training, machine learning, deep learning, search, reasoning, and decision-making.

[0098] Among them, machine learning and deep learning can perform intelligent information modeling, extraction, preprocessing, and training of data by symbolizing and formalizing it.

[0099] Reasoning refers to the process in which, in a computer or intelligent system, the machine thinks and solves problems by simulating human intelligent reasoning, based on reasoning control strategies and using formalized information. Typical functions include search and matching.

[0100] Decision-making refers to the process of making decisions based on intelligent information after reasoning, and it typically provides functions such as classification, sorting, and prediction.

[0101] (4) General capabilities

[0102] After the data processing mentioned above, the results of the data processing can be used to form some general capabilities, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.

[0103] (5) Smart Products and Industry Applications

[0104] Intelligent products and industry applications refer to products and applications of artificial intelligence systems in various fields. They are the encapsulation of overall artificial intelligence solutions, productizing intelligent information decision-making and realizing practical applications. Their application areas mainly include: intelligent terminals, intelligent transportation, intelligent healthcare, autonomous driving, smart cities, etc.

[0105] The following section introduces several application scenarios for this application.

[0106] Figure 2a is a schematic structural diagram of a target detection system provided in an embodiment of this application. The target detection system includes a user device and a data processing device. The user device includes smart terminals such as mobile phones, personal computers, or information processing centers. The user device is the initiator of the target detection request; typically, the request is initiated by the user through the user device.

[0107] The aforementioned data processing device can be a cloud server, network server, application server, management server, or other device or server with data processing capabilities. The data processing device receives image processing requests from the smart terminal through an interactive interface, and then performs image processing by means of machine learning, deep learning, search, reasoning, decision-making, and the like, using a memory for storing data and a processor for processing data. The memory in the data processing device is a general term that includes local storage and a database storing historical data. The database can be located on the data processing device or on other network servers.

[0108] In the target detection system shown in Figure 2a, the user equipment can receive user instructions. For example, the user equipment can acquire an image input or selected by the user and then send a request to the data processing device, so that the data processing device performs an image processing application on the image acquired by the user equipment and obtains the corresponding processing result. For instance, the user equipment can acquire a target image input by the user (used to present a scene containing objects to be detected), and then send a processing request for the target image to the data processing device, so that the data processing device performs target-detection-based processing on the target image and obtains the position information of the object in the target image, that is, the coordinates of the object in the image coordinate system (built based on the target image).

[0109] In Figure 2a, the data processing device can execute the target detection method of the embodiments of this application.

[0110] Figure 2b is another schematic structural diagram of the target detection system provided in the embodiments of this application. In Figure 2b, the user equipment directly serves as the data processing device: it can directly acquire input from the user, and the input is processed by the hardware of the user equipment itself. The specific process is similar to that of Figure 2a; refer to the description above, which is not repeated here.

[0111] In the target detection system shown in Figure 2b, the user equipment can acquire the target image input by the user (used to present a scene containing the object to be detected), and then perform target-detection-based processing on the target image to obtain the position information of the object in the target image, that is, the coordinates of the object in the image coordinate system (built based on the target image).

[0112] In Figure 2b, the user equipment itself can execute the target detection method of the embodiments of this application.

[0113] Figure 2c is a schematic diagram of the devices related to target detection provided in an embodiment of this application.

[0114] The user equipment in Figures 2a and 2b above may specifically be local device 301 or local device 302 in Figure 2c, and the data processing device in Figure 2a may specifically be execution device 210 in Figure 2c. The data storage system 250 can store the data to be processed by the execution device 210, and can be integrated into the execution device 210 or set up in the cloud or on other network servers.

[0115] The processors in Figures 2a and 2b can perform data training, machine learning, or deep learning using neural network models or other models (e.g., support-vector-machine-based models), and then use the trained or learned models to perform image processing applications on the image to obtain the corresponding processing results.

[0116] Figure 3 is a schematic diagram of the architecture of system 100 provided in the embodiments of this application. In Figure 3, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. Users can input data to the I/O interface 112 through the client device 140. The input data in the embodiments of this application may include various scheduled tasks, callable resources, and other parameters.

[0117] While the execution device 110 preprocesses the input data, or while the calculation module 111 of the execution device 110 performs calculations and other related processing (such as implementing the neural network function in this application), the execution device 110 may call data, code, etc. in the data storage system 150 for the corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing in the data storage system 150.

[0118] Finally, the I/O interface 112 returns the processing result to the client device 140, thereby providing it to the user.

[0119] It is worth noting that the training device 120 can generate corresponding target models/rules based on different training data for different objectives or tasks. These target models/rules can then be used to achieve the aforementioned objectives or complete the aforementioned tasks, thereby providing the user with the required results. The training data can be stored in the database 130 and originates from training samples collected by the data acquisition device 160.

[0120] In the scenario shown in Figure 3, the user can manually provide input data, which can be done through the interface provided by the I/O interface 112. Alternatively, the client device 140 can automatically send input data to the I/O interface 112. If user authorization is required for the client device 140 to automatically send input data, the user can set the corresponding permissions in the client device 140. The user can view the output results of the execution device 110 on the client device 140, which can be presented in forms such as display, sound, or animation. The client device 140 can also act as a data acquisition terminal, collecting the input data fed into the I/O interface 112 and the output results returned by the I/O interface 112 as new sample data and storing them in the database 130. Alternatively, without going through the client device 140, the I/O interface 112 can directly store the input data it receives and the output results it returns in the database 130 as new sample data.

[0121] It is worth noting that Figure 3 is merely a schematic diagram of a system architecture provided in an embodiment of this application. The positional relationships between the devices, components, modules, etc., shown in the diagram do not constitute any limitation. For example, in Figure 3, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 can also be placed within the execution device 110. As shown in Figure 3, a neural network can be obtained through training by the training device 120.

[0122] This application also provides a chip including a neural network processing unit (NPU). The chip can be provided in the execution device 110 shown in Figure 3 to perform the calculations of the calculation module 111. The chip can also be provided in the training device 120 shown in Figure 3 to complete the training work of the training device 120 and output the target model/rules.

[0123] The neural network processing unit (NPU) is mounted as a coprocessor on the host central processing unit (host CPU), which assigns tasks to it. The core of the NPU is the arithmetic circuit; a controller directs the arithmetic circuit to retrieve data from memory (the weight memory or the input memory) and perform calculations.

[0124] In some implementations, the arithmetic circuit includes multiple processing engines (PEs). In some implementations, the arithmetic circuit is a two-dimensional systolic array. The arithmetic circuit can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit is a general-purpose matrix processor.

[0125] For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the data of matrix B from the weight memory and caches it in each PE of the arithmetic circuit. The arithmetic circuit then retrieves the data of matrix A from the input memory, performs matrix operations with matrix B, and stores the partial or final results of the resulting matrix in the accumulator.
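
The accumulation of partial results described above can be mimicked in software. The sketch below is only a functional illustration under simplifying assumptions: matrix B is held fixed (as if cached in the PEs), matrix A is consumed one column at a time, and a Python list plays the role of the accumulator. It does not model the timing or data movement of a real systolic array, and `matmul_accumulate` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix) -> Matrix:
    """Compute C = A x B by summing partial products into an accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0.0] * cols for _ in range(rows)]  # accumulator for matrix C
    for k in range(inner):          # one slice of A per step
        for i in range(rows):
            for j in range(cols):
                # Each multiply-add is what a single PE would perform.
                acc[i][j] += a[i][k] * b[k][j]
    return acc                      # final result after the last slice


print(matmul_accumulate([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```

After each value of `k`, `acc` holds a partial result; only after the last step does it hold the final matrix C.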

[0126] The vector computation unit can further process the output of the arithmetic circuit, for example through vector multiplication, vector addition, exponentiation, logarithmic operations, and magnitude comparison. For example, the vector computation unit can be used for computation in the non-convolutional/non-FC layers of a neural network, such as pooling, batch normalization, and local response normalization.

[0127] In some implementations, the vector computation unit can store the processed output vector into a unified buffer. For example, the vector computation unit can apply a nonlinear function to the output of the arithmetic circuit, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as activation input to the arithmetic circuit, for example, for use in subsequent layers of a neural network.

[0128] The unified memory is used to store input data and output data.

[0129] The direct memory access controller (DMAC) transfers input data from the external memory to the input memory and/or the unified memory, stores weight data from the external memory into the weight memory, and stores data from the unified memory into the external memory.

[0130] The bus interface unit (BIU) enables interaction between the host CPU, the DMAC, and the instruction fetch buffer via a bus.

[0131] The instruction fetch buffer, connected to the controller, stores the instructions used by the controller.

[0132] The controller invokes the instructions cached in the instruction fetch buffer to control the operation of the computing accelerator.

[0133] Generally, the unified memory, input memory, weight memory, and instruction fetch buffer are all on-chip memories, while the external memory is memory outside the NPU. The external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.

[0134] Since the embodiments of this application involve a large number of neural network applications, for ease of understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application will be introduced below.

[0135] (1) Neural Network

[0136] A neural network can be composed of neural units. A neural unit can be an operational unit that takes $x_s$ and an intercept of 1 as inputs, and whose output can be:

[0137] $$\text{output} = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

[0138] where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function can be used as the input of the next convolutional layer. The activation function can be the sigmoid function. A neural network is a network formed by connecting many such individual neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, which can be a region composed of several neural units.
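
As a small worked illustration of the neural unit described above, the following sketch computes the weighted sum of the inputs plus the bias and passes it through a sigmoid activation. The function names `sigmoid` and `neural_unit` and the example numbers are chosen here for illustration only.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation f(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(xs: Sequence[float], ws: Sequence[float], b: float) -> float:
    """Output of one neural unit: f(sum over s of W_s * x_s, plus b)."""
    weighted_sum = sum(w * x for w, x in zip(ws, xs))
    return sigmoid(weighted_sum + b)


# 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
print(neural_unit([1.0, 2.0], [0.5, -0.25], 0.0))  # 0.5
```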

[0139] The work of each layer in a neural network can be described by the mathematical expression y = a(Wx + b). From a physical perspective, the work of each layer in a neural network can be understood as transforming the input space (the set of input vectors) to the output space (i.e., from the row space to the column space of a matrix) through five operations on the input space. These five operations include: 1. Dimensionality increase / decrease; 2. Magnification / scaling; 3. Rotation; 4. Translation; 5. "Bending". Operations 1, 2, and 3 are performed by Wx, operation 4 by +b, and operation 5 by a(). The term "space" is used here because the objects being classified are not individual things, but a class of things, and space refers to the set of all individuals of this class of things. Here, W is the weight vector, and each value in this vector represents the weight value of a neuron in that layer of the neural network. This vector W determines the spatial transformation from the input space to the output space mentioned above; that is, the weights W of each layer control how the space is transformed. The purpose of training a neural network is to ultimately obtain the weight matrix of all layers of the trained neural network (a weight matrix formed by the vectors W of many layers). Therefore, the training process of a neural network is essentially about learning how to control the transformation space, and more specifically, learning the weight matrix.

[0140] Because we want the output of the neural network to be as close as possible to the value that is actually desired, we can compare the current network's prediction with the desired target value, and then update the weight vector of each layer of the neural network based on the difference between the two (there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the neural network). For example, if the network's prediction is too high, the weight vector is adjusted to make it predict lower, and this adjustment continues until the neural network can predict the desired target value or a value very close to it. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value," which is the role of the loss function or objective function. These are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so training the neural network becomes the process of reducing this loss as much as possible.
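
The compare-and-adjust loop described above can be sketched for the simplest possible model, a single weight `w` with prediction `w * x`. The mean squared error loss, the learning rate, and the function name `train` are assumptions made for this illustration; the application does not specify them.

```python
from typing import List, Sequence, Tuple


def train(xs: Sequence[float], ys: Sequence[float],
          lr: float = 0.1, steps: int = 200) -> Tuple[float, List[float]]:
    """Fit y ~ w * x by repeatedly reducing a mean squared error loss."""
    w = 0.0                      # initialization before the first update
    losses = []
    n = len(xs)
    for _ in range(steps):
        preds = [w * x for x in xs]
        # Loss: how far the predictions are from the target values.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        losses.append(loss)
        # Gradient of the loss with respect to w.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        # Predictions too high -> positive gradient -> w is lowered.
        w -= lr * grad
    return w, losses


w, losses = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(w, 4), losses[0] > losses[-1])  # 2.0 True
```

The recorded losses shrink from step to step, which is the "minimizing the loss" behavior the paragraph refers to.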

[0141] (2) Backpropagation algorithm

[0142] Neural networks can employ backpropagation (BP) to correct the parameters of the initial neural network model during training, thereby reducing the reconstruction error loss. Specifically, forward propagation of the input signal to the output generates error loss; this error loss information is then propagated back to update the parameters of the initial neural network model, leading to convergence of the error loss. The backpropagation algorithm is an error-loss-driven backpropagation process aimed at obtaining the optimal parameters of the neural network model, such as the weight matrix.
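
The forward and backward passes described above can be sketched for a small two-layer network as follows. The network size, the data, the learning rate, and the use of plain gradient descent are all assumptions made for this sketch; this application does not prescribe them.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 2))        # hypothetical training inputs
t = X[:, :1] - 0.5 * X[:, 1:]        # hypothetical target values

# Initialization before the first update.
W1, b1 = 0.5 * rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = 0.5 * rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.1
losses = []

for _ in range(300):
    # Forward propagation: input -> hidden layer -> output -> error loss.
    h = np.tanh(X @ W1 + b1)
    y = h @ W2 + b2
    losses.append(float(np.mean((y - t) ** 2)))

    # Backward propagation: chain rule from the loss back to each parameter.
    dy = 2.0 * (y - t) / len(X)
    dW2, db2 = h.T @ dy, dy.sum(axis=0)
    dz = (dy @ W2.T) * (1.0 - h ** 2)   # derivative of tanh
    dW1, db1 = X.T @ dz, dz.sum(axis=0)

    # Gradient-descent update of all parameters.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
```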

[0143] The method provided in this application is described below from the perspectives of neural network training and neural network application.

[0144] The model training method provided in this application involves data sequence processing and can be applied to data training, machine learning, deep learning, and other methods. It performs symbolic and formal intelligent information modeling, extraction, preprocessing, and training on training data (e.g., the training images in this application) to ultimately obtain a trained neural network (such as the target model in this application). Furthermore, the target detection method provided in this application can utilize the trained neural network, inputting input data (e.g., the target image in this application) into the trained neural network to obtain output data (such as the position information of the object in the target image in this application). It should be noted that the model training method and the target detection method provided in this application are inventions based on the same concept and can be understood as two parts of a system or two stages of a whole process: such as the model training stage and the model application stage.

[0145] The target detection method provided in the embodiments of this application is described below. The method can be implemented through a target model, which can have various structures. The first type of target model structure is described first. Figure 4 is a schematic diagram of the structure of the target model provided in the embodiments of this application. As shown in Figure 4, the target model includes a backbone network, a low-dimensional information aggregation-distribution branch, and a target detection head. The input of the backbone network serves as the input of the entire target model. The output of the backbone network is connected to the input of the low-dimensional information aggregation-distribution branch, and the output of the low-dimensional information aggregation-distribution branch is connected to the input of the target detection head. The output of the target detection head serves as the output of the entire target model. To further explain the workflow of the target model, it is described in more detail below. Figure 5 is a flowchart of the target detection method provided in the embodiments of this application. As shown in Figure 5, the method includes:

[0146] 501. Obtain the target image, which contains the object to be detected.

[0147] In this embodiment, when it is necessary to locate an object in a certain scene, the scene can be photographed first to obtain a target image for presenting the scene. It can be seen that the scene presented by the target image contains the object to be detected (to be located).

[0148] 502. Perform feature extraction on the target image to obtain the first feature, and then perform feature extraction on the first feature to obtain the second feature.

[0149] After obtaining the target image, it can be input into the target model. Therefore, the target model can first extract features from the target image to obtain the first feature, and then perform further feature extraction on the first feature to obtain the second feature.

[0150] Specifically, the target model can obtain the first feature and the second feature in the following ways:

[0151] After obtaining the target image, since the backbone network of the target model contains multiple feature extraction layers, the first feature extraction layer of the backbone network can extract features from the target image, thus obtaining the first level of features. The second feature extraction layer of the backbone network can extract features from the first level of features, thus obtaining the second level of features, and so on. The last feature extraction layer of the backbone network can extract features from the penultimate level of features, thus obtaining the last level of features. Therefore, the backbone network can output features at multiple levels. Among these multiple levels of features, the features at earlier levels have larger dimensions, and the features at later levels have smaller dimensions.
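
As a rough sketch of how successive feature extraction layers yield features of decreasing size, the following replaces each layer with a stand-in consisting of channel mixing, ReLU, and 2x2 average pooling. The stand-in layer, the channel counts, and the 64x64 input are assumptions for illustration and are not the backbone network of this application.

```python
import numpy as np

def avg_pool2x2(x):
    """Halve the height and width of a (C, H, W) map by 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def extraction_layer(x, weight):
    """Stand-in layer: pointwise channel mixing, ReLU, then 2x downsampling."""
    mixed = np.einsum("oc,chw->ohw", weight, x)
    return avg_pool2x2(np.maximum(mixed, 0.0))

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 64, 64))      # hypothetical target image
channels = [3, 8, 16, 32]                 # hypothetical channel counts

features = []
x = image
for c_in, c_out in zip(channels[:-1], channels[1:]):
    x = extraction_layer(x, rng.normal(size=(c_out, c_in)))
    features.append(x)                    # one feature per level

shapes = [f.shape for f in features]
```

Each level takes the previous level's feature as input, so the spatial size shrinks from level to level.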

[0152] Since the operations performed on the features of each level are similar, the following description takes two adjacent levels of features among the multiple levels as an example. The feature at the earlier level is called the first feature, and the feature at the later level is called the second feature. The size of the first feature is larger than the size of the second feature. For example, the first feature is the feature at the first level, and the second feature is the feature at the second level. In another example, the first feature is the feature at the fifth level, and the second feature is the feature at the sixth level. In yet another example, the first feature is the feature at the second-to-last level, and the second feature is the feature at the last level, and so on.

[0153] After obtaining the first feature and the second feature, the backbone network can send the first feature and the second feature to the low-dimensional information aggregation-distribution branch.

[0154] For example, as shown in Figure 6 (another schematic diagram of the structure of the target model provided in an embodiment of this application), the target model includes a backbone network, a low-dimensional information aggregation-distribution branch, and a target detection head. The backbone network contains three feature extraction layers. After the target image is input into the backbone network of the target model, the backbone network can output features B3 (first level), B4 (second level), and B5 (third level), and send B3, B4, and B5 to the low-dimensional information aggregation-distribution branch. The size of B3 is larger than the size of B4, which is larger than the size of B5.

[0155] 503. Perform a first fusion on the first feature and the second feature to obtain the first fusion result.

[0156] After obtaining the first feature and the second feature, the target model can perform a first fusion (a certain feature fusion method) on the first feature and the second feature to obtain the first fusion result.

[0157] Specifically, the target model can obtain the first fusion result in the following way:

[0158] After obtaining the first and second features, the low-dimensional information aggregation-distribution branch first aligns the second feature to the first feature to obtain the third feature. It can be understood that the size of the first feature is the same as the size of the third feature. Next, the low-dimensional information aggregation-distribution branch concatenates the first and third features to obtain the fourth feature. Then, the low-dimensional information aggregation-distribution branch performs convolution on the fourth feature (e.g., based on reparameterized convolution), to obtain the fifth feature, which is the first fusion result.
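
This application does not spell out the form of the reparameterized convolution. One widely used form of structural reparameterization trains parallel 3x3 and 1x1 convolution branches and merges them into a single 3x3 kernel for inference. The sketch below assumes that form, with hypothetical helper names and random weights, and checks numerically that the merged kernel gives the same output as the two branches.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Zero-padded convolution. x: (C_in, H, W); kernel: (C_out, C_in, k, k), k odd."""
    c_out, _, k, _ = kernel.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            out += np.einsum("oc,chw->ohw", kernel[:, :, i, j],
                             xp[:, i:i + h, j:j + w])
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
k3 = rng.normal(size=(6, 4, 3, 3))   # 3x3 branch (hypothetical weights)
k1 = rng.normal(size=(6, 4, 1, 1))   # 1x1 branch (hypothetical weights)

# Training-time form: two parallel branches whose outputs are summed.
two_branch = conv2d_same(x, k3) + conv2d_same(x, k1)

# Inference-time form: fold the 1x1 kernel into the center tap of the 3x3 kernel.
merged = k3.copy()
merged[:, :, 1, 1] += k1[:, :, 0, 0]
one_branch = conv2d_same(x, merged)
```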

[0159] It should be understood that since the low-dimensional information aggregation-distribution branch is used to obtain low-dimensional global information of the target image, that is, the texture features of the target image, these features are generally large in size. Therefore, when performing feature alignment, this branch tends to use the larger features as the alignment standard. That is, this branch usually uses the first feature as the alignment standard, so it usually aligns the second feature to the first feature. Of course, in some special cases (for example, the second feature is not a feature of the last level), this branch can also align the first feature to the second feature to obtain the third feature, and then concatenate and convolve the second and third features to obtain the first fusion result.

[0160] Continuing the example above, the low-dimensional information aggregation-distribution branch includes a low-dimensional alignment module, a low-dimensional fusion module, and an injection module. As shown in Figure 7 (a schematic diagram of the low-dimensional alignment module and the low-dimensional fusion module provided in the embodiments of this application), the low-dimensional alignment module can use B4 as the alignment standard and perform pooling on B3 to reduce the size of B3, obtaining a feature B3' aligned to B4, where the size of B3' is the same as the size of B4. Similarly, the low-dimensional alignment module can perform linear interpolation on B5 to enlarge the size of B5, obtaining a feature B5' aligned to B4, where the size of B5' is the same as the size of B4. Then, the low-dimensional alignment module can concatenate B3', B4, and B5' to obtain the concatenation result Fc, and send Fc to the low-dimensional fusion module. The low-dimensional fusion module then performs reparameterized convolution processing on Fc to obtain the low-dimensional fusion result Ffuse, and sends Ffuse to the injection module.
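
The alignment and fusion of B3, B4, and B5 can be sketched as follows. The feature sizes, the average pooling, the bilinear interpolation convention, concatenation along the channel axis, and the use of a plain pointwise convolution in place of the reparameterized convolution are assumptions made for this sketch.

```python
import numpy as np

def avg_pool(x, k):
    """Reduce the height and width of a (C, H, W) map by a factor of k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def resize_bilinear(x, out_h, out_w):
    """Bilinear resize of a (C, H, W) map (corner-aligned sampling)."""
    _, h, w = x.shape
    ys, xs = np.linspace(0, h - 1, out_h), np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[None, :, None], (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bottom = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def conv1x1(x, weight):
    """Pointwise convolution: mix channels independently at each position."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
B3 = rng.normal(size=(4, 32, 32))
B4 = rng.normal(size=(4, 16, 16))
B5 = rng.normal(size=(4, 8, 8))

B3_aligned = avg_pool(B3, 2)                    # B3': pooled down to the size of B4
B5_aligned = resize_bilinear(B5, 16, 16)        # B5': interpolated up to the size of B4
Fc = np.concatenate([B3_aligned, B4, B5_aligned], axis=0)
Ffuse = conv1x1(Fc, rng.normal(size=(4, 12)))   # stand-in for reparameterized convolution
```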

[0161] 504. Based on the first fusion result, the first feature and the second feature are enhanced to obtain the enhanced first feature and the enhanced second feature.

[0162] After obtaining the first fusion result, the target model can use the first fusion result to enhance the first feature and the second feature, thereby obtaining the enhanced first feature and the enhanced second feature.

[0163] Specifically, the target model can obtain the enhanced first feature and the enhanced second feature in several ways:

[0164] (1) Assume that the low-dimensional information aggregation-distribution branch only contains an injection module for the first feature. The injection module for the first feature can inject the first fusion result into the first feature to enhance the first feature, thereby obtaining the enhanced first feature. Since the low-dimensional information aggregation-distribution branch does not contain an injection module for the second feature, this branch can leave the second feature unprocessed and directly determine the second feature as the enhanced second feature.

[0165] (2) Assume that the low-dimensional information aggregation-distribution branch includes an injection module for the first feature and an injection module for the second feature. The injection module for the first feature can inject the first fusion result into the first feature to enhance the first feature, thereby obtaining the enhanced first feature. Similarly, the injection module for the second feature can inject the first fusion result into the second feature to enhance the second feature, thereby obtaining the enhanced second feature.

[0166] (3) Assume that the low-dimensional information aggregation-distribution branch only contains an injection module for the second feature. Since the low-dimensional information aggregation-distribution branch does not contain an injection module for the first feature, this branch can leave the first feature unprocessed and directly determine the first feature as the enhanced first feature. The injection module for the second feature can inject the first fusion result into the second feature to enhance the second feature, thereby obtaining the enhanced second feature.

[0167] Continuing with the example above, suppose the low-dimensional information aggregation-distribution branch includes an injection module for B4 and an injection module for B5, but no injection module for B3. Then, this branch can directly determine B3 as the enhanced first-level feature P3. The injection module for B4 can inject Ffuse into B4 to obtain the enhanced second-level feature P4. The injection module for B5 can inject Ffuse into B5 to obtain the enhanced third-level feature P5.
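
The pass-through behavior for levels without an injection module can be sketched as follows. The function name distribute, the use of scalars in place of feature maps, and the additive toy injection are hypothetical and serve only to show which levels are modified.

```python
def distribute(features, fused, injectors):
    """Enhance each level that has an injection module; pass the others through."""
    enhanced = {}
    for name, feature in features.items():
        inject = injectors.get(name)
        enhanced[name] = feature if inject is None else inject(feature, fused)
    return enhanced

# Toy scalars stand in for feature maps; the injection here is a hypothetical add.
features = {"B3": 1.0, "B4": 2.0, "B5": 3.0}
injectors = {
    "B4": lambda feature, fused: feature + fused,
    "B5": lambda feature, fused: feature + fused,
}
enhanced = distribute(features, 10.0, injectors)
```

Here B3 has no injection module, so it is returned unchanged as its own enhanced feature.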

[0168] More specifically, the injection module can have various structures; the first type of injection module will be introduced below. The first type of injection module (the ordinary type) can obtain the enhanced first feature and the enhanced second feature in the following way:

[0169] (1) After obtaining the first fusion result, the injection module for the first feature can process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature. It should be noted that since the first feature is used as the alignment standard during the acquisition of the first fusion result, the size of the first fusion result is the same as the size of the first feature. Then, the injection module can perform pointwise convolution on the first feature to obtain the sixth feature. At the same time, the injection module can also perform pointwise convolution and activation function-based processing on the first fusion result to obtain the seventh feature. Meanwhile, the injection module can also perform pointwise convolution only on the first fusion result to obtain the eighth feature (in this way, the size of the sixth feature, the size of the seventh feature, and the size of the eighth feature are the same). Then, the injection module can multiply the sixth feature and the seventh feature to obtain the ninth feature, and add the eighth feature and the ninth feature to obtain the tenth feature. Finally, the injection module performs reparameterized convolution processing on the tenth feature to obtain the eleventh feature, which is the enhanced first feature.

[0170] (2) After obtaining the first fusion result, the injection module for the second feature can process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature. It should be noted that since the first feature is used as the alignment standard during the acquisition of the first fusion result, the size of the first fusion result is the same as the size of the first feature. Then, the injection module can perform pointwise convolution on the second feature to obtain the twelfth feature. At the same time, the injection module can also perform pointwise convolution, activation function-based processing, and linear interpolation on the first fusion result to obtain the thirteenth feature. Meanwhile, the injection module can also perform pointwise convolution and linear interpolation only on the first fusion result to obtain the fourteenth feature (in this way, the size of the twelfth feature, the thirteenth feature, and the fourteenth feature are the same). Then, the injection module can multiply the twelfth and thirteenth features to obtain the fifteenth feature, and add the fourteenth and fifteenth features to obtain the sixteenth feature. Finally, the injection module performs reparameterized convolution processing on the sixteenth feature to obtain the seventeenth feature, which is the enhanced second feature.
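
The two procedures above can be written compactly as follows. The notation is introduced here for illustration only: X_1 and X_2 denote the first and second features, F the first fusion result, \sigma the activation function, \odot multiplication (assumed here to be element-wise), Interp linear interpolation to the size of X_2, and the superscripts q, k, v distinguish three pointwise convolutions with separate weights.

```latex
\begin{aligned}
X_1^{\mathrm{enh}} &= \mathrm{RepConv}\Big(\mathrm{Conv}^{q}_{1\times 1}(X_1)\odot\sigma\big(\mathrm{Conv}^{k}_{1\times 1}(F)\big)+\mathrm{Conv}^{v}_{1\times 1}(F)\Big),\\
X_2^{\mathrm{enh}} &= \mathrm{RepConv}\Big(\mathrm{Conv}^{q}_{1\times 1}(X_2)\odot\mathrm{Interp}\big(\sigma(\mathrm{Conv}^{k}_{1\times 1}(F))\big)+\mathrm{Interp}\big(\mathrm{Conv}^{v}_{1\times 1}(F)\big)\Big).
\end{aligned}
```

In the first line, the three terms inside RepConv correspond to the sixth, seventh, and eighth features; in the second line, they correspond to the twelfth, thirteenth, and fourteenth features.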

[0171] Continuing the example above, as shown in Figure 8 and Figure 9 (Figure 8 is a schematic diagram of the structure of the injection module provided in an embodiment of this application, and Figure 9 is another schematic diagram of the structure of the injection module provided in an embodiment of this application), after obtaining Ffuse, the injection module for B4 can first perform pointwise convolution (also called 1×1 convolution) on B4 to obtain feature Q4. Simultaneously, the injection module for B4 can also perform pointwise convolution and activation function-based processing (implemented through the sigmoid function) on Ffuse to obtain feature K4. At the same time, the injection module for B4 can also perform pointwise convolution on Ffuse to obtain feature V4 (Q4, K4, and V4 have the same size). Next, the injection module for B4 can multiply Q4 and K4, and then add the result to V4 to obtain feature A4. Finally, the injection module for B4 can perform reparameterized convolution processing on A4 to obtain feature P4.

[0172] After obtaining Ffuse, the injection module for B5 first performs pointwise convolution on B5 to obtain feature Q5. Simultaneously, the injection module for B5 also performs pointwise convolution, activation function-based processing, and linear interpolation on Ffuse to obtain feature K5. At the same time, the injection module for B5 also performs pointwise convolution and linear interpolation on Ffuse to obtain feature V5 (Q5, K5, and V5 have the same size). Next, the injection module for B5 multiplies Q5 and K5, then adds the result to V5 to obtain feature A5. Finally, the injection module for B5 performs reparameterized convolution on A5 to obtain feature P5.
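
A minimal sketch of the injection module for B4 is given below. The random weights, the element-wise interpretation of the multiplication, and the pointwise convolution used in place of the reparameterized convolution are assumptions. The optional resize argument is a hypothetical hook for the B5 case, where K and V must additionally be interpolated to the size of B5.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution: mix channels independently at each position."""
    return np.einsum("oc,chw->ohw", weight, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inject(feature, fused, w_q, w_k, w_v, w_out, resize=None):
    q = conv1x1(feature, w_q)              # Q: from the feature being enhanced
    k = sigmoid(conv1x1(fused, w_k))       # K: gate computed from the fusion result
    v = conv1x1(fused, w_v)                # V: content taken from the fusion result
    if resize is not None:                 # hypothetical hook for the B5 case
        k, v = resize(k), resize(v)
    a = q * k + v                          # A = Q * K + V (element-wise, assumed)
    return conv1x1(a, w_out)               # stand-in for reparameterized convolution

rng = np.random.default_rng(0)
B4 = rng.normal(size=(4, 16, 16))
Ffuse = rng.normal(size=(4, 16, 16))
w_q, w_k, w_v, w_out = (rng.normal(size=(4, 4)) for _ in range(4))
P4 = inject(B4, Ffuse, w_q, w_k, w_v, w_out)
```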

[0173] More specifically, the injection module with the second structure (the enhanced injection module) can obtain the enhanced first feature and the enhanced second feature in the following ways:

[0174] (1) After obtaining the first fusion result, the injection module for the first feature can first preprocess the first feature based on the second feature (cross-layer information fusion) to obtain the preprocessed first feature. It should be noted that the injection module for the first feature can align the second feature to the first feature (i.e., perform linear interpolation on the second feature) to obtain the eighteenth feature, and perform pointwise convolution on the first feature to obtain the nineteenth feature. Then, the injection module for the first feature can concatenate the eighteenth and nineteenth features to obtain the twentieth feature. Then, the injection module for the first feature can perform pointwise convolution on the twentieth feature to obtain the twenty-first feature, which is the preprocessed first feature.

[0175] After obtaining the preprocessed first feature, the injection module can process the first fusion result and the preprocessed first feature using a cross-attention mechanism to obtain the enhanced first feature. It should be noted that the injection module can perform pointwise convolution on the preprocessed first feature to obtain the sixth feature. Simultaneously, the injection module can also perform pointwise convolution and activation function-based processing on the first fusion result to obtain the seventh feature. Furthermore, the injection module can also perform pointwise convolution only on the first fusion result to obtain the eighth feature (in this case, the sizes of the sixth, seventh, and eighth features are the same). Then, the injection module can multiply the sixth and seventh features to obtain the ninth feature, and add the eighth and ninth features to obtain the tenth feature. Finally, the injection module performs reparameterized convolution on the tenth feature to obtain the eleventh feature, which is the enhanced first feature.

[0176] (2) After obtaining the first fusion result, the injection module for the second feature can first preprocess the second feature based on the first feature (cross-layer information fusion) to obtain the preprocessed second feature. It should be noted that the injection module for the second feature can align the first feature to the second feature (i.e., pool the first feature) to obtain the twenty-second feature, and then perform pointwise convolution on the second feature to obtain the twenty-third feature. Next, the injection module for the second feature can concatenate the twenty-second and twenty-third features to obtain the twenty-fourth feature. Then, the injection module for the second feature can perform pointwise convolution on the twenty-fourth feature to obtain the twenty-fifth feature, which is the preprocessed second feature.

[0177] After obtaining the preprocessed second feature, the injection module for the second feature can process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature. It should be noted that the injection module can perform pointwise convolution on the preprocessed second feature to obtain the twelfth feature. Simultaneously, the injection module can also perform pointwise convolution, activation function-based processing, and linear interpolation on the first fusion result to obtain the thirteenth feature. At the same time, the injection module can also perform pointwise convolution and linear interpolation only on the first fusion result to obtain the fourteenth feature (in this way, the sizes of the twelfth, thirteenth, and fourteenth features are the same). Then, the injection module can multiply the twelfth and thirteenth features to obtain the fifteenth feature, and add the fourteenth and fifteenth features to obtain the sixteenth feature. Finally, the injection module performs reparameterized convolution on the sixteenth feature to obtain the seventeenth feature, which is the enhanced second feature.

[0178] Continuing the example above, as shown in Figure 10 (another schematic diagram of the structure of the target model provided in an embodiment of this application), assume the low-dimensional information aggregation-distribution branch includes a low-dimensional alignment module, a low-dimensional fusion module, and an enhanced injection module. Then, the input to the injection module for B4 includes not only B4 and Ffuse, but also B5 and B3. The input to the injection module for B5 includes not only B5 and Ffuse, but also B4.

[0179] As shown in Figure 11 and Figure 12 (Figure 11 is a schematic diagram of the structure of the enhanced injection module provided in an embodiment of this application, and Figure 12 is a schematic diagram of the cross-layer information fusion processing provided in an embodiment of this application), after obtaining Ffuse, the injection module for B4 can first perform cross-layer information fusion processing on B3, B4, and B5. That is, the injection module for B4 first performs pointwise convolution on B4 to obtain feature C4, pools B3 to obtain feature C3, and performs linear interpolation on B5 to obtain feature C5. Then, the injection module for B4 can concatenate C3, C4, and C5 and perform pointwise convolution to obtain feature C4'.

[0180] After obtaining C4', the injection module for B4 performs pointwise convolution on C4' to obtain feature Q4. Simultaneously, the injection module for B4 also performs pointwise convolution on Ffuse and activation function-based processing to obtain feature K4. Furthermore, the injection module for B4 performs pointwise convolution on Ffuse to obtain feature V4 (Q4, K4, and V4 have the same size). Next, the injection module for B4 multiplies Q4 and K4, then adds the result to V4 to obtain feature A4. Finally, the injection module for B4 performs reparameterized convolution on A4 to obtain feature P4.

[0181] As shown in Figure 13 and Figure 14 (Figure 13 is another schematic diagram of the structure of the enhanced injection module provided in an embodiment of this application, and Figure 14 is another schematic diagram of the cross-layer information fusion processing provided in the embodiments of this application), after obtaining Ffuse, the injection module for B5 can first perform cross-layer information fusion processing on B4 and B5. That is, the injection module for B5 first performs pointwise convolution on B5 to obtain feature D5, and pools B4 to obtain feature D4. Then, the injection module for B5 can concatenate D5 and D4 and perform pointwise convolution to obtain feature D5'.
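
The cross-layer information fusion for B5 can be sketched as follows. The feature sizes (B4 assumed to be twice the height and width of B5), the average pooling, concatenation along the channel axis, and the random weights are assumptions made for this sketch.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution: mix channels independently at each position."""
    return np.einsum("oc,chw->ohw", weight, x)

def avg_pool(x, k):
    """Reduce the height and width of a (C, H, W) map by a factor of k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def cross_layer_fuse_b5(B4, B5, w_self, w_merge):
    D5 = conv1x1(B5, w_self)                   # pointwise convolution on B5
    D4 = avg_pool(B4, 2)                       # pool B4 down to the size of B5
    stacked = np.concatenate([D5, D4], axis=0)
    return conv1x1(stacked, w_merge)           # D5'

rng = np.random.default_rng(0)
B4 = rng.normal(size=(4, 16, 16))
B5 = rng.normal(size=(4, 8, 8))
D5_prime = cross_layer_fuse_b5(B4, B5,
                               rng.normal(size=(4, 4)),
                               rng.normal(size=(4, 8)))
```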

[0182] After obtaining D5', the injection module for B5 can first perform pointwise convolution on D5' to obtain feature Q5. Simultaneously, the injection module for B5 can also perform pointwise convolution, activation function-based processing, and linear interpolation on Ffuse to obtain feature K5. At the same time, the injection module for B5 can also perform pointwise convolution and linear interpolation on Ffuse to obtain feature V5 (Q5, K5, and V5 have the same size). Next, the injection module for B5 can multiply Q5 and K5, and then add the result to V5 to obtain feature A5. Finally, the injection module for B5 can perform reparameterized convolution on A5 to obtain feature P5.

[0183] 505. Based on the enhanced first feature and the enhanced second feature, obtain the position information of the object in the target image.

[0184] After obtaining the enhanced first feature and the enhanced second feature, the target model can use the enhanced first feature and the enhanced second feature to perform detection, thereby obtaining and outputting the position information of these objects in the target image, that is, the coordinates of these objects in the image coordinate system (constructed based on the target image), which is equivalent to obtaining the position of these objects in the scene.

[0185] Specifically, the target model can obtain the position information of the object in the target image in the following ways:

[0186] After obtaining the enhanced first feature and the enhanced second feature, the target detection head can process the enhanced first feature and the enhanced second feature (e.g., convolution and fully connected layers) to obtain the position information of these objects in the target image.
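
This application only states that the target detection head may use convolution and fully connected layers. The following toy head is therefore purely an assumed arrangement: each enhanced feature passes through a pointwise convolution with ReLU, is globally averaged, and the pooled vectors are concatenated and fed to one fully connected layer that outputs four numbers, interpreted here (hypothetically) as one box (x, y, w, h).

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution: mix channels independently at each position."""
    return np.einsum("oc,chw->ohw", weight, x)

def detection_head(features, conv_weights, fc_weight, fc_bias):
    pooled = []
    for feature, weight in zip(features, conv_weights):
        activated = np.maximum(conv1x1(feature, weight), 0.0)  # convolution + ReLU
        pooled.append(activated.mean(axis=(1, 2)))             # global average pooling
    vector = np.concatenate(pooled)
    return fc_weight @ vector + fc_bias                        # fully connected layer

rng = np.random.default_rng(0)
P4 = rng.normal(size=(4, 16, 16))    # hypothetical enhanced first feature
P5 = rng.normal(size=(4, 8, 8))      # hypothetical enhanced second feature
conv_weights = [rng.normal(size=(8, 4)) for _ in range(2)]
box = detection_head([P4, P5], conv_weights,
                     rng.normal(size=(4, 16)), np.zeros(4))
```

A practical detection head would typically predict positions for multiple objects; the single-box output here is a simplification.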

[0187] The above is a detailed description of the target model of the first structure. The target model of the second structure is described below. Figure 15 is another schematic diagram of the structure of the target model provided in the embodiments of this application. As shown in Figure 15, the target model includes a backbone network, a low-dimensional information aggregation-distribution branch, a high-dimensional information aggregation-distribution branch, and a target detection head. The input of the backbone network serves as the input of the entire target model. The output of the backbone network is connected to the input of the low-dimensional information aggregation-distribution branch, the output of the low-dimensional information aggregation-distribution branch is connected to the input of the high-dimensional information aggregation-distribution branch, and the output of the high-dimensional information aggregation-distribution branch is connected to the input of the target detection head. The output of the target detection head serves as the output of the entire target model. To further explain the workflow of the target model, it is described in more detail below. Figure 16 is another flowchart of the target detection method provided in the embodiments of this application. As shown in Figure 16, the method includes:

[0188] 1601. Obtain the target image, which contains the object to be detected.

[0189] 1602. Perform feature extraction on the target image to obtain the first feature, and then perform feature extraction on the first feature to obtain the second feature.

[0190] 1603. Perform a first fusion on the first feature and the second feature to obtain the first fusion result.

[0191] 1604. Based on the first fusion result, the first feature and the second feature are enhanced to obtain the enhanced first feature and the enhanced second feature.

[0192] For a description of steps 1601 to 1604, refer to the relevant descriptions of steps 501 to 504 in the embodiment shown in Figure 5; they are not repeated here.

[0193] 1605. Perform a second fusion on the enhanced first feature and the enhanced second feature to obtain the second fusion result.

[0194] After obtaining the enhanced first feature and the enhanced second feature, the target model can perform a second fusion (another feature fusion method) on the enhanced first feature and the enhanced second feature to obtain the second fusion result.

[0195] Specifically, the target model can obtain the second fusion result in the following way:

[0196] After obtaining the enhanced first feature and the enhanced second feature, the high-dimensional information aggregation-distribution branch first aligns the enhanced first feature to the enhanced second feature, thus obtaining the twenty-sixth feature. It can be understood that the size of the enhanced second feature is the same as the size of the twenty-sixth feature. Next, the high-dimensional information aggregation-distribution branch concatenates the enhanced second feature and the twenty-sixth feature to obtain the twenty-seventh feature. Then, the high-dimensional information aggregation-distribution branch performs self-attention-based processing, feedforward network-based processing, and summation on the twenty-seventh feature to obtain the twenty-eighth feature, which is the second fusion result.

[0197] It should be understood that since the high-dimensional information aggregation-distribution branch is used to obtain high-dimensional global information of the target image, that is, the structural features of the target image, these features are generally small in size. Therefore, when performing feature alignment, this branch tends to use the smaller feature as the alignment standard. That is, this branch usually uses the enhanced second feature as the alignment standard, so this branch usually aligns the enhanced first feature to the enhanced second feature.

[0198] For example, as shown in Figure 17 (another schematic diagram of the structure of the target model provided in the embodiments of this application), the target model includes a backbone network, a low-dimensional information aggregation-distribution branch, a high-dimensional information aggregation-distribution branch, and a target detection head. After the backbone network outputs B3, B4, and B5 to the low-dimensional information aggregation-distribution branch, the low-dimensional information aggregation-distribution branch can output P3, P4, and P5 to the high-dimensional information aggregation-distribution branch. The size of P3 is greater than the size of P4, which is greater than the size of P5.

[0199] The high-dimensional information aggregation-distribution branch includes a high-dimensional alignment module, a high-dimensional fusion module, and an injection module. As shown in Figure 18 (a schematic diagram of the high-dimensional alignment module and the high-dimensional fusion module provided in the embodiments of this application), the high-dimensional alignment module uses P5 as the alignment standard and pools P3 and P4 to reduce their sizes, obtaining features P3' and P4' aligned to P5, where the sizes of P3' and P4' are the same as the size of P5. Then, the high-dimensional alignment module concatenates P3', P4', and P5 to obtain the concatenation result Fu, and sends Fu to the high-dimensional fusion module. The high-dimensional fusion module then performs self-attention-based processing, feedforward network-based processing, and summation on Fu to obtain the high-dimensional fusion result F', and sends F' to the injection module.
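
The high-dimensional alignment and fusion can be sketched as follows. The text does not state how the self-attention, the feedforward network, and the summation are arranged, so this sketch assumes a transformer-style arrangement with two residual additions. The single attention head, the feature sizes, the average pooling, and the random weights are likewise assumptions.

```python
import numpy as np

def avg_pool(x, k):
    """Reduce the height and width of a (C, H, W) map by a factor of k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def feed_forward(tokens, w1, w2):
    return np.maximum(tokens @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
P3 = rng.normal(size=(4, 32, 32))
P4 = rng.normal(size=(4, 16, 16))
P5 = rng.normal(size=(4, 8, 8))

# Alignment: pool P3 and P4 down to the size of P5, then concatenate.
Fu = np.concatenate([avg_pool(P3, 4), avg_pool(P4, 2), P5], axis=0)

# Fusion: treat each spatial position as a token of dimension 12.
tokens = Fu.reshape(12, -1).T
w_q, w_k, w_v = (0.1 * rng.normal(size=(12, 12)) for _ in range(3))
tokens = tokens + self_attention(tokens, w_q, w_k, w_v)   # attention + summation
tokens = tokens + feed_forward(tokens,
                               0.1 * rng.normal(size=(12, 24)),
                               0.1 * rng.normal(size=(24, 12)))  # feedforward + summation
F_prime = tokens.T.reshape(12, 8, 8)
```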

[0200] 1606. Based on the second fusion result, enhance the enhanced first feature and the enhanced second feature to obtain the second enhanced first feature and the second enhanced second feature.

[0201] After obtaining the second fusion result, the target model can use the second fusion result to enhance the enhanced first feature and the enhanced second feature again, thereby obtaining the first feature and the second feature after two enhancements, that is, the second enhanced first feature and the second enhanced second feature.

[0202] Specifically, the target model can obtain the second enhanced first feature and the second enhanced second feature in several ways:

[0203] (1) Assume that the high-dimensional information aggregation-distribution branch only contains an injection module for the first feature. The injection module for the first feature can inject the second fusion result into the enhanced first feature to perform data enhancement on the enhanced first feature, thereby obtaining the second enhanced first feature. Since the high-dimensional information aggregation-distribution branch does not contain an injection module for the second feature, this branch can leave the enhanced second feature unprocessed and directly determine the enhanced second feature as the second enhanced second feature.

[0204] (2) Assume that the high-dimensional information aggregation-distribution branch includes an injection module for the first feature and an injection module for the second feature. The injection module for the first feature can inject the second fusion result into the enhanced first feature to perform data enhancement on the enhanced first feature, thereby obtaining the second enhanced first feature. Similarly, the injection module for the second feature can inject the second fusion result into the enhanced second feature to perform data enhancement on the enhanced second feature, thereby obtaining the second enhanced second feature.

[0205] (3) Assume that the high-dimensional information aggregation-distribution branch only contains an injection module for the second feature. Since the high-dimensional information aggregation-distribution branch does not contain an injection module for the first feature, this branch can directly determine the enhanced first feature as the second enhanced first feature without processing the enhanced first feature. The injection module for the second feature can inject the second fusion result into the enhanced second feature to perform data enhancement on the enhanced second feature, thereby obtaining the second enhanced second feature.
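The three configurations above differ only in which injection modules are present. The following sketch shows that selection logic under stated assumptions: features are flat lists of numbers, `add_inject` is a hypothetical stand-in that injects by element-wise addition (the embodiments describe cross-attention instead), and `second_enhancement` is an illustrative name not taken from this application.

```python
def second_enhancement(enh_first, enh_second, fusion,
                       inject_first=None, inject_second=None):
    """Apply whichever injection modules the branch contains.

    inject_first / inject_second are callables (feature, fusion) -> feature,
    or None when the branch has no injection module for that feature.
    """
    if inject_first is None and inject_second is None:
        raise ValueError("at least one injection module is expected")
    # A missing module means the feature is passed through unchanged.
    out_first = inject_first(enh_first, fusion) if inject_first else enh_first
    out_second = inject_second(enh_second, fusion) if inject_second else enh_second
    return out_first, out_second

def add_inject(feature, fusion):
    """Stand-in injection: element-wise addition (assumption, not cross-attention)."""
    return [f + g for f, g in zip(feature, fusion)]

enh_first, enh_second, fusion = [1.0, 2.0], [3.0, 4.0], [10.0, 10.0]
case1 = second_enhancement(enh_first, enh_second, fusion, inject_first=add_inject)
case2 = second_enhancement(enh_first, enh_second, fusion, add_inject, add_inject)
case3 = second_enhancement(enh_first, enh_second, fusion, inject_second=add_inject)
```

Case 1 leaves the enhanced second feature unchanged, case 3 leaves the enhanced first feature unchanged, and case 2 injects into both, which costs more computation in exchange for using the fusion result at both levels.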

[0206] More specifically, the injection module can have various structures. The injection module with the first structure (the ordinary injection module) can obtain the second enhanced first feature and the second enhanced second feature in the following way:

[0207] (1) After obtaining the second fusion result, the injection module for the first feature can process the second fusion result and the enhanced first feature based on the cross-attention mechanism to obtain the second enhanced first feature.

[0208] (2) After obtaining the second fusion result, the injection module for the second feature can process the second fusion result and the enhanced second feature based on the cross-attention mechanism to obtain the second enhanced second feature.
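As an illustration of injection based on a cross-attention mechanism, the sketch below treats the feature to be enhanced as the queries and the fusion result as the keys and values. This is a simplified sketch under assumptions: there are no learned query, key, or value projections, the attended result is added back to the feature as a residual, and the name `cross_attention_inject` is hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def cross_attention_inject(feature_tokens, fusion_tokens):
    """Inject the fusion result into a feature with scaled dot-product cross-attention.

    feature_tokens: N tokens of C channels (queries, local information).
    fusion_tokens:  M tokens of C channels (keys and values, global information).
    """
    c = len(feature_tokens[0])
    out = []
    for q in feature_tokens:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(c)
                           for k in fusion_tokens])
        attended = [sum(w * v[d] for w, v in zip(weights, fusion_tokens))
                    for d in range(c)]
        # Residual addition keeps the original local information (assumption).
        out.append([x + a for x, a in zip(q, attended)])
    return out

feature = [[0.0, 2.0], [3.0, 1.0]]
fusion_result = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
injected = cross_attention_inject(feature, fusion_result)
```

The output has the same number of tokens and channels as the input feature, so the injected feature can replace it directly in the rest of the model.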

[0209] The injection module with the second structure (the enhanced injection module) can obtain the second enhanced first feature and the second enhanced second feature in the following way:

[0210] (1) After obtaining the second fusion result, the injection module for the first feature can first preprocess the enhanced first feature based on the enhanced second feature (cross-layer information fusion) to obtain the preprocessed and enhanced first feature. After obtaining the preprocessed and enhanced first feature, the injection module for the first feature can process the second fusion result and the preprocessed and enhanced first feature based on the cross-attention mechanism to obtain the second enhanced first feature.

[0211] (2) After obtaining the second fusion result, the injection module for the second feature can first preprocess the enhanced second feature based on the enhanced first feature (cross-layer information fusion) to obtain the preprocessed and enhanced second feature. After obtaining the preprocessed and enhanced second feature, the injection module for the second feature can process the second fusion result and the preprocessed and enhanced second feature based on the cross-attention mechanism to obtain the second enhanced second feature.
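The enhanced injection module can be sketched as a preprocessing step followed by cross-attention. The sketch below assumes that the cross-layer preprocessing consists of alignment, concatenation, and convolution, that alignment is nearest-neighbour resampling, and that the convolution is a 1x1 convolution with fixed averaging weights standing in for learned weights. The cross-attention step omits learned projections and adds a residual. All names are hypothetical.

```python
import math

def align_nearest(tokens, n):
    """Resample a token sequence to length n by nearest-neighbour indexing."""
    m = len(tokens)
    return [tokens[i * m // n] for i in range(n)]

def preprocess(target, neighbour):
    """Cross-layer fusion: align the neighbour, concatenate, then apply a 1x1 convolution."""
    aligned = align_nearest(neighbour, len(target))
    concat = [t + a for t, a in zip(target, aligned)]   # N tokens of 2C channels
    c = len(target[0])
    # Fixed weights averaging both halves stand in for a learned 1x1 convolution.
    return [[0.5 * row[d] + 0.5 * row[c + d] for d in range(c)] for row in concat]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(queries, context):
    """Scaled dot-product cross-attention with a residual addition (assumption)."""
    c = len(queries[0])
    out = []
    for q in queries:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(c) for k in context])
        attended = [sum(wi * v[d] for wi, v in zip(w, context)) for d in range(c)]
        out.append([x + a for x, a in zip(q, attended)])
    return out

def enhanced_inject(target, neighbour, fusion):
    """Preprocess the target with the adjacent-level feature, then inject the fusion result."""
    return cross_attention(preprocess(target, neighbour), fusion)

enh_first = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]   # larger feature
enh_second = [[2.0, 2.0], [4.0, 4.0]]                          # smaller feature
second_fusion = [[1.0, 1.0]]
pre_first = preprocess(enh_first, enh_second)
second_enh_first = enhanced_inject(enh_first, enh_second, second_fusion)
```

Compared with the ordinary injection module, the only change is the call to `preprocess`, which lets the adjacent-level feature influence the queries before the fusion result is injected.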

[0212] For an explanation of step 1606, please refer to the relevant explanation section of step 1604; it will not be repeated here.

[0213] 1607. Based on the second enhanced first feature and the second enhanced second feature, obtain the position information of the object in the target image.

[0214] After obtaining the second enhanced first feature and the second enhanced second feature, the target model can use them for detection, thereby obtaining and outputting the position information of the objects in the target image, that is, the coordinates of the objects in the image coordinate system (constructed based on the target image), which is equivalent to obtaining the positions of the objects in the scene.

[0215] Specifically, the target model can obtain the position information of the object in the target image in the following ways:

[0216] After obtaining the second enhanced first feature and the second enhanced second feature, the target detection head can process them (e.g., through convolution and fully connected layers) to obtain the position information of the objects in the target image.
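A detection head built from convolution and a fully connected layer could look like the following sketch. It is illustrative only: the single-channel feature maps, the shared kernel, the fixed weights, and the four-value output (read here as one box) are all assumptions, and the names `conv2d_valid`, `fully_connected`, and `detection_head` are hypothetical.

```python
def conv2d_valid(fmap, kernel):
    """2D convolution (cross-correlation) without padding on a single-channel map."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(fmap[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(fmap[0]) - kw + 1)]
            for i in range(len(fmap) - kh + 1)]

def fully_connected(vec, weights, bias):
    """One output per weight row: dot product with the input plus a bias."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def detection_head(feat_a, feat_b, kernel, weights, bias):
    """Convolve both features, flatten them together, and map to position values."""
    flat = [v for f in (feat_a, feat_b)
            for row in conv2d_valid(f, kernel) for v in row]
    return fully_connected(flat, weights, bias)

feat_a = [[1.0] * 3 for _ in range(3)]          # larger feature map
feat_b = [[2.0] * 2 for _ in range(2)]          # smaller feature map
kernel = [[0.25, 0.25], [0.25, 0.25]]           # fixed averaging kernel (assumption)
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 1.0],
           [1.0, 1.0, 1.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 0.0, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]
position = detection_head(feat_a, feat_b, kernel, weights, bias)
```

In a trained model the kernel, weights, and bias would be learned parameters rather than the constants used here.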

[0217] Furthermore, the target model provided in the embodiments of this application (e.g., GD-YOLO in Figure 19) can be compared with models in the related art (e.g., the YOLO models in Figure 19). The comparison results are shown in Figure 19 (a schematic diagram of the comparison results provided in the embodiments of this application). As the table in Figure 19 shows, the performance of the target model provided in the embodiments of this application is superior to that of the models in the related art.

[0218] In this embodiment, when object detection is required, a target image containing the object to be detected can be acquired and input into a target model. The target model then extracts features from the target image to obtain a first feature, and further extracts features from the first feature to obtain a second feature. The target model then performs a first fusion of the first and second features to obtain a first fusion result. Subsequently, based on the first fusion result, the target model enhances the first and second features to obtain enhanced first and second features. Finally, the target model uses the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the object detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0219] Furthermore, in this embodiment, the target model can fuse the enhanced first feature and the enhanced second feature to obtain a second fusion result. Based on the second fusion result, the enhanced first feature and the enhanced second feature are further enhanced to obtain a second enhanced first feature and a second enhanced second feature, and the position information of the object in the target image is then obtained based on these two features. Since the first feature and the second feature represent different local information of the target image, the first fusion result represents low-dimensional global information of the target image, and the second fusion result represents high-dimensional global information of the target image, the target model considers more comprehensive factors during target detection. Therefore, the final object position information output by the target model can have higher accuracy and more accurately complete target detection.

[0220] Furthermore, in this embodiment, the target model includes a low-dimensional information aggregation-distribution branch and a high-dimensional information aggregation-distribution branch, each of which includes an injection module for the first feature and/or an injection module for the second feature. Because the number of injection modules is selectable, injection modules can be chosen flexibly to balance the accuracy and speed of the target model in target detection.

[0221] Furthermore, in this embodiment, the injection module has various structures. A standard injection module can inject feature fusion results into features at different levels, improving the model's utilization of both global and local information, thereby enhancing the performance of the target model. An enhanced injection module can not only inject feature fusion results into features at different levels, but also fuse features from adjacent layers with features from the current layer, enhancing the flow and fusion of cross-layer information, which is beneficial for further improving the performance of the target model.

[0222] The above is a detailed description of the target detection method provided in the embodiments of this application. The model training method provided in the embodiments of this application will be introduced below. Figure 20 is a schematic flowchart of the model training method provided in the embodiments of this application. As shown in Figure 20, the method includes:

[0223] 2001. Obtain training images containing the objects to be detected.

[0224] In this embodiment, when the model to be trained needs to be trained, a batch of training data can first be obtained. This batch of training data includes training images, and the training images contain the objects to be detected. It should be noted that the real position information of the objects to be detected in the training images is known.

[0225] 2002. The training image is processed by the model to be trained to obtain the position information of the object in the training image. The model to be trained is used to: extract features from the training image to obtain a first feature, and extract features from the first feature to obtain a second feature; perform a first fusion on the first feature and the second feature to obtain a first fusion result; enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and obtain position information based on the enhanced first feature and the enhanced second feature.

[0226] After obtaining the training images, they can be input into the model to be trained. The model can first extract features from the training images to obtain a first feature, and then extract further features from the first feature to obtain a second feature. Next, the model can perform a first fusion of the first and second features to obtain a first fusion result. Then, based on the first fusion result, the model can enhance the first and second features to obtain enhanced first and second features. Finally, based on the enhanced first and second features, the model can obtain the (predicted) location information of the object in the training image.

[0227] In one possible implementation, the model to be trained is used to: inject a first fusion result into a first feature to obtain an enhanced first feature, and determine a second feature as the enhanced second feature; or, inject a first fusion result into a first feature to obtain an enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature; or, determine a first feature as the enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature.

[0228] In one possible implementation, the model to be trained is used to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0229] In one possible implementation, the model to be trained is used to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0230] In one possible implementation, the model to be trained is further used to preprocess the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain an enhanced first feature.

[0231] In one possible implementation, the model to be trained is further used to preprocess the second feature based on the first feature to obtain the preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0232] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution.

[0233] In one possible implementation, the model to be trained is further used to perform a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; the model to be trained is used to: enhance the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtain the position information of the object in the training image based on the second enhanced first feature and the second enhanced second feature.

[0234] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, processing based on a self-attention mechanism, processing based on a feedforward network, or addition.

[0235] For a description of step 2002, please refer to the relevant descriptions of the embodiments shown in Figure 5 and Figure 16 above; it will not be repeated here.

[0236] 2003. Based on the position information and the real position information of the object in the training image, train the model to be trained to obtain the target model.

[0237] After the predicted position information of the object in the training image is obtained, since the real position information of the object in the training image is known, a preset loss function can be applied to the predicted position information and the real position information to compute the target loss, which indicates the difference between the two. After obtaining the target loss, the parameters of the model to be trained can be updated based on the target loss to obtain an updated model. The updated model is then trained with the next batch of training data until the model training condition is met (e.g., the target loss converges), thereby obtaining the target model in the embodiment shown in Figure 5 or Figure 16.
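The training procedure of steps 2001 to 2003 can be sketched as a batch-wise loop that stops when the target loss converges. The sketch replaces the model to be trained with a toy two-parameter linear predictor and uses mean squared error as the loss; both are assumptions, since this application does not fix the loss function. The names `predict`, `target_loss`, and `train` are hypothetical.

```python
def predict(params, feature):
    """Toy stand-in for the model to be trained: one position value per input."""
    w, b = params
    return w * feature + b

def target_loss(predicted, real):
    """Mean squared error between predicted and real positions (assumed loss)."""
    return sum((p - r) ** 2 for p, r in zip(predicted, real)) / len(predicted)

def train(batches, lr=0.1, tol=1e-10, max_epochs=5000):
    params = [0.0, 0.0]
    previous = float("inf")
    total = previous
    for _ in range(max_epochs):
        total = 0.0
        for features, real in batches:
            predicted = [predict(params, x) for x in features]
            total += target_loss(predicted, real)
            n = len(features)
            # Gradients of the mean squared error with respect to w and b.
            grad_w = sum(2 * (p - r) * x for p, r, x in zip(predicted, real, features)) / n
            grad_b = sum(2 * (p - r) for p, r in zip(predicted, real)) / n
            params[0] -= lr * grad_w
            params[1] -= lr * grad_b
        # Training condition: the summed target loss has stopped changing.
        if abs(previous - total) < tol:
            break
        previous = total
    return params, total

# Two batches whose real positions follow position = 2 * feature + 1.
batches = [([0.0, 0.5], [1.0, 2.0]), ([1.0, 0.25], [3.0, 1.5])]
trained_params, final_loss = train(batches)
```

After training, `trained_params` should be close to the slope and offset used to generate the toy data, which corresponds to the predicted positions matching the real positions.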

[0238] The target model trained in this embodiment has object detection capabilities. When object detection is required, a target image containing the object to be detected can be acquired and input into the target model. Next, the target model can extract features from the target image to obtain a first feature, and further extract features from the first feature to obtain a second feature. Then, the target model can perform a first fusion of the first and second features to obtain a first fusion result. Subsequently, the target model can enhance the first and second features based on the first fusion result to obtain enhanced first and second features. Finally, the target model can use the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the object detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0239] The above is a detailed description of the target detection method and model training method provided in the embodiments of this application. The target detection device and model training device provided in the embodiments of this application will be described below. Figure 21 is a schematic diagram of the target detection device provided in the embodiments of this application. As shown in Figure 21, the device includes:

[0240] The acquisition module 2101 is used to acquire a target image, which contains the object to be detected;

[0241] The extraction module 2102 is used to extract features from the target image to obtain a first feature, and to extract features from the first feature to obtain a second feature;

[0242] The fusion module 2103 is used to perform a first fusion on the first feature and the second feature to obtain a first fusion result;

[0243] Enhancement module 2104 is used to enhance the first feature and the second feature based on the first fusion result to obtain the enhanced first feature and the enhanced second feature;

[0244] The detection module 2105 is used to obtain the position information of the object in the target image based on the enhanced first feature and the enhanced second feature.

[0245] In this embodiment, when object detection is required, a target image containing the object to be detected can be acquired and input into a target model. The target model then extracts features from the target image to obtain a first feature, and further extracts features from the first feature to obtain a second feature. The target model then performs a first fusion of the first and second features to obtain a first fusion result. Subsequently, based on the first fusion result, the target model enhances the first and second features to obtain enhanced first and second features. Finally, the target model uses the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the object detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0246] In one possible implementation, the enhancement module 2104 is configured to: inject the first fusion result into the first feature to obtain the enhanced first feature, and determine the second feature as the enhanced second feature; or, inject the first fusion result into the first feature to obtain the enhanced first feature, and inject the first fusion result into the second feature to obtain the enhanced second feature; or, determine the first feature as the enhanced first feature, and inject the first fusion result into the second feature to obtain the enhanced second feature.

[0247] In one possible implementation, enhancement module 2104 is used to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0248] In one possible implementation, enhancement module 2104 is used to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0249] In one possible implementation, the device further includes: a first preprocessing module for preprocessing the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the enhancement module 2104 is used to process the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0250] In one possible implementation, the device further includes: a second preprocessing module for preprocessing the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the enhancement module 2104 is used to process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0251] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution.

[0252] In one possible implementation, the device further includes: a second fusion module, used to perform a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; and a detection module 2105, used to: enhance the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtain the position information of the object in the target image based on the second enhanced first feature and the second enhanced second feature.

[0253] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, processing based on a self-attention mechanism, processing based on a feedforward network, or addition.

[0254] Figure 22 is a schematic diagram of the model training device provided in the embodiments of this application. As shown in Figure 22, the device includes:

[0255] The acquisition module 2201 is used to acquire training images, which contain objects to be detected;

[0256] The processing module 2202 is used to process the training image using the model to be trained to obtain the position information of the object in the training image. The model to be trained is used to: extract features from the training image to obtain a first feature, and extract features from the first feature to obtain a second feature; perform a first fusion on the first feature and the second feature to obtain a first fusion result; enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and obtain position information based on the enhanced first feature and the enhanced second feature.

[0257] The training module 2203 is used to train the model to be trained based on the location information and the real location information of the object in the training image to obtain the target model.

[0258] The target model trained in this embodiment has object detection capabilities. When object detection is required, a target image containing the object to be detected can be acquired and input into the target model. Next, the target model can extract features from the target image to obtain a first feature, and further extract features from the first feature to obtain a second feature. Then, the target model can perform a first fusion of the first and second features to obtain a first fusion result. Subsequently, the target model can enhance the first and second features based on the first fusion result to obtain enhanced first and second features. Finally, the target model can use the enhanced first and second features for detection to obtain the object's position information in the target image. This completes the object detection process. In the aforementioned process, the target model obtains the object's position information in the target image based on the enhanced first feature and the enhanced second feature. The enhanced first feature is obtained based on the first feature and the first fusion result, and the enhanced second feature is obtained based on the second feature and the first fusion result. The first feature and the second feature represent different local information of the target image, and the first fusion result represents the low-dimensional global information of the target image. Therefore, the target model considers a relatively comprehensive range of factors during the target detection process, and the final object position information output by the target model has sufficient accuracy to accurately complete the target detection.

[0259] In one possible implementation, the model to be trained is used to: inject a first fusion result into a first feature to obtain an enhanced first feature, and determine a second feature as the enhanced second feature; or, inject a first fusion result into a first feature to obtain an enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature; or, determine a first feature as the enhanced first feature, and inject the first fusion result into a second feature to obtain an enhanced second feature.

[0260] In one possible implementation, the model to be trained is used to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

[0261] In one possible implementation, the model to be trained is used to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0262] In one possible implementation, the model to be trained is further used to preprocess the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain an enhanced first feature.

[0263] In one possible implementation, the model to be trained is further used to preprocess the second feature based on the first feature to obtain the preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; the model to be trained is used to process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature.

[0264] In one possible implementation, the first fusion includes at least one of the following: alignment, concatenation, or convolution.

[0265] In one possible implementation, the model to be trained is further used to perform a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; the model to be trained is used to: enhance the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a second enhanced first feature and a second enhanced second feature; and obtain the position information of the object in the training image based on the second enhanced first feature and the second enhanced second feature.

[0266] In one possible implementation, the second fusion includes at least one of the following: alignment, concatenation, processing based on a self-attention mechanism, processing based on a feedforward network, or addition.

[0267] It should be noted that the information interaction and execution process between the modules / units of the above-mentioned device are based on the same concept as the method embodiment of this application, and the resulting technical effects are the same as those of the method embodiment of this application. For details, please refer to the description in the method embodiment shown above in the embodiment of this application, and it will not be repeated here.

[0268] This application also relates to an execution device. Figure 23 is a schematic diagram of the execution device provided in an embodiment of this application. As shown in Figure 23, the execution device 2300 may specifically be a mobile phone, tablet, laptop, smart wearable device, server, etc., which is not limited here. The target detection device described in the embodiment corresponding to Figure 21 may be deployed on the execution device 2300 to implement the target detection function in the embodiment corresponding to Figure 5 or Figure 16. Specifically, the execution device 2300 includes: a receiver 2301, a transmitter 2302, a processor 2303, and a memory 2304 (the execution device 2300 may have one or more processors 2303; Figure 23 takes one processor as an example), where the processor 2303 may include an application processor 23031 and a communication processor 23032. In some embodiments of this application, the receiver 2301, transmitter 2302, processor 2303, and memory 2304 may be connected via a bus or other means.

[0269] Memory 2304 may include read-only memory and random access memory, and provides instructions and data to processor 2303. A portion of memory 2304 may also include non-volatile random access memory (NVRAM). Memory 2304 stores processor and operation instructions, executable modules, or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

[0270] The processor 2303 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses in the diagram are referred to as the bus system.

[0271] The methods disclosed in the embodiments of this application can be applied to or implemented by the processor 2303. The processor 2303 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 2303 or by instructions in software form. The processor 2303 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor 2303 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can reside in a mature storage medium in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 2304. Processor 2303 reads the information in memory 2304 and, in conjunction with its hardware, completes the steps of the above method.

[0272] Receiver 2301 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 2302 can be used to output digital or character information through the first interface; transmitter 2302 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 2302 may also include a display device such as a display screen.

[0273] In one embodiment of this application, the processor 2303 is used to complete target detection by means of the target model in the embodiment corresponding to Figure 5 or Figure 16.

[0274] This application also relates to a training device. Figure 24 is a schematic diagram of the structure of a training device provided in an embodiment of this application. As shown in Figure 24, the training device 2400 is implemented by one or more servers. The training device 2400 can vary significantly depending on configuration or performance. It may include one or more central processing units (CPUs) 2424 (e.g., one or more processors), memory 2432, and one or more storage media 2430 (e.g., one or more mass storage devices) for storing application programs 2442 or data 2444. The memory 2432 and storage media 2430 can be temporary or persistent storage. The program stored in the storage media 2430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the training device. Furthermore, the CPU 2424 may be configured to communicate with the storage media 2430 and to execute, on the training device 2400, the series of instruction operations in the storage media 2430.

[0275] The training device 2400 may also include one or more power supplies 2426, one or more wired or wireless network interfaces 2450, one or more input/output interfaces 2458, and/or one or more operating systems 2441, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.

[0276] Specifically, the training device can perform the model training method in the embodiment corresponding to Figure 20 to obtain the target model.

[0277] This application also relates to a computer storage medium storing a program for signal processing, which, when run on a computer, causes the computer to perform steps as performed by the aforementioned execution device, or causes the computer to perform steps as performed by the aforementioned training device.

[0278] This application also relates to a computer program product that stores instructions that, when executed by a computer, cause the computer to perform steps as performed by the aforementioned execution device, or to perform steps as performed by the aforementioned training device.

[0279] The execution device, training device, or terminal device provided in the embodiments of this application can specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in the storage unit to cause the chip within the execution device to execute the data processing method described in the above embodiments, or to cause the chip within the training device to execute the data processing method described in the above embodiments. Optionally, the storage unit can be a storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).

[0280] For details, refer to Figure 25, which is a schematic diagram of the chip provided in an embodiment of this application. The chip can be represented as a neural network processing unit (NPU) 2500. The NPU 2500 is attached to the host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is the arithmetic circuit 2503, which is controlled by the controller 2504 to retrieve matrix data from memory and perform multiplication operations.

[0281] In some implementations, the arithmetic circuit 2503 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 2503 is a two-dimensional systolic array. The arithmetic circuit 2503 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2503 is a general-purpose matrix processor.

[0282] For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 2502 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 2501 and performs matrix operations with matrix B. The partial or final result of the resulting matrix is stored in the accumulator 2508.
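The data flow in this paragraph can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the NPU's actual microarchitecture: the function name `matmul_with_accumulator`, the tile size `tile_k`, and the nested-list representation are hypothetical. Only the roles come from the text: matrix B is held as the cached weights, matrix A is streamed in, and partial results are summed in an accumulator until the final result is complete.

```python
from typing import List

Matrix = List[List[float]]


def matmul_with_accumulator(a: Matrix, b: Matrix, tile_k: int = 2) -> Matrix:
    """Compute C = A x B by accumulating partial products over tiles of the shared dimension.

    Hypothetical software model: `b` plays the role of the weights cached in the
    PEs, `a` is streamed in, and `acc` plays the role of the accumulator.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B must match")

    acc = [[0.0] * cols for _ in range(rows)]  # accumulator starts at zero
    for k0 in range(0, inner, tile_k):
        k1 = min(k0 + tile_k, inner)
        # Each pass over a tile adds one partial result to the accumulator.
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc  # after the last tile, the accumulator holds the final result


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
c = matmul_with_accumulator(a, b)
```

With `tile_k=2` the shared dimension of length 3 is processed in two passes, so the accumulator holds a partial result after the first pass and the final result after the second.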

[0283] Unified memory 2506 is used to store input and output data. Weight data is directly transferred to weight memory 2502 via Direct Memory Access Controller (DMAC) 2505. Input data is also transferred to unified memory 2506 via DMAC.

[0284] The Bus Interface Unit (BIU) 2513 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 2509.

[0285] The BIU 2513 is used by the instruction fetch buffer 2509 to fetch instructions from external memory, and also by the DMAC 2505 to fetch the original data of the input matrix A or the weight matrix B from external memory.

[0286] The DMAC is mainly used to move input data from the external DDR memory to unified memory 2506, to move weight data to weight memory 2502, or to move input data to input memory 2501.
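The routing described in the preceding paragraphs can be pictured with a small software model. This is only a sketch under stated assumptions: the dictionaries standing in for the memories, the buffer names, and the helper `dma_copy` are hypothetical and do not describe the real DMAC interface.

```python
from typing import Dict, List

# Hypothetical software stand-ins for the memories named in the text.
ddr: Dict[str, List[float]] = {
    "input_a": [1.0, 2.0, 3.0],
    "weight_b": [0.5, 0.25],
}
unified_memory: Dict[str, List[float]] = {}
weight_memory: Dict[str, List[float]] = {}
input_memory: Dict[str, List[float]] = {}

ON_CHIP = {
    "unified": unified_memory,
    "weight": weight_memory,
    "input": input_memory,
}


def dma_copy(name: str, destination: str) -> None:
    """Copy one named buffer from external DDR into the chosen on-chip memory."""
    if destination not in ON_CHIP:
        raise ValueError(f"unknown destination: {destination}")
    ON_CHIP[destination][name] = list(ddr[name])  # copy; DDR contents stay unchanged


dma_copy("input_a", "unified")   # input data -> unified memory
dma_copy("weight_b", "weight")   # weight data -> weight memory
dma_copy("input_a", "input")     # input data -> input memory
```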

[0287] The vector computation unit 2507 includes multiple processing units that, when needed, further process the output of the arithmetic circuit 2503 with operations such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparisons. It is mainly used for computation in neural network layers other than convolutional and fully connected layers, such as batch normalization, pixel-level summation, and upsampling of the predicted label plane.

[0288] In some implementations, the vector computation unit 2507 can store the processed output vector in unified memory 2506. For example, the vector computation unit 2507 can apply a linear or nonlinear function to the output of the arithmetic circuit 2503, for instance linearly interpolating the predicted label plane extracted from a convolutional layer, or accumulating a vector of values to generate activation values. In some implementations, the vector computation unit 2507 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2503, for example, for use in subsequent layers of the neural network.
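The kinds of post-processing attributed to the vector computation unit can be illustrated as follows. This is a hedged sketch, not the unit's actual instruction set: the function names are hypothetical, the normalization is a simplified stand-in for batch normalization (no learned scale or shift), and ReLU is chosen here only as one example of a nonlinear function, which the text does not name.

```python
import math
from typing import List


def normalize(values: List[float], eps: float = 1e-5) -> List[float]:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def elementwise_sum(x: List[float], y: List[float]) -> List[float]:
    """Pixel-level summation of two equally sized vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return [p + q for p, q in zip(x, y)]


def relu(values: List[float]) -> List[float]:
    """Example nonlinear function (an assumption) producing activation values."""
    return [max(0.0, v) for v in values]


matrix_output = [4.0, -5.0, 10.0, -11.0]  # stand-in for output of circuit 2503
normalized = normalize(matrix_output)
activations = relu(elementwise_sum(normalized, [1.0, 1.0, 1.0, 1.0]))
```

The resulting `activations` vector is the kind of value that could be written back to unified memory or fed to the next layer as activation input.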

[0289] The instruction fetch buffer 2509 connected to the controller 2504 is used to store the instructions used by the controller 2504.

[0290] Unified memory 2506, input memory 2501, weight memory 2502, and instruction fetch buffer 2509 are all on-chip memories. External memory is proprietary to this NPU hardware architecture.

[0291] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.

[0292] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0293] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0294] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0295] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can store or a data storage device such as a training device or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims

1. A target detection method, characterized in that the method is implemented through a target model and includes: acquiring a target image, wherein the target image contains an object to be detected; performing feature extraction on the target image to obtain a first feature, and performing feature extraction on the first feature to obtain a second feature; performing a first fusion on the first feature and the second feature to obtain a first fusion result, wherein performing the first fusion includes: aligning the second feature to the first feature using the first feature as the alignment standard to obtain an aligned second feature; concatenating the first feature and the aligned second feature to obtain a concatenated feature; and performing convolution processing on the concatenated feature to obtain the first fusion result; enhancing the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and obtaining position information of the object in the target image based on the enhanced first feature and the enhanced second feature.

2. The method of claim 1, wherein enhancing the first feature and the second feature based on the first fusion result to obtain the enhanced first feature and the enhanced second feature includes: injecting the first fusion result into the first feature to obtain the enhanced first feature, and determining the second feature as the enhanced second feature; or injecting the first fusion result into the first feature to obtain the enhanced first feature, and injecting the first fusion result into the second feature to obtain the enhanced second feature; or determining the first feature as the enhanced first feature, and injecting the first fusion result into the second feature to obtain the enhanced second feature.

3. The method of claim 2, wherein injecting the first fusion result into the first feature to obtain the enhanced first feature includes: processing the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

4. The method according to claim 2 or 3, characterized in that injecting the first fusion result into the second feature to obtain the enhanced second feature includes: processing the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

5. The method of claim 3, wherein the method further includes: preprocessing the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and processing the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature includes: processing the first fusion result and the preprocessed first feature based on the cross-attention mechanism to obtain the enhanced first feature.

6. The method of claim 4, wherein the method further includes: preprocessing the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and processing the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature includes: processing the first fusion result and the preprocessed second feature based on the cross-attention mechanism to obtain the enhanced second feature.

7. The method according to any one of claims 1 to 6, characterized in that the first fusion includes at least one of the following: alignment, concatenation, or convolution.

8. The method according to any one of claims 1 to 7, characterized in that the method further includes: performing a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; and obtaining the position information of the object in the target image based on the enhanced first feature and the enhanced second feature includes: further enhancing the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a twice-enhanced first feature and a twice-enhanced second feature; and obtaining the position information of the object in the target image based on the twice-enhanced first feature and the twice-enhanced second feature.

9. The method of claim 8, wherein the second fusion includes at least one of the following: alignment, concatenation, processing based on a self-attention mechanism, processing based on a feedforward network, or addition.

10. A model training method, characterized in that the method includes: acquiring a training image, wherein the training image contains an object to be detected; processing the training image with a model to be trained to obtain position information of the object in the training image, wherein the model to be trained is used to: extract features from the training image to obtain a first feature, and extract features from the first feature to obtain a second feature; perform a first fusion on the first feature and the second feature to obtain a first fusion result, wherein performing the first fusion includes: aligning the second feature to the first feature using the first feature as the alignment standard to obtain an aligned second feature; concatenating the first feature and the aligned second feature to obtain a concatenated feature; and performing convolution processing on the concatenated feature to obtain the first fusion result; enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and obtain the position information based on the enhanced first feature and the enhanced second feature; and training the model to be trained based on the position information and the actual position information of the object in the training image to obtain a target model.

11. The method of claim 10, wherein the model to be trained is used to: inject the first fusion result into the first feature to obtain the enhanced first feature, and determine the second feature as the enhanced second feature; or inject the first fusion result into the first feature to obtain the enhanced first feature, and inject the first fusion result into the second feature to obtain the enhanced second feature; or determine the first feature as the enhanced first feature, and inject the first fusion result into the second feature to obtain the enhanced second feature.

12. The method of claim 11, wherein the model to be trained is used to process the first fusion result and the first feature based on a cross-attention mechanism to obtain the enhanced first feature.

13. The method according to claim 11 or 12, characterized in that the model to be trained is used to process the first fusion result and the second feature based on a cross-attention mechanism to obtain the enhanced second feature.

14. The method of claim 12, wherein the model to be trained is further used to preprocess the first feature based on the second feature to obtain a preprocessed first feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and the model to be trained is used to process the first fusion result and the preprocessed first feature based on a cross-attention mechanism to obtain the enhanced first feature.

15. The method of claim 13, wherein the model to be trained is further used to preprocess the second feature based on the first feature to obtain a preprocessed second feature, wherein the preprocessing includes at least one of the following: alignment, concatenation, or convolution; and the model to be trained is used to process the first fusion result and the preprocessed second feature based on a cross-attention mechanism to obtain the enhanced second feature.

16. The method according to any one of claims 10 to 15, characterized in that the first fusion includes at least one of the following: alignment, concatenation, or convolution.

17. The method according to any one of claims 10 to 16, characterized in that the model to be trained is further used to perform a second fusion on the enhanced first feature and the enhanced second feature to obtain a second fusion result; and the model to be trained is used to: further enhance the enhanced first feature and the enhanced second feature based on the second fusion result to obtain a twice-enhanced first feature and a twice-enhanced second feature; and obtain the position information of the object in the training image based on the twice-enhanced first feature and the twice-enhanced second feature.

18. The method of claim 17, wherein the second fusion includes at least one of the following: alignment, concatenation, processing based on a self-attention mechanism, processing based on a feedforward network, or addition.

19. A target detection apparatus, characterized in that the apparatus includes a target model and comprises: an acquisition module, used to acquire a target image, wherein the target image contains an object to be detected; an extraction module, used to extract features from the target image to obtain a first feature, and to extract features from the first feature to obtain a second feature; a fusion module, used to perform a first fusion on the first feature and the second feature to obtain a first fusion result, wherein the fusion module is specifically used to: align the second feature to the first feature using the first feature as the alignment standard to obtain an aligned second feature; concatenate the first feature and the aligned second feature to obtain a concatenated feature; and perform convolution processing on the concatenated feature to obtain the first fusion result; an enhancement module, used to enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and a detection module, used to obtain position information of the object in the target image based on the enhanced first feature and the enhanced second feature.

20. A model training apparatus, characterized in that the apparatus comprises: an acquisition module, used to acquire a training image, wherein the training image contains an object to be detected; a processing module, used to process the training image with a model to be trained to obtain position information of the object in the training image, wherein the model to be trained is used to: extract features from the training image to obtain a first feature, and extract features from the first feature to obtain a second feature; perform a first fusion on the first feature and the second feature to obtain a first fusion result, wherein performing the first fusion includes: aligning the second feature to the first feature using the first feature as the alignment standard to obtain an aligned second feature; concatenating the first feature and the aligned second feature to obtain a concatenated feature; and performing convolution processing on the concatenated feature to obtain the first fusion result; enhance the first feature and the second feature based on the first fusion result to obtain an enhanced first feature and an enhanced second feature; and obtain the position information based on the enhanced first feature and the enhanced second feature; and a training module, used to train the model to be trained based on the position information and the actual position information of the object in the training image to obtain a target model.

21. A target detection apparatus, characterized in that the target detection apparatus includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the target detection apparatus performs the method described in any one of claims 1 to 18.

22. A computer storage medium, characterized in that the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to perform the method described in any one of claims 1 to 18.

23. A computer program product, characterized in that the computer program product stores instructions that, when executed by a computer, cause the computer to perform the method described in any one of claims 1 to 18.
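As an informal illustration of the processing recited in claims 1 to 3, the first fusion (alignment, concatenation, convolution) and a cross-attention injection can be sketched in plain Python. Everything concrete here is an assumption made for illustration: features are modeled as lists of per-position channel vectors, alignment is nearest-neighbour resampling, the convolution has kernel size 1 with hypothetical weights, and the cross-attention uses no learned projections and adds its output back residually. The claims do not specify any of these choices, and the detection step that produces position information is omitted.

```python
import math
from typing import List

Feature = List[List[float]]  # one row per spatial position, one column per channel


def align_to(reference: Feature, feature: Feature) -> Feature:
    """Resample `feature` to the number of positions in `reference` (nearest neighbour)."""
    n_ref, n = len(reference), len(feature)
    return [list(feature[i * n // n_ref]) for i in range(n_ref)]


def concatenate(x: Feature, y: Feature) -> Feature:
    """Concatenate two aligned features along the channel axis."""
    return [row_x + row_y for row_x, row_y in zip(x, y)]


def conv1x1(x: Feature, weight: List[List[float]]) -> Feature:
    """Kernel-size-1 convolution: the same linear map applied at every position."""
    return [[sum(w * v for w, v in zip(w_row, row)) for w_row in weight] for row in x]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_inject(feature: Feature, fusion: Feature) -> Feature:
    """Queries come from `feature`; keys and values come from `fusion`; residual add."""
    channels = len(feature[0])
    scale = math.sqrt(channels)
    enhanced = []
    for query in feature:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in fusion]
        weights = softmax(scores)
        context = [sum(w * value[c] for w, value in zip(weights, fusion))
                   for c in range(channels)]
        enhanced.append([q + ctx for q, ctx in zip(query, context)])
    return enhanced


first = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 positions, 2 channels
second = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]               # 2 positions, 3 channels

aligned_second = align_to(first, second)                  # now 4 positions
concatenated = concatenate(first, aligned_second)         # 4 positions, 5 channels
weight = [[0.2] * 5, [0.1] * 5]                           # hypothetical: 5 -> 2 channels
first_fusion = conv1x1(concatenated, weight)
enhanced_first = cross_attention_inject(first, first_fusion)
enhanced_second = second  # first alternative of claim 2: second feature kept as is
```

The sketch follows the first alternative of claim 2, in which only the first feature receives the fusion result. Injecting into the second feature as well would additionally require the fusion result and the second feature to share a channel count, for example through a further projection.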

Citation Information

Patent Citations

  • Target detection method and electronic equipment

    CN114387496A