An end-to-end target detection method based on feature fusion and electronic equipment
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HENAN INST OF ENG
- Filing Date
- 2026-04-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing end-to-end target detection methods suffer from problems such as missed detection of small targets and inaccurate localization of large targets when dealing with multi-scale targets. Furthermore, the training process has slow convergence speed and unstable matching, making it difficult to maintain robustness in complex scenarios.
An end-to-end target detection method based on feature fusion is designed. It adopts a feature fusion module with a multi-branch path structure and a self-attention mechanism, combined with a denoising training strategy and a one-to-one label allocation method, to optimize the multi-scale feature fusion and training mechanism, thereby improving the detection accuracy and training stability of the model for multi-scale targets.
It significantly improves the model's detection accuracy for multi-scale targets, mitigates the problems of missed detection of small targets and ambiguous boundary localization, enhances training iteration efficiency and detection stability in complex scenarios, and also features flexible backbone network compatibility, adapting to various decoder structures to achieve a balance between accuracy and efficiency.
Figure CN122244552A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer vision and image processing technology, and specifically relates to an end-to-end target detection method based on feature fusion and an electronic device.
Background Technology
[0002] In recent years, end-to-end object detection methods, represented by DETR, have successfully eliminated the dependence on anchor box setting and complex post-processing in traditional detection methods by leveraging the global modeling capabilities of Transformers, thus driving the development of object detection technology. However, these methods still have significant drawbacks when dealing with multi-scale objects: they rely on single-scale global feature maps for object modeling. While high-resolution feature maps can preserve details of small objects, they lack sufficient semantic information, while low-resolution feature maps are rich in semantic information but lose spatial details of small objects. This leads to problems such as missed detection of small objects and inaccurate localization of large objects in multi-scale scenarios.
[0003] To address the challenges of multi-scale detection, subsequent DETR variants, such as DINO-DETR and DN-DETR, attempted to introduce multi-scale feature encoders or improve label allocation strategies. However, limitations remain: the fusion of multi-scale features often relies on simple channel concatenation or weighted summation, lacking deep modeling of semantic relationships between feature levels and failing to effectively integrate low-level details with high-level semantics. Existing fusion modules, such as CSPNet and ELAN, while improving feature representation capabilities, suffer from insufficient support for multi-scale feature interaction due to CSPNet's cross-stage connections, and ELAN's parallel branch structure incurs high computational overhead, making it difficult to achieve a balance between detection accuracy and computational efficiency.
[0004] Meanwhile, DETR-like methods face challenges in training due to slow convergence and unstable matching. Because of the computational nature of global self-attention mechanisms, the model struggles to quickly establish accurate semantic correspondences between queries and targets in the early stages of training, and lacks a guidance mechanism for local salient features, leading to ambiguity in label assignment. Existing denoising training strategies often rely on a single noise perturbation method, offering limited improvement in robustness to complex scenarios. While one-to-one label assignment can alleviate matching ambiguity, it lacks a refined loss function design, making it difficult to simultaneously ensure the accuracy of target localization and category prediction. These issues collectively restrict the deployment and application of end-to-end object detection methods in real-world scenarios.
Summary of the Invention
[0005] To address the aforementioned technical challenges, this invention proposes an end-to-end target detection method and electronic device based on feature fusion, focusing on two dimensions: multi-scale feature fusion and training mechanism optimization. The aim is to improve the model's detection accuracy and training stability for multi-scale targets, providing a better solution for the practical application of end-to-end target detection technology.
[0006] To address the aforementioned technical problems, this invention provides a technical solution: an end-to-end target detection method based on feature fusion, characterized by comprising: inputting the image to be detected into a pre-trained FUSION-DETR target detection model to obtain target information in the image to be detected, wherein the target information includes the location information and target type information of all targets; wherein the FUSION-DETR target detection model is an improvement upon the DETR model, and its improvements include:
[0007] (1) Design a bottleneck encoder, which includes a multi-scale feature processing unit that integrates a feature fusion module. The feature fusion module adopts a multi-branch path structure to fuse and optimize the multi-scale features output by the backbone network, so as to reduce redundant feature tokens and enhance feature expression capabilities.
[0008] (2) In the bottleneck encoder, a self-attention mechanism (SA) and a selection network (Select) are introduced. The self-attention mechanism is used to capture the global contextual dependencies of features, and the selection network is used to filter high-confidence feature tokens and generate the position embeddings required by the decoder.
[0009] (3) The model is trained using a denoising training strategy and a one-to-one label assignment method. The bottleneck encoder is used as the feature preprocessing stage and the decoder is used as the detection head for decoupling design, so as to be compatible with a variety of DETR-type models and backbone networks.
[0010] Furthermore, the multi-scale feature processing unit of the bottleneck encoder specifically includes the following steps:
[0011] (1) Receive four feature maps F1, F2, F3 and F4 of different scales output by the backbone network, where F4 is obtained by downsampling F3 and does not go through the feature fusion module;
[0012] (2) Among the multi-scale feature maps F1, F2, F3, and F4 output by the backbone network, the feature map pairs (F1 and F2, F2 and F3) are preprocessed separately: the low-level feature maps F1 and F2 are first enhanced by a 3×3 convolution (stride 1, padding 1) and then input into independent Fusion modules. Within each Fusion module, the features are first unified to 256 dimensions by a 1×1 convolution, and feature fusion is then completed by a "fast enhancement branch" and a "deep interaction branch". The fast enhancement branch applies a 3×3 depthwise separable convolution followed by a 1×1 convolution for channel compression and loops once, while the deep interaction branch achieves multi-level feature interaction by looping twice through a residual concatenation structure. Finally, the fused features S1 (fusion of F1 and F2) and S2 (fusion of F2 and F3) are output, and the two Fusion modules are executed in parallel to improve computational efficiency.
[0013] (3) The features S1 and S2 obtained by dual-scale fusion are concatenated with the original feature map F4 in the channel dimension. The concatenated features are input into a 1×1 convolutional layer. This convolutional layer uses He normal initialization and has no activation function, and compresses the number of channels from 768 to 256 to match the input dimension required by the subsequent multi-head self-attention mechanism. The feature map after dimension adjustment is flattened into a two-dimensional feature token sequence. The sequence length corresponds to the spatial size of the input image after 1 / 16 downsampling. Each token has a dimension of 256 and is directly input into the self-attention module for global context modeling.
[0014] (4) Flatten the adjusted features into a token sequence, input it into the self-attention mechanism for global context modeling, then generate the position embedding through the selection network, and finally output it to the decoder.
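The channel and token bookkeeping in steps (3) and (4) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 640×640 input size and the helper name `token_sequence_shape` are assumptions, and it presumes that S1, S2, and F4 have already been brought to the same spatial size before concatenation, which the description implies but does not spell out.

```python
def token_sequence_shape(height, width, num_fused_maps=3, channels_per_map=256,
                         out_channels=256, downsample=16):
    """Return (concat_channels, num_tokens, token_dim) for the bottleneck encoder.

    Hypothetical helper: assumes S1, S2 and F4 each carry 256 channels and
    share one spatial size equal to the input size divided by `downsample`.
    """
    # Channel-wise concatenation of S1, S2 and F4: 3 x 256 = 768 channels.
    concat_channels = num_fused_maps * channels_per_map
    # Flattening an (H/16) x (W/16) map gives one token per spatial position.
    num_tokens = (height // downsample) * (width // downsample)
    # The 1x1 convolution compresses 768 channels to the 256-dim token size.
    return concat_channels, num_tokens, out_channels


shape = token_sequence_shape(640, 640)
print(shape)  # (768, 1600, 256)
```

For a 640×640 image this gives 1600 tokens of dimension 256 entering the self-attention module.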
[0015] Furthermore, the feature fusion module adopts a multi-branch path structure, specifically including: the input features are first split into two parallel branches, each branch first completes the initial adjustment of the channel dimension through 1×1 convolution, one branch serves as a direct mapping path to retain the original features, and the other branch enters the multi-branch residual structure for in-depth processing;
[0016] In the multi-branch residual structure, the features are divided into two paths: one path extracts spatial context information through a 3×3 convolution (stride 1, padding 1), and the other path compresses the channel dimension through a 1×1 convolution and is then concatenated with the former to form a cross-stage local connection. This structure is executed N2=2 times in a loop to achieve deep interaction of features at different levels.
[0017] At the same time, an additional fast enhancement branch is set up. After enhancing local details through 3×3 convolution, the channel is compressed through 1×1 convolution. This branch is executed N1=1 times in a loop to improve feature expression in a lightweight way.
[0018] All branch outputs are first spliced together to achieve preliminary fusion, preserving the complementary information of low-level detailed features and high-level semantic features. Then, they are summed element-wise with the features of the direct mapping path to enhance gradient flow. Finally, 1×1 convolution is used to unify the feature dimensions and generate integrated features with stronger generalization ability. The whole process effectively reduces computational redundancy through spatial enhancement of 3×3 convolution and channel compression of 1×1 convolution.
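To make the cost argument concrete, the following sketch counts weights for a standard 3×3 convolution and for a 3×3 depthwise separable convolution followed by a 1×1 convolution, the combination described for the fast enhancement branch. The 256-channel width comes from the text; the function names, the choice to ignore bias terms, and the use of weight count as a proxy for computational cost are assumptions.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise convolution plus a 1x1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1x1 convolution mixes channels
    return depthwise + pointwise


std = standard_conv_params(256, 256, 3)
sep = depthwise_separable_params(256, 256, 3)
print(std, sep)  # 589824 67840
```

At 256 channels the separable form needs roughly one ninth of the weights of the standard 3×3 convolution.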
[0019] Furthermore, the self-attention mechanism and the selection network are specifically implemented as follows: the feature map after dimensional adjustment is flattened into a two-dimensional feature token sequence, which is input into a multi-head self-attention mechanism. This mechanism contains 8 attention heads, each of which divides the feature dimension into 32 dimensions. It captures global contextual dependencies by calculating the similarity weights between tokens. The feature tokens output by the self-attention mechanism are input into the selection network. This network calculates the confidence score of each token through a multilayer perceptron, selects the top 50% of high-confidence tokens, generates standardized position embeddings for them, and finally outputs them to the decoder.
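The head-size arithmetic and the top-50% selection can be sketched as follows. This is a minimal illustration, not the patented selection network: the confidence scores are hypothetical values standing in for the multilayer perceptron output, and keeping the selected tokens in their original order is an assumption.

```python
def select_top_tokens(scores, keep_ratio=0.5):
    """Return indices of the highest-scoring tokens, kept in original order."""
    num_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# 256-dim tokens split across 8 heads gives 32 dimensions per head.
embed_dim, num_heads = 256, 8
head_dim = embed_dim // num_heads

# Hypothetical per-token confidence scores (one per token).
scores = [0.91, 0.12, 0.55, 0.78, 0.05, 0.63]
kept = select_top_tokens(scores)
print(head_dim, kept)  # 32 [0, 3, 5]
```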
[0020] Furthermore, the specific implementation of the denoising training strategy is as follows: during the model training stage, a combination of Gaussian noise and mask noise is added to the feature tokens input to the decoder. The mean of the Gaussian noise is set to 0 and its variance ranges from 0.01 to 0.05, which simulates random interference in the feature extraction process. The occlusion ratio of the mask noise is controlled at 10%~20%, randomly masking some feature tokens to enhance the robustness of the model to local feature loss.
[0021] The noise intensity can be dynamically adjusted according to the number of training iterations: in the first 30% of the iterations, higher intensity noise (variance 0.05, occlusion ratio 20%) is used to improve the model's generalization ability, and in the last 70% of the iterations, lower intensity noise (variance 0.01, occlusion ratio 10%) is used to focus on parameter fine-tuning. After the noise-added feature tokens are input into the decoder, the decoder corrects and reconstructs the noise features through a cross-attention mechanism. At the same time, it combines the backpropagation gradient of the real label to guide the model to learn the ability to extract effective target information from the noisy features, thereby improving the model's detection stability in complex scenarios such as changes in lighting and occlusion interference.
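The two-stage noise schedule can be sketched in plain Python. The variance and occlusion values and the 30% / 70% split come from the text; the function names, the list-of-lists token layout, and masking by setting a token to zero are assumptions made for illustration.

```python
import random


def noise_schedule(iteration, total_iterations, warm_fraction=0.3):
    """Return (gaussian_variance, mask_ratio) for the current iteration."""
    if iteration < warm_fraction * total_iterations:
        return 0.05, 0.20   # first 30%: higher-intensity noise
    return 0.01, 0.10       # last 70%: lower-intensity noise


def add_denoising_noise(tokens, variance, mask_ratio, rng):
    """Add zero-mean Gaussian noise, then mask a random subset of tokens."""
    std = variance ** 0.5
    noisy = [[x + rng.gauss(0.0, std) for x in tok] for tok in tokens]
    num_masked = round(len(tokens) * mask_ratio)
    for idx in rng.sample(range(len(tokens)), num_masked):
        # Masking by zeroing the whole token is an assumption of this sketch.
        noisy[idx] = [0.0] * len(noisy[idx])
    return noisy


rng = random.Random(0)
tokens = [[1.0] * 4 for _ in range(10)]   # 10 hypothetical 4-dim tokens
variance, mask_ratio = noise_schedule(iteration=100, total_iterations=1000)
noisy = add_denoising_noise(tokens, variance, mask_ratio, rng)
num_masked = sum(1 for tok in noisy if all(v == 0.0 for v in tok))
```

With 10 tokens in the early stage, two tokens are masked and the rest carry Gaussian perturbations with variance 0.05.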
[0022] Furthermore, the specific implementation of the one-to-one label allocation method and the corresponding loss function is as follows: a matching cost matrix between the prediction results and the real labels is constructed, and the optimal assignment is solved with the Hungarian algorithm. The cost is a weighted average of the class confidence loss (weight 0.4) and the bounding box regression loss (weight 0.6). Optimal matching on this matrix yields a one-to-one correspondence between prediction results and real labels, ensuring that each prediction box corresponds to only one real target label and that each real target label is assigned to only one prediction box, the one with the highest confidence and the smallest regression error, thus avoiding training ambiguity caused by many-to-one or one-to-many matching.
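A small worked example of the weighted matching cost follows. To stay dependency-free, this sketch replaces the Hungarian algorithm with an exhaustive search over assignments, which gives the same optimum on tiny inputs but does not scale. The cost values and the function name `match_one_to_one` are hypothetical, and the interpretation of 0.4 and 0.6 as weights is an assumption.

```python
from itertools import permutations


def match_one_to_one(cls_cost, box_cost, w_cls=0.4, w_box=0.6):
    """Return (assignment, total_cost); assignment[j] is the prediction for target j.

    Exhaustive search standing in for the Hungarian algorithm.
    Requires at least as many predictions (rows) as targets (columns).
    """
    num_preds, num_targets = len(cls_cost), len(cls_cost[0])
    cost = [[w_cls * cls_cost[i][j] + w_box * box_cost[i][j]
             for j in range(num_targets)] for i in range(num_preds)]
    best, best_total = None, float("inf")
    for preds in permutations(range(num_preds), num_targets):
        total = sum(cost[p][j] for j, p in enumerate(preds))
        if total < best_total:
            best, best_total = preds, total
    return list(best), best_total


# Hypothetical costs: 3 predictions (rows) x 2 ground-truth targets (columns).
cls_cost = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
box_cost = [[0.1, 0.7], [0.9, 0.2], [0.4, 0.6]]
assignment, total = match_one_to_one(cls_cost, box_cost)
print(assignment)  # [0, 1]
```

Prediction 0 is matched to target 0 and prediction 1 to target 1; prediction 2 stays unmatched and would be treated as background.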
[0023] The bounding box regression loss adopts the CIoU (Complete Intersection over Union) loss function, which incorporates the center distance, aspect ratio difference, and overlap of the bounding boxes. The loss calculation formula is as follows:
[0024]
[0025] In the formula, α and β are adjustment coefficients, set to 0.1 and 0.05 respectively.
[0026] The category loss adopts focal loss to alleviate the class imbalance problem and focus training on hard examples. The loss formula is as follows:
[0027]
[0028] In the formula, the focusing parameter is set to 2, and the probability term is the confidence of the target class predicted by the model;
[0029] The overall loss function of the model is:
[0030]
[0031] In the formula, the category loss weight is set to 1.0. By backpropagating the overall loss function, the accuracy of target location regression and category prediction is optimized simultaneously to ensure the consistency of the model training objectives.
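The three formulas referenced in this passage are not reproduced in the text. One reconstruction consistent with the description is given here as a sketch only, not as the patent's exact notation. It assumes that the first adjustment coefficient (0.1) scales the center-distance term and the second (0.05) scales the aspect-ratio term of CIoU, that the category loss is the standard focal loss without a class-balancing factor, and that the category loss weight multiplies the category term. The symbols $L_{box}$, $L_{cls}$, $\rho$, $c$, $v$, $\gamma$, $p_t$, and $\lambda_{cls}$ are introduced purely for illustration.

```latex
\begin{aligned}
L_{box} &= 1 - \mathrm{IoU} + \alpha \, \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \beta \, v,
\qquad
v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2},
\qquad \alpha = 0.1,\ \beta = 0.05, \\
L_{cls} &= -\,(1 - p_t)^{\gamma} \log(p_t), \qquad \gamma = 2, \\
L &= L_{box} + \lambda_{cls} \, L_{cls}, \qquad \lambda_{cls} = 1.0 .
\end{aligned}
```

Here $\rho(b, b^{gt})$ denotes the distance between the centers of the predicted and ground-truth boxes, $c$ the diagonal length of the smallest box enclosing both, and $p_t$ the predicted confidence of the target class.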
[0032] Furthermore, the FUSION-DETR target detection model possesses backbone network compatibility and architecture variant adaptation capabilities, specifically implemented as follows:
[0033] (1) Backbone network compatibility: The model supports access to any convolutional neural network or a Transformer-based backbone network, including but not limited to ResNet-50 and Swin-T. The backbone network is used to extract multi-scale initial feature maps of the image to be detected. When ResNet-50 is accessed, the multi-scale feature maps F1, F2, F3, and F4 output by the backbone network correspond to 1 / 4, 1 / 8, 1 / 16, and 1 / 32 of the input image downsampled. When Swin-T is accessed, the scale consistency of the output feature maps is maintained by adjusting the stride of the Patch Embedding.
[0034] (2) Architecture variant selection: Two feature fusion module integration position variants are designed, namely, the fusion module is placed before the encoder (variant a) and the fusion module is placed after the encoder (variant b); the computational cost of variant a is reduced by 30%~40% compared with variant b, and the detection accuracy of variant b is improved by no more than 2% compared with variant a. In practical applications, variant a is selected to reduce computational overhead while ensuring detection performance.
[0035] (3) Decoder compatibility: The decoder part can be adapted to the decoder structure of various DETR-type models, including but not limited to DINO-DETR and DN-DETR decoders. Through a unified feature dimension alignment and embedding generation mechanism, it can achieve seamless connection with different decoders.
[0036] To solve the above-mentioned technical problems, another technical solution provided by the present invention is: an electronic device, including a memory and a processor, characterized in that the memory stores a computer program, and the processor executes the computer program to implement the end-to-end target detection method based on feature fusion as described in any one of claims 1 to 7.
[0037] To solve the above-mentioned technical problems, another technical solution provided by the present invention is: a computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, it implements the end-to-end target detection method based on feature fusion as described in any one of claims 1 to 7.
[0038] The beneficial effects of this invention are as follows:
[0039] (1) Optimize multi-scale feature fusion performance and solve the problem of scale perception imbalance in traditional DETR-type models. The feature fusion module designed in this invention draws on the cross-scale modeling of FPN, the cross-stage connection of CSPNet, and the multi-branch feature reuse of ELAN, and adopts a dual-path structure of "fast enhancement branch + deep interaction branch". Through the collaborative design of 3×3 convolutional spatial enhancement and 1×1 convolutional channel compression, it reduces computational redundancy while preserving low-level detailed features and enhancing high-level semantic features. Compared with individual fusion modules such as CSPNet, whose cross-scale interaction is insufficient, and ELAN, whose computational overhead is high, this module achieves deep multi-level feature interaction through dual-scale parallel fusion and a residual concatenation mechanism. This significantly improves the model's detection accuracy for multi-scale targets and, in particular, mitigates missed detection of small targets and blurred boundary localization.
[0040] (2) Constructing an efficient bottleneck encoder to address the core pain points of slow training convergence and unstable matching in DETR. This invention integrates a self-attention mechanism and a selection network into the encoder, forming an integrated preprocessing flow of feature fusion, global modeling, and token selection. First, the self-attention mechanism accurately captures the global contextual dependencies of features, compensating for the limitations of local perception in traditional encoders. The selection network filters high-confidence feature tokens and generates standardized position embeddings, significantly reducing the interference of redundant tokens on the decoding process and accelerating the establishment of semantic correspondence between the query and the target. At the same time, the decoupled design of the encoder and decoder allows feature preprocessing and the target detection head to perform their respective functions, further improving the training convergence speed and optimization stability. Compared with the original DETR and variant models, the training iteration efficiency is improved and the occurrence rate of matching ambiguity is reduced.
[0041] (3) Innovative training mechanism design to enhance the model's adaptability to complex scenarios and detection robustness. This invention adopts a denoising training strategy combining Gaussian noise and mask noise. By dynamically adjusting the noise intensity, the model can effectively cope with complex scenarios such as changes in lighting and occlusion interference. The one-to-one label allocation method constructs a weighted cost matrix and solves it with the Hungarian algorithm. Combined with the refined design of the CIoU loss and the focal loss, it avoids training ambiguity caused by many-to-one or one-to-many matching, and simultaneously optimizes target localization accuracy and category prediction performance. This makes the detection stability of the model in dense-target and low-texture scenarios significantly better than that of existing methods.
[0042] (4) Improve model compatibility and engineering practicality to adapt to deployment requirements in multiple scenarios. The FUSION-DETR model of this invention has flexible backbone network and decoder compatibility, can be seamlessly connected to ConvNet and Transformer backbone networks, and is compatible with various decoder structures such as DINO-DETR and DN-DETR. Through the selection of architecture variants, the computational overhead of variant a is reduced by 30% to 40% compared with variant b, with a detection accuracy loss of no more than 2%, achieving a balance between accuracy and efficiency. It can meet the high-precision detection requirements of the server side and adapt to the lightweight deployment scenarios of edge devices, thus broadening the practical application scope of end-to-end target detection technology.
[0043] To make the above and other objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
Attached Figure Description
[0044] To more clearly illustrate the technical solutions in this invention or the prior art, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only four of the drawings in this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0045] Figure 1 is a schematic diagram of the architecture of the end-to-end target detection model based on feature fusion in this invention;
[0046] Figure 2 is a schematic diagram of the specific architecture for feature fusion used in the CSPNet model;
[0047] Figure 3 is a schematic diagram of the specific architecture of feature fusion used in the end-to-end target detection model based on feature fusion in this invention;
[0048] Figure 4 is a schematic diagram of the specific architecture of dual-path residual fusion used in the end-to-end target detection model based on feature fusion in this invention.
Detailed Implementation
[0049] Embodiments of the invention will now be described in detail with reference to the accompanying drawings. While some embodiments of the invention are shown in the drawings, it should be understood that the invention can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustrative purposes only and are not intended to limit the scope of protection of the invention.
[0050] Example 1:
[0051] An end-to-end target detection method based on feature fusion includes: inputting the image to be detected into a pre-trained FUSION-DETR target detection model to obtain target information in the image to be detected, wherein the target information includes the location information and target type information of all targets.
[0052] As shown in Figure 1, the end-to-end target detection model based on feature fusion is an improvement on the DETR model. The improvements include: (1) designing a bottleneck encoder, which includes a multi-scale feature processing unit that integrates a feature fusion module. The feature fusion module draws on the design concepts of FPN, CSPNet (as shown in Figure 2), and ELAN to fuse and optimize the multi-scale features output by the backbone network, so as to reduce redundant feature tokens and enhance feature representation capabilities.
[0053] (2) In the bottleneck encoder, a self-attention mechanism (SA) and a selection network (Select) are introduced. The self-attention mechanism is used to capture the global contextual dependencies of features, and the selection network is used to filter high-confidence feature tokens and generate the position embeddings required by the decoder.
[0054] (3) The model is trained using a denoising training strategy and a one-to-one label assignment method. The bottleneck encoder is used as the feature preprocessing stage and the decoder is used as the detection head for decoupling design, so as to be compatible with a variety of DETR-type models and backbone networks.
[0055] The multi-scale feature processing unit of the bottleneck encoder includes:
[0056] (1) Receive four feature maps F1, F2, F3 and F4 of different scales output by the backbone network, where F4 is obtained by downsampling F3 and does not go through the feature fusion module;
[0057] (2) Among the multi-scale feature maps F1, F2, F3, and F4 output by the backbone network, the feature map pairs (F1 and F2, F2 and F3) are first preprocessed: the low-level feature maps F1 and F2 are first enhanced by a 3×3 convolution (stride 1, padding 1) and then input into independent Fusion modules. Within each Fusion module, the features are first unified to 256 dimensions by a 1×1 convolution, and feature fusion is then completed by a "fast enhancement branch" and a "deep interaction branch". The fast enhancement branch applies a 3×3 depthwise separable convolution followed by a 1×1 convolution for channel compression and loops once, while the deep interaction branch achieves multi-level feature interaction by looping twice through a residual concatenation structure. Finally, the fused features S1 (fusion of F1 and F2) and S2 (fusion of F2 and F3) are output, and the two Fusion modules (as shown in Figure 3) are executed in parallel to improve computational efficiency;
[0058] (3) The features S1 and S2 obtained by dual-scale fusion are concatenated with the original feature map F4 in the channel dimension. The concatenated features are input into a 1×1 convolutional layer. This convolutional layer uses He normal initialization and has no activation function, and compresses the number of channels from 768 to 256 to match the input dimension required by the subsequent multi-head self-attention mechanism. The feature map after dimension adjustment is flattened into a two-dimensional feature token sequence. The sequence length corresponds to the spatial size of the input image after 1 / 16 downsampling. Each token has a dimension of 256 and is directly input into the self-attention module for global context modeling.
[0059] (4) Flatten the adjusted features into a token sequence, input it into the self-attention mechanism (SA) for global context modeling, generate position embeddings through the Select network, and finally output it to the decoder.
[0060] Specifically, in the feature fusion module (as shown in Figure 4), the input features are first split into two parallel branches. Each branch first undergoes a 1×1 convolution to initially adjust the channel dimensions. One branch serves as a direct mapping path, preserving the original features, while the other branch enters a multi-branch residual structure for deep processing. In the multi-branch residual structure, the features are divided into two paths: one path extracts spatial context information through a 3×3 convolution, and the other path compresses the channel dimensions through a 1×1 convolution before being concatenated with the former to form a cross-stage local connection. This structure is executed N2=2 times to achieve deep interaction between features at different levels. In addition, a fast enhancement branch is provided, which enhances local details through a 3×3 convolution and then compresses channels through a 1×1 convolution. This branch is executed N1=1 times to improve feature representation in a lightweight manner. All branch outputs are first concatenated to achieve preliminary fusion, preserving the complementary information between low-level detailed features and high-level semantic features. Then, they are summed element-wise with the features of the direct mapping path to enhance gradient flow. Finally, a 1×1 convolution is used to unify feature dimensions and generate integrated features with stronger generalization ability. The entire process effectively reduces computational redundancy through the spatial enhancement of 3×3 convolutions and the channel compression of 1×1 convolutions.
[0061] The specific implementation of the denoising training strategy is as follows: During the model training phase, a combination of Gaussian noise and mask noise is added to the feature tokens input to the decoder. The mean of the Gaussian noise is set to 0, and the variance range is 0.01~0.05, which is used to simulate random interference in the feature extraction process. The mask noise occlusion ratio is controlled at 10%~20%, which is used to randomly occlude some feature tokens to enhance the robustness of the model to local feature loss. The noise intensity can be dynamically adjusted according to the number of training iterations. In the early stage of the iteration, a higher intensity of noise is used to improve the model's generalization ability. In the later stage of the iteration, the noise intensity is reduced to focus on parameter fine-tuning. After the feature tokens with added noise are input into the decoder, the decoder corrects and reconstructs the noisy features through a cross-attention mechanism. At the same time, it combines the backpropagation gradient of the real label to guide the model to learn the ability to extract effective target information from noisy features, thereby improving the detection stability of the model in complex scenarios such as illumination changes and occlusion interference.
[0062] The specific implementation of the one-to-one label allocation method and its corresponding loss function is as follows: A matching cost matrix between the predicted results and the true labels is constructed, and the optimal assignment is solved with the Hungarian algorithm. The cost is a weighted average of a class confidence loss (weight 0.4) and a bounding box regression loss (weight 0.6). Optimal matching on this matrix yields a one-to-one match between predicted results and true labels, ensuring that each predicted box corresponds to only one true target label, and each true target label is assigned only to the predicted box with the highest confidence and the smallest regression error. This avoids training ambiguity caused by many-to-one or one-to-many matching. The bounding box regression loss uses the CIoU loss function, incorporating information such as the center distance of the bounding boxes, aspect ratio differences, and overlap. The loss calculation formula is as follows:
[0063]
[0064] In the formula, α and β are adjustment coefficients, set to 0.1 and 0.05 respectively, to ensure the accuracy of bounding box regression.
[0065] The category loss uses focal loss to alleviate class imbalance and focus training on hard examples. The loss calculation formula is:
[0066]
[0067] In the formula, the focusing parameter is set to 2, and the probability term is the confidence of the target category predicted by the model.
[0068] The overall loss function of the model is:
[0069]
[0070] In the formula, the category loss weight is set to 1.0. By backpropagating the overall loss function, the accuracy of target location regression and category prediction is optimized simultaneously to ensure the consistency of the model training objectives.
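A numerical sketch of the loss terms described in this embodiment is given below. It is an illustration under stated assumptions rather than the patented formulas: α = 0.1 is assumed to scale the center-distance term and β = 0.05 the aspect-ratio term, the category loss is taken to be the standard focal loss with focusing parameter 2 and no class-balancing factor, and the overall loss is assumed to be the box loss plus the category loss weighted by 1.0. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with positive width and height, and all function names are hypothetical.

```python
import math


def ciou_loss(pred, target, alpha=0.1, beta=0.05):
    """CIoU-style loss for boxes given as (x1, y1, x2, y2) with positive area."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    # Overlap term: intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union
    # Squared center distance over squared diagonal of the enclosing box.
    center_dist2 = ((px1 + px2 - tx1 - tx2) / 2) ** 2 + ((py1 + py2 - ty1 - ty2) / 2) ** 2
    diag2 = (max(px2, tx2) - min(px1, tx1)) ** 2 + (max(py2, ty2) - min(py1, ty1)) ** 2
    # Aspect-ratio difference term.
    v = (4 / math.pi ** 2) * (math.atan((tx2 - tx1) / (ty2 - ty1))
                              - math.atan((px2 - px1) / (py2 - py1))) ** 2
    # Placement of alpha and beta is an assumption of this sketch.
    return 1.0 - iou + alpha * center_dist2 / diag2 + beta * v


def focal_loss(p_t, gamma=2.0):
    """Focal loss for the predicted confidence p_t (0 < p_t <= 1) of the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def total_loss(pred, target, p_t, cls_weight=1.0):
    """Box regression loss plus weighted category loss."""
    return ciou_loss(pred, target) + cls_weight * focal_loss(p_t)


# Hypothetical example: a shifted prediction with moderate class confidence.
loss = total_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0), p_t=0.5)
```

A perfectly localized box with full confidence gives a loss of zero, while a low-confidence prediction is penalized more heavily than a confident one because of the `(1 - p_t) ** gamma` factor.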
[0071] The backbone network compatibility of the FUSION-DETR model is specifically as follows: it supports the integration of any convolutional neural network, such as ResNet-50, ConvNeXt-T, or Transformer-based backbone networks, such as Swin-T, ViT-B, etc. When ResNet-50 is integrated, the multi-scale feature maps F1, F2, F3, and F4 output by the backbone network correspond to 1 / 4, 1 / 8, 1 / 16, and 1 / 32 downsampled versions of the input image, respectively. When Swin-T is integrated, the scale consistency of the output feature maps is maintained by adjusting the stride of the Patch Embedding to ensure the compatibility of subsequent feature fusion modules.
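As a quick check of the scale correspondence for ResNet-50, the following sketch computes the spatial size of each feature map from the stated downsampling factors. The 640×640 input size and the function name are assumptions, and floor division is used for simplicity; a real backbone may round differently depending on padding.

```python
def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size (H, W) of F1..F4 for an input image, using floor division."""
    return {f"F{i + 1}": (height // s, width // s) for i, s in enumerate(strides)}


sizes = feature_map_sizes(640, 640)
print(sizes)
# {'F1': (160, 160), 'F2': (80, 80), 'F3': (40, 40), 'F4': (20, 20)}
```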
[0072] The specific architecture variant selection is as follows: two feature fusion module integration position variants are designed. Variant a places the fusion module before the encoder and directly fuses the multi-scale features output by the backbone network, reducing the computational cost by 30% to 40% compared to variant b. Variant b places the fusion module after the encoder and performs secondary fusion on the features output by self-attention, improving the detection accuracy by no more than 2% compared to variant a. In practical applications, variant a is selected, which significantly reduces the computational overhead while ensuring detection performance and is suitable for edge deployment scenarios.
[0073] Example 2:
[0074] An electronic device includes a memory and a processor, characterized in that the memory stores a computer program, and the processor executes the computer program to implement the end-to-end target detection method based on feature fusion as described in Example 1 above.
[0075] The processor may be a general-purpose processor such as a central processing unit (CPU), or a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and is capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be executed directly by a hardware processor, or by a combination of hardware and software modules within the processor.
[0076] Furthermore, the memory, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, or optical disk. The memory may also be, without limitation, any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory in the embodiments of this application may also be a circuit or any other device capable of implementing a storage function, used to store program instructions and/or data.
[0077] Example 3:
[0078] A computer-readable storage medium storing a computer program, characterized in that, when executed by a processor, the computer program implements the end-to-end target detection method based on feature fusion as described in Example 1 above.
[0079] The technical solution of this application, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a computer-readable storage medium and includes several instructions or computer programs that cause an Internet of Things device (which may be a personal computer, server, or network terminal, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks, as well as electronic terminals such as computers, mobile phones, laptops, and tablets that contain such storage media.
[0080] The description of the execution process of program data in a computer-readable storage medium can be found in the descriptions in the various method embodiments of this application above, and will not be repeated here.
[0081] Note that the above description is merely a preferred embodiment of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the appended claims.
Claims
1. An end-to-end target detection method based on feature fusion, characterized in that it comprises: inputting the image to be detected into a pre-trained FUSION-DETR target detection model to obtain target information in the image, the target information including the location and category of every target; wherein the FUSION-DETR target detection model is an improvement upon the DETR model, and the improvements include: (1) a bottleneck encoder is designed, which includes a multi-scale feature processing unit that integrates a feature fusion module; the feature fusion module adopts a structure that integrates cross-scale modeling, cross-stage connection, and multi-branch feature reuse to fuse and optimize the multi-scale features output by the backbone network, so as to reduce redundant feature tokens and enhance feature expression capability; (2) a self-attention mechanism (SA) and a selection network (Select) are introduced in the bottleneck encoder; the self-attention mechanism captures the global contextual dependencies of the features, and the selection network filters high-confidence feature tokens and generates the position embeddings required by the decoder; (3) the model is trained using a denoising training strategy and a one-to-one label allocation method; the bottleneck encoder serves as the feature preprocessing stage and the decoder serves as the detection head in a decoupled design, so as to be compatible with a variety of DETR-type models and backbone networks.
2. The end-to-end target detection method based on feature fusion according to claim 1, characterized in that the multi-scale feature processing unit of the bottleneck encoder performs the following steps: (1) receive four feature maps F1, F2, F3, and F4 of different scales output by the backbone network, where F4 is obtained by downsampling F3 and does not pass through the feature fusion module; (2) among the multi-scale feature maps F1, F2, F3, and F4 output by the backbone network, the feature map pairs (F1 and F2, F2 and F3) are preprocessed separately: the low-resolution feature maps F1 and F2 are first enhanced by a 3×3 convolution (stride 1, padding 1) and then input into independent fusion modules; within each fusion module, the features are first unified to 256 dimensions by a 1×1 convolution, and feature fusion is then completed by a "fast enhancement branch" and a "deep interaction branch"; the fast enhancement branch compresses the channels through a 3×3 depthwise separable convolution and a 1×1 convolution and is executed once, while the deep interaction branch achieves multi-level feature interaction by executing a residual concatenation structure twice; finally, the fused features S1 (fusion of F1 and F2) and S2 (fusion of F2 and F3) are output, and the two fusion modules are executed in parallel to improve computational efficiency; (3) the features S1 and S2 obtained by dual-scale fusion are concatenated with the original feature map F4 along the channel dimension; the concatenated features are input into a 1×1 convolutional layer, which uses He normal initialization and has no activation function, and which compresses the number of channels from 768 to 256 to match the input dimension required by the subsequent multi-head self-attention mechanism; the feature map after dimension adjustment is flattened into a two-dimensional feature token sequence whose length corresponds to the spatial size of the input image after 1/16 downsampling; each token has a dimension of 256 and is input directly into the self-attention module for global context modeling; (4) the adjusted features, flattened into a token sequence, are input into the self-attention mechanism for global context modeling, the position embeddings are then generated by the selection network, and the result is finally output to the decoder.
3. The end-to-end target detection method based on feature fusion according to claim 1 or 2, characterized in that the feature fusion module adopts a structure that integrates cross-scale modeling, cross-stage connection, and multi-branch feature reuse. Specifically, the input features are first split into two parallel branches, each of which first performs an initial adjustment of the channel dimension through a 1×1 convolution. One branch serves as a direct mapping path that retains the original features, while the other enters a multi-branch residual structure for in-depth processing. In the multi-branch residual structure, the features are divided into two paths: one path extracts spatial context information through a 3×3 convolution (stride 1, padding 1), and the other path compresses the channel dimension through a 1×1 convolution and is then concatenated with the former to form a cross-stage local connection. This structure is executed N2 = 2 times in a loop to achieve deep interaction between features at different levels. In addition, a fast enhancement branch is provided: after local details are enhanced through a 3×3 convolution, the channels are compressed through a 1×1 convolution. This branch is executed N1 = 1 time to improve feature expression in a lightweight way. All branch outputs are first concatenated to achieve preliminary fusion, preserving the complementary information of low-level detail features and high-level semantic features. They are then summed element-wise with the features of the direct mapping path to enhance gradient flow. Finally, a 1×1 convolution unifies the feature dimensions and generates integrated features with stronger generalization ability. The whole process reduces computational redundancy through the spatial enhancement of the 3×3 convolutions and the channel compression of the 1×1 convolutions.
4. The end-to-end target detection method based on feature fusion according to any one of claims 1 to 3, characterized in that the self-attention mechanism and the selection network are implemented as follows: the feature map after dimension adjustment is flattened into a two-dimensional feature token sequence and input into a multi-head self-attention mechanism with 8 attention heads; the feature dimension is divided so that each attention head operates on 32 dimensions, and global contextual dependencies are captured by calculating the similarity weights between tokens; the feature tokens output by the self-attention mechanism are input into the selection network, which calculates a confidence score for each token through a multilayer perceptron, selects the top 50% of tokens by confidence, generates standardized position embeddings for them, and finally outputs them to the decoder.
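The token selection step in claim 4 can be illustrated as follows. This is a minimal sketch that assumes the confidence scores have already been produced by the multilayer perceptron; the function name `select_top_tokens`, the rounding of the kept count, and the example scores are hypothetical.

```python
def select_top_tokens(scores, keep_ratio=0.5):
    """Return the indices of the highest-scoring tokens, in sequence order.

    The number kept is len(scores) * keep_ratio, rounded down,
    with at least one token kept (an assumption).
    """
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Per-head width for 256-dimensional tokens split across 8 heads.
head_dim = 256 // 8

# Four tokens with illustrative confidence scores: tokens 1 and 3 are kept.
kept = select_top_tokens([0.1, 0.9, 0.4, 0.7])
```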
5. The end-to-end target detection method based on feature fusion according to any one of claims 1 to 3, characterized in that the denoising training strategy is implemented as follows: during the model training stage, a combination of Gaussian noise and mask noise is added to the feature tokens input to the decoder; the Gaussian noise has a mean of 0 and a variance in the range 0.01 to 0.05 and simulates random interference in the feature extraction process; the mask noise has a masking ratio of 10% to 20% and randomly masks some feature tokens to enhance the robustness of the model to local feature loss; the noise intensity can be adjusted dynamically according to the number of training iterations: higher-intensity noise (variance 0.05, masking ratio 20%) is used in the first 30% of the iterations to improve the model's generalization ability, and lower-intensity noise (variance 0.01, masking ratio 10%) is used in the last 70% of the iterations to focus on fine-tuning the parameters; after the noise-added feature tokens are input into the decoder, the decoder corrects and reconstructs the noisy features through a cross-attention mechanism, while the gradients backpropagated from the ground-truth labels guide the model to learn to extract effective target information from noisy features, thereby improving the detection stability of the model in complex scenarios such as lighting changes and occlusion.
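The two-phase noise schedule in claim 5 can be sketched as below. The function names `noise_settings` and `add_noise` are hypothetical. Representing a masked token by zeroing it and masking each token independently with the given probability are both assumptions, so the realized masking ratio is only approximately the nominal one.

```python
import random


def noise_settings(step, total_steps):
    """Return (variance, mask_ratio) for the given training iteration."""
    # First 30% of iterations: stronger noise; remaining 70%: weaker noise.
    if step * 10 < total_steps * 3:
        return 0.05, 0.20
    return 0.01, 0.10


def add_noise(tokens, variance, mask_ratio, rng):
    """Apply mask noise and zero-mean Gaussian noise to a list of token vectors."""
    std = variance ** 0.5
    noisy = []
    for token in tokens:
        if rng.random() < mask_ratio:
            # Masked token: replaced by zeros (an assumption).
            noisy.append([0.0] * len(token))
        else:
            noisy.append([x + rng.gauss(0.0, std) for x in token])
    return noisy


rng = random.Random(0)
variance, mask_ratio = noise_settings(step=100, total_steps=1000)
noisy_tokens = add_noise([[1.0, 2.0]] * 10, variance, mask_ratio, rng)
```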
6. The end-to-end target detection method based on feature fusion according to any one of claims 1 to 3, characterized in that the one-to-one label allocation method and the corresponding loss function are implemented as follows: a matching cost matrix between the predictions and the ground-truth labels is constructed, and the optimal assignment is solved with the Hungarian algorithm. The matching cost is a weighted combination of the class confidence loss (weight 0.4) and the bounding box regression loss (weight 0.6). Solving the optimal matching over this matrix yields a one-to-one match between predictions and ground-truth labels, ensuring that each predicted box corresponds to only one ground-truth target label and that each ground-truth target label is assigned to only the one predicted box with the highest confidence and the smallest regression error, thus avoiding the training ambiguity caused by many-to-one or one-to-many matching. The bounding box regression loss adopts the CIoU (Complete Intersection over Union) loss function, incorporating the center distance, the aspect ratio difference, and the overlap of the bounding boxes. The loss calculation formula is as follows: In the formula, α and β are adjustment coefficients, set to 0.1 and 0.05 respectively. The category loss adopts the focal loss to alleviate the class imbalance problem and focus training on hard examples. The loss is calculated as follows: In the formula, the focusing parameter is set to 2, and the remaining variable is the confidence of the target class predicted by the model. The overall loss function of the model is: In the formula, the category loss weight is set to 1.0. Backpropagating the overall loss function optimizes target location regression and category prediction simultaneously, keeping the model's training objectives consistent.
7. The end-to-end target detection method based on feature fusion according to any one of claims 1 to 3, characterized in that the FUSION-DETR target detection model has backbone network compatibility and architecture variant adaptation capabilities, implemented as follows: (1) Backbone network compatibility: the model supports the integration of any convolutional neural network or Transformer-based backbone network, including but not limited to ResNet-50 and Swin-T. The backbone network is used to extract multi-scale initial feature maps of the image to be detected. When ResNet-50 is integrated, the multi-scale feature maps F1, F2, F3, and F4 output by the backbone network correspond to the input image downsampled to 1/4, 1/8, 1/16, and 1/32 of its size, respectively. When Swin-T is integrated, the scale consistency of the output feature maps is maintained by adjusting the stride of the Patch Embedding. (2) Architecture variant selection: two variants of the feature fusion module's integration position are designed, namely the fusion module placed before the encoder (variant a) and the fusion module placed after the encoder (variant b); the computational cost of variant a is 30% to 40% lower than that of variant b, and the detection accuracy of variant b is improved by no more than 2% compared with variant a. In practical applications, variant a is selected to reduce computational overhead while ensuring detection performance. (3) Decoder compatibility: the decoder part can be adapted to the decoder structure of various DETR-type models, including but not limited to the DINO-DETR and DN-DETR decoders. Through a unified feature dimension alignment and embedding generation mechanism, it connects seamlessly with different decoders.
8. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and when the processor executes the computer program, it implements the end-to-end target detection method based on feature fusion as described in any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the end-to-end target detection method based on feature fusion as described in any one of claims 1 to 7.