Methods, devices, servers, and media for small target detection based on feature reconstruction

By using feature reconstruction methods and cascading design of DRSS-Unit, MGSP-Unit, SDF-AIFI and CCFM modules, the problem of insufficient accuracy in small target detection is solved, and high-precision target detection in complex backgrounds is achieved.

CN121937840B · Active · Publication Date: 2026-06-30 · TIANJIN POLYTECHNIC UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TIANJIN POLYTECHNIC UNIV
Filing Date: 2026-03-31
Publication Date: 2026-06-30

IPC: G06V10/82; G06V10/44; G06V10/40; G06V10/54; G06V10/77; G06V10/80; G06N3/045; G06N3/0464; G06N3/048; G06N3/0499; G06N3/0455; G06N3/09

Technology Topics: Semantic feature; Edge enhancement

Abstract

This invention discloses a method, apparatus, server, and medium for detecting small targets based on feature reconstruction, belonging to the field of target detection technology. It includes: inputting primary features obtained from image convolution into a two-layer DRSS-Unit module for processing to obtain shallow edge enhancement features; inputting the shallow edge enhancement features into an MGSP-Unit module for processing to obtain mid-level semantic transition features; inputting the mid-level semantic transition features into an MGSP-Unit module for processing to obtain deep sparse semantic features; inputting the deep sparse semantic features into an SDF-AIFI module for processing to obtain frequency domain global enhancement features; inputting the frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features into a CCFM module for feature reconstruction to obtain reconstructed fusion features; and inputting the reconstructed fusion features into a decoder to obtain the small target detection result. By constructing a cascaded feature reconstruction and enhancement module, progressive reconstruction and enhancement of multi-level features are performed to achieve the detection of small targets in complex scenes.

Description

Technical Field

[0001] This invention relates to the field of target detection technology, and in particular to a method, apparatus, server, and medium for detecting small targets based on feature reconstruction.

Background Technology

[0002] With the rapid development of intelligent monitoring, remote sensing Earth observation, and drone autopilot technologies, the detection of small targets has become a key challenge that urgently needs to be overcome in the field of computer vision. Small targets typically refer to objects that occupy a very low percentage of pixels in an image, have scarce visual features, and are easily affected by changes in lighting and complex backgrounds. The detection results directly affect the application effectiveness of related technologies in real-world scenarios.

[0003] Current object detection technology is mainly built on Transformer-based end-to-end detection architectures, represented by DETR (Detection Transformer) and its successor RT-DETR. These models achieve global modeling through the self-attention mechanism and have made significant progress in general object detection tasks. This technical route marks the transition of object detection from the CNN era to attention-based architectures.

[0004] However, existing model structures are still designed for general scenarios and rely on static cross-scale feature interaction mechanisms. Because a small target occupies very few pixels, these models cannot represent its features adequately, which limits further improvement of detection performance.

Summary of the Invention

[0005] This invention provides a method, apparatus, server, and medium for detecting small targets based on feature reconstruction, in order to solve the technical problem of low detection accuracy caused by insufficient feature representation capability of small targets in the prior art.

[0006] In a first aspect, embodiments of the present invention provide a method for detecting small targets based on feature reconstruction, comprising:

[0007] The primary features obtained by convolving the image are input into a two-layer DRSS-Unit module, which extracts the edge set and spatial texture details of small targets to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit.

[0008] The shallow edge enhancement features are input into the MGSP-Unit module, which takes into account both the target spatial location and preliminary semantic abstraction information, to obtain mid-level semantic transition features.

[0009] The mid-level semantic transition features are input into the MGSP-Unit module, which captures the global context of the target and deep category abstraction information, to obtain deep sparse semantic features.

[0010] The MGSP-Unit module includes a global perception unit and a saliency reconstruction unit, which are used to accurately capture the key features of small targets.

[0011] The deep sparse semantic features are input into the SDF-AIFI module for processing to obtain frequency domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network.

[0012] The frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features are input into the CCFM module for feature reconstruction to obtain reconstructed fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit is used to retain the spatial information of small targets during the resolution reduction process, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features.

[0013] The reconstructed and fused features are input into the decoder to obtain the small target detection results.
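
The data flow of steps [0007] to [0013] can be summarized in a short sketch. This is only an illustration of the wiring between stages, not the patented implementation: the function name `detect_small_targets`, its parameter names, and the `tag` stub are hypothetical, and each stage is passed in as a plain callable.

```python
def detect_small_targets(primary, drss_1, drss_2, mgsp_mid, mgsp_deep,
                         sdf_aifi, ccfm, decoder):
    """Wire the stages in the order given in paragraphs [0007] to [0013]."""
    shallow = drss_2(drss_1(primary))   # two-layer DRSS-Unit -> shallow edge enhancement features
    mid = mgsp_mid(shallow)             # MGSP-Unit -> mid-level semantic transition features
    deep = mgsp_deep(mid)               # MGSP-Unit -> deep sparse semantic features
    freq = sdf_aifi(deep)               # SDF-AIFI -> frequency domain global enhancement features
    fused = ccfm(freq, shallow, mid)    # CCFM fuses three feature levels
    return decoder(fused)               # decoder -> detection result


def tag(name):
    """Hypothetical stub: returns a stage that records its name and inputs."""
    return lambda *inputs: f"{name}({', '.join(inputs)})"


trace = detect_small_targets("primary", tag("DRSS"), tag("DRSS"), tag("MGSP"),
                             tag("MGSP"), tag("SDF_AIFI"), tag("CCFM"), tag("decoder"))
print(trace)
```

The printed trace makes visible that the CCFM stage receives three inputs (the frequency domain, shallow, and mid-level features), while every other stage consumes only the output of its predecessor.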

[0014] In a second aspect, embodiments of the present invention also provide a small target detection device based on feature reconstruction, comprising:

[0015] The shallow edge enhancement feature acquisition module is used to input the primary features obtained by convolving the image into a two-layer DRSS-Unit module, which extracts the edge set and spatial texture details of small targets to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit.

[0016] The mid-level semantic transition feature acquisition module is used to input the shallow edge enhancement features into the MGSP-Unit module, which takes into account both the target spatial location and preliminary semantic abstraction information, to obtain mid-level semantic transition features;

[0017] The deep sparse semantic feature acquisition module is used to input the mid-level semantic transition features into the MGSP-Unit module, which captures the global context of the target and deep category abstraction information, to obtain deep sparse semantic features;

[0018] The frequency domain global enhancement module is used to input the deep sparse semantic features into the SDF-AIFI module for processing to obtain frequency domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network.

[0019] The feature reconstruction and fusion module is used to input the frequency domain global enhancement features, shallow edge enhancement features and mid-level semantic transition features into the CCFM module for feature reconstruction to obtain reconstructed and fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit and a dynamic aggregation unit. The lossless spatial downsampling unit is used to retain the spatial information of small targets during the resolution reduction process, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features.

[0020] The target detection module is used to input the reconstructed fusion features into the decoder to obtain the small target detection result.

[0021] In a third aspect, embodiments of the present invention also provide a server, comprising:

[0022] One or more processors;

[0023] A storage device for storing one or more programs;

[0024] When the one or more programs are executed by the one or more processors, the one or more processors implement the feature reconstruction-based small target detection method provided in the above embodiments.

[0025] In a fourth aspect, embodiments of the present invention also provide a medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the feature reconstruction-based small target detection method provided in the above embodiments.

[0026] In the small target detection method, apparatus, server, and medium based on feature reconstruction provided by this invention, the primary features obtained from image convolution are input into a two-layer DRSS-Unit module, which extracts the edge set and spatial texture details of small targets to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit. The shallow edge enhancement features are then input into an MGSP-Unit module, which considers both the target's spatial location and preliminary semantic abstraction information, to obtain mid-level semantic transition features. Next, the mid-level semantic transition features are input into an MGSP-Unit module, which captures the target's global context and deep category abstraction information, to obtain deep sparse semantic features. The MGSP-Unit module includes a global perception unit and a saliency reconstruction unit for accurately capturing key features of small targets. The deep sparse semantic features are input into the SDF-AIFI module to obtain frequency-domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network. The frequency-domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features are input into the CCFM module for feature reconstruction, resulting in reconstructed fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit preserves the spatial information of small targets during resolution reduction, and the dynamic aggregation unit performs adaptive fusion of cross-scale features. The reconstructed fused features are input into the decoder to obtain the small target detection result. Through the cascaded design of the DRSS-Unit, MGSP-Unit, SDF-AIFI, and CCFM modules, small target features are progressively reconstructed and enhanced from shallow edge textures to deep semantic context, enabling the detection of small targets in complex backgrounds.

Description of the Drawings

[0027] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0028] Figure 1 is a flowchart of the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0029] Figure 2 is a schematic diagram of the DRSS-Unit module in training mode in the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0030] Figure 3 is a schematic diagram of the structure of the DRSS-Unit module in inference mode in the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0031] Figure 4 is a schematic diagram of the MGSP-Unit module of the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0032] Figure 5 is a schematic diagram of the SDF-AIFI module of the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0033] Figure 6 is a schematic diagram of the structure of the lossless spatial downsampling unit of the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0034] Figure 7 is a schematic diagram of the structure of the dynamic aggregation unit of the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention;

[0035] Figure 8 is a schematic diagram of the small target detection device based on feature reconstruction provided in Embodiment 2 of the present invention;

[0036] Figure 9 is a structural diagram of the server provided in Embodiment 3 of the present invention.

Detailed Implementation

[0037] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, the accompanying drawings show only the parts relevant to the present invention, and not all of the structures.

[0038] Embodiment 1

[0039] Figure 1 is a flowchart of the feature reconstruction-based small target detection method provided in Embodiment 1 of the present invention. This embodiment is applicable to the detection of small targets in complex scenes, and specifically includes the following steps:

[0040] Step 110: The primary features obtained by convolving the image are input into a two-layer DRSS-Unit module, which extracts the edge set and spatial texture details of small targets to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit and an attention fusion unit.

[0041] Here, the image refers to preprocessed image data containing small targets that is input into the neural network model, and primary features refer to the shallow feature representation extracted by the first convolution operation on the image. For example, images can be acquired through drone aerial photography, remote sensing satellites, or intelligent monitoring equipment. Such images are characterized by a wide field of view, high background complexity, and small, densely distributed targets. Small targets in these images often occupy only a small pixel area, and their edge textures are easily obscured by changes in lighting, noise interference, and background clutter. Therefore, the original image is first preprocessed and then input into the backbone network for convolution to obtain primary features that retain rich spatial details. The primary features are then fed into a two-layer cascaded DRSS-Unit module, which extracts the edge set and spatial texture details of the small targets in a targeted manner, yielding shallow edge enhancement features rich in geometric information.

[0042] For example, as shown in Figure 2, the DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit. The three-branch convolutional fusion unit comprises a 3×3 convolutional branch, a 1×1 convolutional branch, and an identity mapping branch, each followed by a batch normalization layer to improve the convergence of deep networks. The three branches are parallel: the input features are processed by each branch, and the results are summed element-wise to obtain the output features of the three-branch convolutional fusion unit. The texture branch unit includes a single-layer convolutional structure, a batch normalization layer, and an activation function. It uses the local perception capability of small convolutional kernels to preserve the texture details of small targets. The single-layer convolutional structure uses a 3×3 convolutional kernel to capture subtle local texture responses on shallow feature maps, the batch normalization layer normalizes the convolutional output, and the activation function layer uses the sigmoid function to map the feature responses non-linearly to the [0,1] interval, generating an enhanced texture attention response map. The edge branch unit includes a horizontal strip convolution branch and a vertical strip convolution branch, and is used to sharpen the contour boundaries of small targets and suppress background texture noise inconsistent with the edge direction. The horizontal strip convolution branch uses a 1×7 convolution kernel to capture the edge response of the target along the horizontal direction, followed by a batch normalization layer. The vertical strip convolution branch uses a 7×1 convolution kernel to extract the contour features of the target along the vertical direction, also followed by a batch normalization layer. During processing, the input features are convolved and normalized in the two parallel branches, and the results are summed element-wise. An activation function then applies a non-linear mapping to obtain the output features of the edge branch unit. This orthogonally decomposed strip convolution design enables targeted enhancement of the edges of small targets and filters out background clutter that does not follow the edge directions.
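
As a rough illustration of the texture and edge branches described above, the following NumPy sketch applies a 3×3 convolution with a sigmoid, and a 1×7 plus 7×1 strip convolution pair with a final activation. Several things here are assumptions rather than details from the patent: batch normalization is omitted for brevity, the edge branch activation is taken to be ReLU (the text does not name it), the weights are random, and the helper names `conv2d_same`, `texture_branch`, and `edge_branch` are hypothetical.

```python
import numpy as np


def conv2d_same(x, w):
    """Zero-padded 'same' convolution. x: (C_in, H, W); w: (C_out, C_in, kh, kw)."""
    c_out, _, kh, kw = w.shape
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, H, W))
    for i in range(kh):
        for j in range(kw):
            # Contribution of kernel tap (i, j), mixing input channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out


def texture_branch(x, w3):
    """3x3 convolution followed by a sigmoid, giving a response map in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-conv2d_same(x, w3)))


def edge_branch(x, w_h, w_v):
    """Element-wise sum of a 1x7 and a 7x1 strip convolution, then ReLU (assumed)."""
    return np.maximum(conv2d_same(x, w_h) + conv2d_same(x, w_v), 0.0)


rng = np.random.default_rng(0)
C = 4
x = rng.normal(size=(C, 16, 16))
texture = texture_branch(x, rng.normal(size=(C, C, 3, 3)) * 0.1)
edges = edge_branch(x, rng.normal(size=(C, C, 1, 7)) * 0.1,
                    rng.normal(size=(C, C, 7, 1)) * 0.1)
print(texture.shape, edges.shape)
```

Both branches keep the spatial size of the input, which is what allows their outputs to be combined element-wise later in the unit.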

[0043] For example, the attention fusion unit includes a spatial attention branch and a channel attention branch, and is used to dynamically weight and fuse the texture and edge features of small targets. It should be noted that the attention fusion unit contains two parallel branches with identical structures, each consisting of a spatial attention branch and a channel attention branch. The first branch takes the output of the texture branch as input to dynamically modulate the texture features; the second branch takes the output of the edge branch as input to adaptively weight the edge features. After the two branches complete feature enhancement, their results are summed element-wise, achieving deep fusion of the texture information and edge features of small targets. Specifically, the spatial attention branch generates attention weights in the spatial dimension. The input features are first subjected to global average pooling and global max pooling operations to aggregate global response information in the spatial dimension. The two results are then concatenated along the channel dimension to obtain a fused spatial feature map. A 7×7 convolution is applied to this feature map for feature interaction and mapping in the spatial dimension, generating an initial spatial attention map. Finally, the attention map is normalized to the [0,1] interval using the Sigmoid activation function to obtain the final spatial attention map. The channel attention branch generates attention weights along the channel dimension. The output features of the spatial attention branch are processed by global average pooling to compress the spatial information of each channel into a vector. The dimension of this vector is reduced by a first fully connected or convolutional layer, and a non-linear transformation is introduced using the ReLU activation function. A second fully connected or convolutional layer then restores the original channel dimension. The output is mapped to the [0,1] interval using the Sigmoid activation function to obtain the importance weight of each channel. These weights are used to scale the input features channel by channel, strengthening feature channels rich in texture and edge information while weakening the responses of irrelevant channels.
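
The two-stage attention described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the two pooling operations of the spatial branch are taken across the channel axis so that each yields an H×W map (the usual reading when the results are concatenated and passed to a 7×7 convolution), the spatial attention map is assumed to multiply the input features, the reduction ratio `r` and all weights are arbitrary, and the function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(x, w7):
    """x: (C, H, W); w7: (2, 7, 7) kernel that maps the two pooled maps to one map."""
    _, H, W = x.shape
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])       # (2, H, W)
    p = np.pad(pooled, ((0, 0), (3, 3), (3, 3)))
    logits = np.zeros((H, W))
    for i in range(7):
        for j in range(7):
            logits += np.tensordot(w7[:, i, j], p[:, i:i + H, j:j + W], axes=1)
    return x * sigmoid(logits)                                # broadcast over channels


def channel_attention(x, w_down, w_up):
    """w_down: (C // r, C) reduces the channel vector; w_up: (C, C // r) restores it."""
    v = x.mean(axis=(1, 2))                                   # global average pooling -> (C,)
    weights = sigmoid(w_up @ np.maximum(w_down @ v, 0.0))     # FC -> ReLU -> FC -> sigmoid
    return x * weights[:, None, None]                         # channel-by-channel scaling


def attention_branch(x, w7, w_down, w_up):
    return channel_attention(spatial_attention(x, w7), w_down, w_up)


def attention_fusion(texture, edge, params_t, params_e):
    """Two structurally identical branches, summed element-wise."""
    return attention_branch(texture, *params_t) + attention_branch(edge, *params_e)


rng = np.random.default_rng(0)
C, r = 8, 4


def make_params():
    return (rng.normal(size=(2, 7, 7)) * 0.1,
            rng.normal(size=(C // r, C)), rng.normal(size=(C, C // r)))


texture, edge = rng.normal(size=(C, 12, 12)), rng.normal(size=(C, 12, 12))
params_t, params_e = make_params(), make_params()
fused = attention_fusion(texture, edge, params_t, params_e)
print(fused.shape)
```

Because both attention maps lie in the [0,1] interval, each branch can only attenuate its input, never amplify it; the emphasis on informative regions and channels is therefore relative.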

[0044] For example, in training mode, the overall process of the DRSS-Unit module can be described as follows. First, the input features are fed into the three-branch convolutional fusion unit and the edge branch unit, respectively, to obtain the output features of the three-branch convolutional fusion unit and of the edge branch. Then, the output features of the three-branch convolutional fusion unit are input into the texture branch unit to obtain the output features of the texture branch. The output features of the texture branch are input into the first branch of the attention fusion unit to obtain modulated texture features. Simultaneously, the output features of the edge branch are input into the second branch of the attention fusion unit to obtain enhanced edge features. Finally, the texture features and edge features are summed element-wise to obtain the final output features of the DRSS-Unit module. It should be noted that in inference mode, as shown in Figure 3, the linear additivity of convolution allows the 3×3 convolution branch, the 1×1 convolution branch, and the identity mapping branch of the three-branch convolutional fusion unit, together with their batch normalization layers, to be structurally reparameterized and folded into a single equivalent convolutional layer. The output of this equivalent convolution is used directly as the input to the texture branch unit, which removes the explicit multi-branch summation and significantly improves inference speed while maintaining functional equivalence. The rest of the processing flow in inference mode is the same as in training mode: the texture features and edge features are summed element-wise to obtain the output features of the DRSS-Unit module.
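
The folding step relies on two facts: batch normalization with fixed statistics is a per-channel affine map, and convolutions with aligned kernels add. The NumPy sketch below checks this numerically. It is an illustration under assumptions, not the patented code: the batch normalization layers use fixed (inference-time) statistics, the input and output channel counts are equal so that the identity branch is well defined, and the helper names `fold_bn` and `reparameterize` are hypothetical.

```python
import numpy as np


def conv2d_same(x, w, b=None):
    """Zero-padded 'same' convolution. x: (C_in, H, W); w: (C_out, C_in, kh, kw)."""
    c_out, _, kh, kw = w.shape
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, H, W))
    for i in range(kh):
        for j in range(kw):
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + H, j:j + W])
    return out if b is None else out + b[:, None, None]


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Batch normalization with fixed statistics (inference behaviour)."""
    scale = gamma / np.sqrt(var + eps)
    return x * scale[:, None, None] + (beta - mean * scale)[:, None, None]


def fold_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch normalization layer into the preceding kernel; returns (kernel, bias)."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None, None, None], beta - mean * scale


def reparameterize(w3, bn3, w1, bn1, bn_id):
    """Merge the 3x3, 1x1 and identity branches into one 3x3 kernel and bias."""
    c = w3.shape[0]
    k3, b3 = fold_bn(w3, *bn3)
    k1, b1 = fold_bn(np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1))), *bn1)  # 1x1 -> centre of 3x3
    ident = np.zeros((c, c, 3, 3))
    ident[np.arange(c), np.arange(c), 1, 1] = 1.0                          # identity as a 3x3 kernel
    k_id, b_id = fold_bn(ident, *bn_id)
    return k3 + k1 + k_id, b3 + b1 + b_id


rng = np.random.default_rng(0)
C = 4


def rand_bn():
    # (gamma, beta, running mean, running variance)
    return (rng.normal(size=C), rng.normal(size=C),
            rng.normal(size=C), rng.uniform(0.5, 2.0, size=C))


x = rng.normal(size=(C, 8, 8))
w3, w1 = rng.normal(size=(C, C, 3, 3)), rng.normal(size=(C, C, 1, 1))
bn3, bn1, bn_id = rand_bn(), rand_bn(), rand_bn()

multi = (batchnorm(conv2d_same(x, w3), *bn3)
         + batchnorm(conv2d_same(x, w1), *bn1)
         + batchnorm(x, *bn_id))
k, b = reparameterize(w3, bn3, w1, bn1, bn_id)
single = conv2d_same(x, k, b)
print(np.allclose(multi, single))
```

The equivalence holds only once the batch normalization statistics are frozen, which is why the multi-branch form is kept during training and the folded form is used for inference.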

[0045] Step 120: Input the shallow edge enhancement features into the MGSP-Unit module for processing, which takes into account both the target spatial location and the preliminary semantic abstraction information, to obtain the mid-level semantic transition features.

[0046] Shallow features refer to the high-resolution feature maps output in the early stages of the backbone network. They are characterized by a large spatial size and a small receptive field, preserving rich spatial details such as low-level visual attributes like target edges, contours, and textures. For small target detection tasks, shallow features are key information carriers for accurately locating the target and capturing its geometric shape.

[0047] For example, the shallow edge enhancement features obtained in step 110 are input into the MGSP-Unit module for processing. As shown in Figure 4, the MGSP-Unit module includes a global perception unit and a saliency reconstruction unit, used to accurately capture key features of small targets. The global perception unit includes parallel depthwise separable convolution branches with different receptive fields, which decompose a large convolutional kernel into a combination of local and strip convolutions, expanding the receptive field along the horizontal and vertical directions and enhancing the ability to capture contextual information around small targets. Optionally, these parallel branches include a 5×5 depthwise separable convolution, a 1×11 depthwise separable convolution, and an 11×1 depthwise separable convolution, which capture local details, long-range context in the horizontal direction, and long-range context in the vertical direction, respectively. A feature identity mapping branch preserves the original feature information of small targets. An attention fusion unit fuses the output features of the parallel depthwise separable convolution branches with the output features of the identity mapping branch, enhancing the expressive power of the fused features and improving the global perception accuracy of small target details. In operation, the global perception unit processes the input features through the parallel 5×5, 1×11, and 11×1 depthwise separable convolutions to capture contextual information from different receptive fields, yielding one set of output features per receptive field. These output features, along with the original input features preserved by the identity mapping, are then fed into the attention fusion unit for adaptive fusion, yielding the fused features. Finally, the fused features are summed element-wise with the original input features to obtain the output features of the global perception unit.
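
The branch layout of the global perception unit can be sketched in NumPy as below. The depthwise separable convolutions follow the description (a per-channel spatial kernel followed by a 1×1 pointwise mixing step), but the fusion rule is only a stand-in: the patent text in this section does not specify how the attention fusion unit weights the branches, so this sketch assumes a softmax over one scalar score per branch. The names `depthwise_conv`, `separable_conv`, and `global_perception` are hypothetical.

```python
import numpy as np


def depthwise_conv(x, k):
    """x: (C, H, W); k: (C, kh, kw), one kernel per channel, zero 'same' padding."""
    _, H, W = x.shape
    kh, kw = k.shape[1:]
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(x.shape)
    for i in range(kh):
        for j in range(kw):
            out += k[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out


def separable_conv(x, k_dw, w_pw):
    """Depthwise convolution followed by a 1x1 pointwise convolution, w_pw: (C, C)."""
    return np.einsum("oc,chw->ohw", w_pw, depthwise_conv(x, k_dw))


def global_perception(x, branches):
    """branches: list of (depthwise kernel, pointwise weights) for 5x5, 1x11, 11x1."""
    feats = [separable_conv(x, k, w) for k, w in branches] + [x]   # last entry: identity branch
    stacked = np.stack(feats)                                      # (4, C, H, W)
    scores = stacked.mean(axis=(1, 2, 3))                          # assumed: one score per branch
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                       # softmax over branches
    fused = np.tensordot(weights, stacked, axes=1)                 # weighted sum -> (C, H, W)
    return fused + x                                               # element-wise residual sum


rng = np.random.default_rng(0)
C = 4
x = rng.normal(size=(C, 16, 16))
branches = [(rng.normal(size=(C, kh, kw)) * 0.1, rng.normal(size=(C, C)) * 0.1)
            for kh, kw in [(5, 5), (1, 11), (11, 1)]]
out = global_perception(x, branches)

# A single impulse passed through a 1x11 kernel of ones spreads over 11 columns of one row.
impulse = np.zeros((1, 15, 15))
impulse[0, 7, 7] = 1.0
spread = depthwise_conv(impulse, np.ones((1, 1, 11)))
print(out.shape, int(spread.sum()), int((spread[0].sum(axis=1) > 0).sum()))
```

The impulse check at the end shows why strip kernels are attractive here: a 1×11 kernel reaches 11 positions horizontally while touching only one row, at a fraction of the cost of a full 11×11 kernel.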

[0048] For example, the saliency reconstruction unit first processes the output features of the global perception unit through layer normalization to obtain normalized perception features. The normalized perception features are then processed through a reference signal path and a context importance path to obtain basic gating weight features and context importance weight features, respectively. The basic gating weight features, context importance weight features, and normalized perception features are multiplied element-wise to obtain enhanced features, and the enhanced features are fused with the normalized perception features through a residual connection to obtain the enhanced output features. The reference signal path includes a feature extraction channel and an activation function; the context importance path includes a pooling layer, a convolutional layer, an activation function, and bilinear upsampling. Specifically, the reference signal path extracts at least one representative channel from the input feature map, such as channel 0, as a reference signal, which is processed by the activation function to generate the basic gating weight features. By using specific channels already present in the feature map as guides, it reduces computational redundancy while preserving key spatial prior information. The context importance path first downsamples through a pooling layer to aggregate local information, then performs channel interaction via a 1×1 convolution and generates saliency weights using an activation function. Finally, bilinear upsampling restores the weight map to its original size, so that local importance is computed by compression, activation, and restoration. In summary, the saliency reconstruction unit dynamically modulates and enhances small target features through dual-path weight generation and residual fusion. The reference signal path extracts representative channels to generate basic gating weights, providing lightweight spatial guidance. The context importance path calculates local saliency weights through "compression-activation-restoration" to capture region-level contextual responses. The weights generated by the two paths are multiplied element-wise with the normalized perception features, and the original information is then fused back through the residual connection, so that key features of small targets are captured accurately and noise is suppressed. Under the combined action of the global perception unit and the saliency reconstruction unit, the MGSP-Unit module expands the receptive field through multi-branch depthwise separable convolution to capture multi-scale contextual information, and uses dual-path dynamic weight modulation to enhance feature saliency and suppress noise. Thus, while preserving the precise spatial location of the target, it introduces preliminary semantic abstraction, carries out the transition and reconstruction from shallow detail features to mid-level semantic features, and finally generates mid-level semantic transition features that combine spatial localization ability with semantic expressiveness.
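
The dual-path gating can be written compactly. The NumPy sketch below follows the order of operations in the description, with several assumptions that are not stated in the text: layer normalization is taken over the channel axis at each position, both activation functions are sigmoids, the pooling layer is 2×2 average pooling, and the bilinear resize uses corner-aligned sampling. The function names and the weight `w_ctx` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bilinear_resize(m, H, W):
    """m: (C, h, w) -> (C, H, W) by bilinear interpolation (corner-aligned sampling)."""
    _, h, w = m.shape
    ys, xs = np.linspace(0, h - 1, H), np.linspace(0, w - 1, W)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[None, :, None], (xs - x0)[None, None, :]
    top = m[:, y0][:, :, x0] * (1 - wx) + m[:, y0][:, :, x1] * wx
    bottom = m[:, y1][:, :, x0] * (1 - wx) + m[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy


def saliency_reconstruction(x, w_ctx, pool=2, eps=1e-5):
    """x: (C, H, W) with H and W divisible by `pool`; w_ctx: (C, C) 1x1 convolution weights."""
    C, H, W = x.shape
    n = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)    # layer norm over channels (assumed)
    gate = sigmoid(n[:1])                                      # reference path: channel 0 -> (1, H, W)
    pooled = n.reshape(C, H // pool, pool, W // pool, pool).mean(axis=(2, 4))   # compression
    ctx = sigmoid(np.einsum("oc,chw->ohw", w_ctx, pooled))     # activation
    ctx = bilinear_resize(ctx, H, W)                           # restoration to original size
    enhanced = gate * ctx * n                                  # element-wise product of three terms
    return enhanced + n                                        # residual connection


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8, 8))
out = saliency_reconstruction(x, rng.normal(size=(6, 6)) * 0.5)
print(out.shape)
```

Under these assumptions the output equals the normalized features scaled by a factor between 1 and 2 at every position, so salient locations are amplified relative to the rest while no information is zeroed out.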

[0049] Step 130: Input the mid-level semantic transition features into the MGSP-Unit module for processing, which captures the global context of the target and deep category abstraction information, to obtain deep sparse semantic features.

[0050] Mid-level features refer to the medium-resolution feature maps output from the intermediate stages of the backbone network. Their spatial size typically lies between that of shallow and deep features, with a moderate receptive field. In terms of information content, these features combine shallow spatial details with deep semantic abstraction: on the one hand, they retain some target location and edge information after progressive downsampling; on the other hand, through abstraction over multiple convolutions, they begin to encode the target's category semantics and contextual relationships. For example, under the combined action of the global perception unit and the saliency reconstruction unit, the MGSP-Unit module further expands the global receptive field through depthwise separable convolutions with different receptive fields, capturing the long-range contextual dependencies of small targets. It also uses the dual-path dynamic weight modulation mechanism in the saliency reconstruction unit to perform sparsity filtering and semantic enhancement of features. This suppresses complex background noise under a large field of view while preserving key target responses, distilling the mid-level semantic features into high-level abstract semantics and ultimately generating deep sparse semantic features rich in global contextual information and category discrimination capability.

[0051] Step 140: Input the deep sparse semantic features into the SDF-AIFI module for processing to obtain frequency domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network.

[0052] Deep features refer to the low-resolution feature maps output in the final stage of the backbone network. They have a small spatial size and a large receptive field, and after multiple downsampling and nonlinear transformations, they gradually focus on the high-level semantic abstraction of the target. These features are characterized by significantly compressed spatial details, but encode rich category semantics, global context, and long-range dependencies between targets along the channel dimension. For small target detection tasks, although deep features have weaker spatial localization capabilities, they can provide crucial category discrimination information, helping to distinguish small targets from interference in complex backgrounds. Simultaneously, the large receptive field captures the environmental context of the target, thereby improving the robustness and accuracy of detection.

[0053] For example, as shown in Figure 5, the SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network. The multi-head attention unit includes a window partitioning projection branch and a dilated mask branch. The window partitioning projection branch segments the deep sparse semantic features into non-overlapping local windows and generates a query vector, a key vector, and a value vector within each window through linear projection. The dilated mask branch introduces a preset dilated mask when the attention scores of the query vector and key vector are calculated, so as to preserve the correlation between pixels at specific spatial intervals within the window, simulating the wide receptive field of dilated convolution. Optionally, the processing flow of the multi-head attention unit is as follows: First, the input deep sparse semantic features are spatially partitioned into multiple non-overlapping local windows to obtain windowed feature blocks. Within each local window, the corresponding query Q, key K, and value V are generated through three linear projections. Then, a scaled dot product of the query Q and key K is computed to obtain an initial attention score matrix. The preset dilated mask is added to the initial attention score matrix to obtain a sparsified attention score matrix. The dilated mask is configured to preserve only the associations between pixels at specific spatial intervals within the window, thereby simulating the sparse connection pattern of dilated convolution within the local window. Next, the sparsified attention score matrix is multiplied by the value V to obtain the single-head attention output features. Finally, the output features of the multiple attention heads are concatenated and fused using linear projection to obtain the multi-head window sparse attention enhancement features, which serve as the output features of the multi-head attention unit. Through the dilated mask mechanism, the multi-head attention unit overcomes the field-of-view limitation of the local window without significantly increasing computational cost, expanding the contextual awareness range for small targets.
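By way of illustration only, the windowed attention with an additive dilated mask described above can be sketched in NumPy as follows. The function names, the window size, the dilation interval, the use of an additive 0 / negative-infinity mask, and the Softmax normalization before multiplying by V are assumptions made for this sketch; the embodiment does not specify them.

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W, C) map into non-overlapping ws x ws windows: (num_windows, ws*ws, C)."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def dilated_mask(ws, dilation):
    """Additive mask (assumed form): 0 where row and column offsets are both
    multiples of `dilation`, -inf elsewhere."""
    ys, xs = np.divmod(np.arange(ws * ws), ws)
    dy = np.abs(ys[:, None] - ys[None, :])
    dx = np.abs(xs[:, None] - xs[None, :])
    keep = (dy % dilation == 0) & (dx % dilation == 0)
    return np.where(keep, 0.0, -np.inf)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_dilated_attention(x, wq, wk, wv, wo, ws=4, dilation=2, heads=2):
    """Multi-head attention inside each local window with a dilated mask."""
    C = x.shape[-1]
    d = C // heads
    win = window_partition(x, ws)                  # (N, T, C)
    q, k, v = win @ wq, win @ wk, win @ wv         # three linear projections
    mask = dilated_mask(ws, dilation)              # (T, T)
    outs = []
    for h in range(heads):
        sl = slice(h * d, (h + 1) * d)
        scores = q[..., sl] @ k[..., sl].transpose(0, 2, 1) / np.sqrt(d)
        attn = softmax(scores + mask)              # masked pairs get weight 0
        outs.append(attn @ v[..., sl])
    return np.concatenate(outs, axis=-1) @ wo      # concatenate heads, then project
```

Each pixel always attends to itself (offset zero), so no row of the masked score matrix is fully suppressed.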

[0054] For example, as shown in Figure 5, the self-attention unit includes a global attention weight generation branch, which redistributes attention weights over the output features of the multi-head attention unit through linear projection, scaled dot product, and activation function processing; and a feature sparse transformation and splitting branch, which performs sparse encoding and channel splitting on the output features of the multi-head attention unit through linear projection, scaling, and activation function processing. Optionally, the processing flow of the self-attention unit is as follows: the output features of the multi-head attention unit are first processed by layer normalization to stabilize the feature distribution, and then three parallel linear projections generate the query vector Q, key vector K, and value vector V, respectively. On this basis, the global attention weight generation branch computes the scaled dot product of Q and K to obtain an initial attention score matrix, which is normalized by the Softmax activation function and multiplied by the learnable parameter W1 to adaptively adjust the intensity of the global attention distribution. At the same time, the feature sparse transformation and splitting branch splits the value vector V into an identity mapping path and a cascaded processing path. In the cascaded path, V is successively processed by a scaling module to adjust its amplitude and by a ReLU activation function to suppress background noise, and is then multiplied by the learnable parameter W2 to generate sparse content-gated features. Subsequently, the global attention weights are multiplied with the identity mapping of V (the first multiplication), completing the standard global context feature aggregation. The aggregation result is then multiplied element-wise with the sparse content-gated features (the second multiplication), so that the sparse mask generated by ReLU filters the features and further suppresses background noise. Finally, the filtered features are added to the original output features of the multi-head attention unit through a residual connection to obtain the self-attention output features, thereby precisely enhancing small target features while preserving key semantic responses.
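As a minimal sketch of the processing flow above, assuming that W1, W2, and the scaling factor are scalars and that the first multiplication is the usual matrix product of the attention weights with V, the dual-path computation could look like this in NumPy. All function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its channel dimension (no learned affine terms)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_gated_self_attention(x, wq, wk, wv, w1=1.0, w2=1.0, scale=1.0):
    """x: (T, C) tokens output by the multi-head attention unit."""
    h = layer_norm(x)
    q, k, v = h @ wq, h @ wk, h @ wv               # three parallel projections

    # Global attention weight generation branch (W1 assumed to be a scalar).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) * w1

    # Cascaded path: scale -> ReLU -> multiply by W2 gives sparse content gates.
    gate = np.maximum(scale * v, 0.0) * w2

    agg = attn @ v                                 # first multiplication (identity path of V)
    filtered = agg * gate                          # second multiplication; ReLU zeros act as a mask
    return x + filtered                            # residual connection
```

Wherever the projected value is non-positive, the ReLU gate is zero and the output at that position falls back to the residual input, which is how the gate suppresses responses in this sketch.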

[0055] For example, as shown in Figure 5, the feedforward network includes a convolutional branch and a depthwise separable convolutional branch arranged in parallel, which extract local window features and global spatial context features of small targets in parallel and provide rich multi-scale representations through feature fusion; and a frequency domain transformation branch, which transforms the small target features to the frequency domain using the Fast Fourier Transform, adaptively filters frequency components using learnable spectral gating, and then inversely transforms the result back to the spatial domain to enhance the global structural feature representation. Optionally, the processing flow of the feedforward network is as follows: the output features of the self-attention unit are first processed by layer normalization to stabilize the feature distribution, and are then split and fed into the parallel convolutional and depthwise separable convolutional branches. In the convolutional branch, the features first undergo 1×1 convolution for channel interaction and, after nonlinear mapping by the GeLU activation function, enter the unfolding module, which spatially segments the feature map into overlapping local blocks in preparation for subsequent local frequency domain analysis. In the depthwise separable convolutional branch, the features are first processed by depthwise separable convolution to capture spatial context information and, after the GeLU activation function, undergo 1×1 convolution for channel adjustment to match the feature dimension of the local window encoding branch. The local window features output by the convolutional branch and the spatial context features output by the depthwise separable convolutional branch are added element-wise to obtain the fused spatial domain features. Subsequently, the fused features are transformed from the spatial domain to the frequency domain using a Fast Fourier Transform (FFT). In the frequency domain, they are multiplied element-wise with a learnable global frequency domain weight matrix Wfreq to achieve adaptive spectral gating filtering. The global frequency domain weight matrix can adaptively filter frequency components, for example enhancing high-frequency signals that represent edge details or suppressing low-frequency components that represent background noise. The filtered frequency domain features are restored to the spatial domain using an Inverse Fast Fourier Transform (IFFT), and the local blocks are then reassembled into a complete two-dimensional feature map through a folding operation. Finally, the reassembled features are added to the original output features of the self-attention unit through a residual connection to output the final frequency domain global enhancement features. Through the above dual-path encoding and frequency domain gating collaborative mechanism, the feedforward network enhances the preservation of local details of small targets while compensating for the weakness of convolutional networks in capturing global structural information.
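The spectral gating step alone can be illustrated with the following NumPy sketch. It assumes a real-valued (H, W, C) feature map and a real weight matrix Wfreq shared across channels, uses the real-input FFT so that the inverse transform is real by construction, and omits the unfolding and folding of local blocks; these simplifications and the function name are assumptions, not details taken from the embodiment.

```python
import numpy as np

def spectral_gate(x, w_freq):
    """Apply learnable spectral gating to an (H, W, C) feature map.

    w_freq has shape (H, W // 2 + 1), matching the half-spectrum that
    np.fft.rfft2 returns for real input (an assumption of this sketch).
    """
    H, W, _ = x.shape
    spec = np.fft.rfft2(x, axes=(0, 1))              # spatial domain -> frequency domain
    spec = spec * w_freq[..., None]                  # element-wise gating per frequency bin
    y = np.fft.irfft2(spec, s=(H, W), axes=(0, 1))   # back to the spatial domain
    return x + y                                     # residual connection
```

With Wfreq set to all ones the gate passes every frequency, so the result is simply twice the input; setting the entry at index (0, 0) to zero removes the per-channel mean, which is the simplest case of suppressing a low-frequency component.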

[0056] The SDF-AIFI module, which integrates a multi-head attention unit, a self-attention unit, and a feedforward network, optimizes the deep sparse semantic features. Specifically, the multi-head attention unit expands the context-aware range using the dilated mask mechanism, the self-attention unit suppresses background noise through dual-path sparse transformation, and the feedforward network employs a frequency-domain gating collaborative mechanism to enhance global structural features. The final output is a frequency domain global enhancement feature that is rich in global contextual information and in which background noise is effectively suppressed.

[0057] Step 150: Input the frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features into the CCFM module for feature reconstruction to obtain reconstructed fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit is used to retain the spatial information of small targets during the resolution reduction process, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features.

[0058] Feature reconstruction is a technique in computer vision that recombines, calibrates, and enhances original features to generate high-quality feature representations. Addressing the challenge of small targets having low pixel counts and being easily obscured by background, a CCFM module is introduced to collaboratively reconstruct frequency-domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features across scales. For example, the lossless spatial downsampling unit ensures zero loss of spatial information for small targets during resolution reduction, while the dynamic aggregation unit achieves adaptive fusion of cross-scale features through a gated shift mechanism, ultimately generating reconstructed fused features that retain fine details and are rich in high-level semantics.

[0059] For example, the CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit is used to preserve the spatial information of small targets during resolution reduction, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features. As shown in Figure 6, the lossless spatial downsampling unit includes a space-to-depth transformation branch, which splits adjacent pixels in the spatial dimensions and rearranges them into the channel dimension according to a scaling factor, in order to preserve the high-frequency details and original pixel information of small targets; a dynamic reweighting branch, which uses global average pooling, a convolutional layer, and the Sigma operator to adaptively weight the sub-pixel channels of the stacked features, suppressing background noise and enhancing target information; and a feature correction branch, which uses global average pooling, a convolutional layer, and the Tanh activation function to dynamically correct and compensate for the amplitude of the features. Optionally, the processing flow of the lossless spatial downsampling unit is as follows: the input features first pass through the space-to-depth transformation branch, which splits adjacent pixels in the spatial dimensions and rearranges them into the channel dimension according to a preset scaling factor of 2, resulting in stacked features that retain all original pixel information. Subsequently, the stacked features are fed into two parallel sub-branches: the dynamic reweighting branch and the feature correction branch. The dynamic reweighting branch uses global average pooling and a 1×1 convolutional layer to perceive the global context, generates dynamic weights using the Sigma operator, and multiplies them element-wise with the stacked features to adaptively suppress the sub-pixel channels corresponding to background noise and enhance target information. Simultaneously, the feature correction branch extracts feature distribution information through global average pooling and a 1×1 convolutional layer. Bias parameters are generated using the Tanh activation function and added element-wise to the multiplied features to dynamically correct and compensate for the amplitude of the filtered features, preventing excessive attenuation of weak target signals. Finally, the features that have undergone this multiply-then-add transformation are subjected to channel compression and information exchange via a 1×1 convolution to obtain the output features of the lossless spatial downsampling unit.
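The lossless spatial downsampling flow above can be sketched as follows. The sketch treats the Sigma operator as the sigmoid function and models each 1×1 convolution on the pooled vector as a matrix multiplication; both are assumptions, as are the function and weight names.

```python
import numpy as np

def space_to_depth(x, r=2):
    """Move each r x r spatial block of an (H, W, C) map into the channel axis.

    No value is discarded: the output (H/r, W/r, r*r*C) is a pure rearrangement.
    """
    H, W, C = x.shape
    x = x.reshape(H // r, r, W // r, r, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H // r, W // r, r * r * C)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lossless_downsample(x, w_gate, w_bias, w_out, r=2):
    """Space-to-depth, then per-channel reweighting and correction, then 1x1 compression."""
    s = space_to_depth(x, r)              # stacked features
    ctx = s.mean(axis=(0, 1))             # global average pooling -> (r*r*C,)
    weights = sigmoid(ctx @ w_gate)       # dynamic reweighting branch (sigmoid assumed)
    bias = np.tanh(ctx @ w_bias)          # feature correction branch
    y = s * weights + bias                # multiply, then add
    return y @ w_out                      # 1x1 convolution for channel compression
```

For a 4×4 single-channel map holding the values 0 to 15 in row-major order, the top-left output position receives the four values 0, 1, 4, and 5 as channels, so every input pixel survives the resolution reduction.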

[0060] For example, as shown in Figure 7, the dynamic aggregation unit includes a motion-gated branch, which uses grouped cyclic shifting to enable pixels to perceive neighborhood information, followed by convolutional fusion and an activation function to generate a spatially gated weight map; and a content interaction branch, which uses contextual priors and expanded features to generate keys and queries, calculates an affinity matrix, and predicts dynamic convolution kernel parameters for adaptive filtering to extract morphological features of small targets. Optionally, the processing flow of the dynamic aggregation unit is as follows: the input feature tensor is first expanded in the channel dimension by 1×1 convolution to enrich the feature representation; the expanded features are then split into three paths, which are fed into the motion-gated branch, the content interaction branch, and the dynamic convolution data path, respectively. The motion-gated branch groups the input features in the channel dimension and performs left / right / up / down cyclic shifting so that each pixel perceives neighborhood information, followed by 1×1 convolutional fusion and a sigmoid activation function to generate a spatially gated weight map. The content interaction branch introduces external contextual priors, generating a key vector K through global average pooling and 1×1 convolution. Simultaneously, the expanded features are convolved into a query vector Q. The two are multiplied to calculate an affinity matrix, and the dynamic convolution kernel parameters are predicted via a linear layer. This dynamic convolution kernel is applied to the expanded features to adaptively extract morphological features of small targets. Subsequently, the dynamic convolutional features output from the content interaction branch are multiplied element-wise with the spatial gating weights output from the motion-gated branch to suppress background noise. The modulated features are reduced to the original number of channels via 1×1 convolution and finally added to the original input features through a residual connection to obtain the output features of the dynamic aggregation unit. Through the dual mechanisms of the motion-gated branch and the content interaction branch, the dynamic aggregation unit achieves accurate capture and cross-scale adaptive fusion of small target features. Finally, the frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features are input into the CCFM module for feature reconstruction. The CCFM module aligns the spatial resolution of the multi-scale features through the upsampling unit, uses the lossless spatial downsampling unit to fully preserve the pixel-level information of small targets during downsampling, and uses the dynamic aggregation unit to achieve adaptive fusion and gated modulation of cross-scale features. After these processes, the module outputs reconstructed fused features that simultaneously contain rich spatial details and enhanced semantic information with background noise effectively suppressed, providing high-quality feature representations for the subsequent decoder to achieve high-precision detection of small targets.
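A minimal NumPy sketch of the motion-gated branch and the final gated modulation is given below. It assumes four equal channel groups shifted by one pixel each and models 1×1 convolutions as matrix multiplications over the channel axis. The content interaction branch is not reproduced: its output is passed in as the hypothetical argument `dyn`. All names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grouped_cyclic_shift(x):
    """Split the channels of an (H, W, C) map into 4 groups and roll each group
    by one pixel: left, right, up, down (wrap-around at the borders)."""
    groups = np.array_split(x, 4, axis=-1)
    shifts = [(-1, 1), (1, 1), (-1, 0), (1, 0)]    # (step, axis) per group
    rolled = [np.roll(g, step, axis=ax) for g, (step, ax) in zip(groups, shifts)]
    return np.concatenate(rolled, axis=-1)

def motion_gate(expanded, w_fuse):
    """Spatially gated weight map: shift, 1x1 convolutional fusion, sigmoid."""
    return sigmoid(grouped_cyclic_shift(expanded) @ w_fuse)

def dynamic_aggregate(x, dyn, w_expand, w_fuse, w_reduce):
    """x: (H, W, C) input; dyn: (H, W, Ce) stand-in for the content branch output."""
    expanded = x @ w_expand                        # 1x1 convolution, channel expansion
    gate = motion_gate(expanded, w_fuse)           # (H, W, Ce)
    modulated = dyn * gate                         # gate the dynamic convolution features
    return x + modulated @ w_reduce                # reduce channels, residual connection
```

Because each group is rolled in a different direction, the 1×1 fusion that follows sees, at every position, channels taken from the four neighbouring pixels, which is what lets a pointwise operation perceive the neighbourhood.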

[0061] Step 160: Input the reconstructed fusion features into the decoder to obtain the small target detection result.

[0062] As a key component in the object detection model, the decoder undertakes the core task of mapping abstract feature representations to specific detection results. For example, the decoder employs a multi-layer Transformer structure, taking the reconstructed fusion features output from the CCFM module as input. Through hierarchical self-attention and cross-attention mechanisms, it progressively refines the spatial boundaries of small targets and accurately determines their category. Each layer of the decoder iteratively optimizes and progressively updates the target query vector, ensuring that the position and shape of the predicted bounding box continuously approximate the true target boundary. To improve training efficiency and model performance, each layer of the decoder is equipped with an auxiliary prediction head to generate supervision signals for intermediate layers, forming a deep supervision mechanism. This design alleviates the gradient vanishing problem during deep network training, accelerates model convergence, and enhances the model's localization accuracy and classification confidence for small targets through multi-level feature optimization. Finally, the decoder outputs a series of predicted bounding boxes and their corresponding category confidence scores, completing the full transformation from reconstructed fusion features to the final detection result, achieving high-precision detection of small targets in complex scenes.

[0063] In this embodiment, the primary features obtained from image convolution are input into a two-layer DRSS-Unit module, which extracts the edge set and spatial texture details of small targets to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit. The shallow edge enhancement features are input into the MGSP-Unit module, which takes into account both the target spatial location and preliminary semantic abstraction information, to obtain mid-level semantic transition features. The mid-level semantic transition features are input into the MGSP-Unit module again to capture the target's global context and deep category abstraction information, resulting in deep sparse semantic features. The MGSP-Unit module includes a global perception unit and a saliency reconstruction unit, which are used to accurately capture key features of small targets. The deep sparse semantic features are then input into the SDF-AIFI module to obtain frequency domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network. The frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features are then input into the CCFM module for feature reconstruction, resulting in reconstructed fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit preserves the spatial information of small targets during resolution reduction, and the dynamic aggregation unit adaptively fuses cross-scale features. Through the cascaded design of the DRSS-Unit, MGSP-Unit, SDF-AIFI, and CCFM modules, progressive feature reconstruction and enhancement of small targets, from shallow edge textures to deep semantic context, is achieved. The reconstructed fused features are then input into the decoder to obtain the small target detection result, enabling the detection of small targets in complex backgrounds.

[0064] Embodiment 2

[0065] Figure 8 is a schematic diagram of the small target detection device based on feature reconstruction provided in Embodiment 2 of the present invention. As shown in Figure 8, the device includes:

[0066] The shallow edge enhancement feature acquisition module 210 is used to input the primary features obtained by convolution of the image into a two-layer DRSS-Unit module for processing, and to extract the set of small target edges and spatial texture details to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit and an attention fusion unit.

[0067] The mid-level semantic transition feature acquisition module 220 is used to input the shallow edge enhancement features into the MGSP-Unit module for processing, taking into account both the target spatial location and the preliminary semantic abstraction information, to obtain the mid-level semantic transition features.

[0068] The deep sparse semantic feature acquisition module 230 is used to input the mid-level semantic transition features into the MGSP-Unit module for processing, in order to capture the target's global context and deep category abstraction information to obtain deep sparse semantic features.

[0069] The frequency domain global enhancement module 240 is used to input the deep sparse semantic features into the SDF-AIFI module for processing to obtain frequency domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network.

[0070] The feature reconstruction and fusion module 250 is used to input the frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features into the CCFM module for feature reconstruction to obtain reconstructed fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit is used to retain the spatial information of small targets during resolution reduction, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features.

[0071] The target detection module 260 is used to input the reconstructed fusion features into the decoder to obtain the small target detection result.

[0072] In the small target detection device based on feature reconstruction provided in this embodiment, the primary features obtained by convolution of an image are input into a two-layer DRSS-Unit module, which extracts the edge set and spatial texture details of small targets to obtain shallow edge enhancement features. The DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit. The shallow edge enhancement features are input into the MGSP-Unit module, which takes into account both the target's spatial location and preliminary semantic abstraction information, to obtain mid-level semantic transition features. The mid-level semantic transition features are input into the MGSP-Unit module again to capture the target's global context and deep category abstraction information, resulting in deep sparse semantic features. The MGSP-Unit module includes a global perception unit and a saliency reconstruction unit, which are used to accurately capture key features of small targets. The deep sparse semantic features are input into the SDF-AIFI module for processing, yielding frequency domain global enhancement features. The SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network. The frequency domain global enhancement features, shallow edge enhancement features, and mid-level semantic transition features are then input into the CCFM module for feature reconstruction, resulting in reconstructed fused features. The CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit. The lossless spatial downsampling unit preserves the spatial information of small targets during resolution reduction, while the dynamic aggregation unit adaptively fuses cross-scale features. Through the cascaded design of the DRSS-Unit, MGSP-Unit, SDF-AIFI, and CCFM modules, progressive feature reconstruction and enhancement of small targets, from shallow edge textures to deep semantic context, is achieved. The reconstructed fused features are then input into the decoder to obtain the small target detection result, enabling the detection of small targets in complex backgrounds.

[0073] Embodiment 3

[0074] Figure 9 is a schematic diagram of the structure of a server provided in Embodiment 3 of the present invention. Figure 9 shows a block diagram of an exemplary server 12 suitable for implementing embodiments of the present invention. The server 12 shown in Figure 9 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.

[0075] As shown in Figure 9, server 12 is presented in the form of a general-purpose computing server. The components of server 12 may include, but are not limited to: one or more processors or processing units 16, system memory 28, and bus 18 connecting different system components (including system memory 28 and processing unit 16).

[0076] Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

[0077] Server 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by server 12, including volatile and non-volatile media, removable and non-removable media.

[0078] System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and / or cache 32. Server 12 may further include other removable / non-removable, volatile / non-volatile computer system storage media. By way of example only, storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in Figure 9; commonly referred to as a "hard drive"). Although not shown in Figure 9, a disk drive for reading and writing to a removable non-volatile disk (e.g., a "floppy disk") and an optical disk drive for reading and writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 18 via one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present invention.

[0079] A program / utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. Program modules 42 typically perform the functions and / or methods described in the embodiments of the present invention.

[0080] Server 12 can also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable users to interact with server 12, and / or with any device (e.g., network card, modem, etc.) that enables server 12 to communicate with one or more other computing devices. This communication can be performed via input / output (I / O) interface 22. Furthermore, server 12 can also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and / or a public network such as the Internet) via network adapter 20. As shown, network adapter 20 communicates with other modules of server 12 via bus 18. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0081] The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the small target detection method based on feature reconstruction provided in the embodiments of the present invention.

[0082] Embodiment 4

[0083] Embodiment 4 of the present invention also provides a medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the small target detection method based on feature reconstruction as described in any of the above embodiments.

[0084] The computer storage medium of this invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0085] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, capable of sending, propagating, or transmitting programs for use by or in connection with an instruction execution system, apparatus, or device.

[0086] Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, or any suitable combination thereof.

[0087] Computer program code for performing the operations of this invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0088] Note that the above description is merely a preferred embodiment of the present invention and the technical principles employed. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include many other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the appended claims.

Claims

1. A method for detecting small targets based on feature reconstruction, characterized in that it includes:
inputting the primary features obtained by convolution of the image into a two-layer DRSS-Unit module for processing, which is used to extract the set of small target edges and spatial texture details, to obtain shallow edge enhancement features, wherein the DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit;
inputting the shallow edge enhancement features into the MGSP-Unit module for processing, which takes into account both the target spatial location and the preliminary semantic abstraction information, to obtain mid-level semantic transition features;
inputting the mid-level semantic transition features into the MGSP-Unit module for processing, which is used to capture the target global library context and deep category abstraction information, to obtain deep sparse semantic features, wherein the MGSP-Unit module includes a global perception unit and a saliency reconstruction unit, which are used to accurately capture the key features of small targets;
inputting the deep sparse semantic features into the SDF-AIFI module for processing to obtain frequency domain global enhancement features, wherein the SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network;
inputting the frequency domain global enhancement features, the shallow edge enhancement features, and the mid-level semantic transition features into the CCFM module for feature reconstruction to obtain reconstructed fusion features, wherein the CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit, the lossless spatial downsampling unit is used to retain the spatial information of small targets during resolution reduction, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features; and
inputting the reconstructed fusion features into the decoder to obtain the small target detection result.
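For illustration only, and not as part of the claims, the data flow recited in claim 1 can be sketched in Python. The function name `detect_small_targets`, the stage keys, and the `stub` helper are hypothetical; each stage is replaced by a stub that only records its inputs, and treating the two MGSP-Unit stages as separate callables is an assumption.

```python
def detect_small_targets(image, m):
    """Wire the stages of claim 1 together; `m` maps stage names to callables."""
    primary = m["conv"](image)
    # Two stacked DRSS-Unit layers yield the shallow edge enhancement features.
    shallow = m["drss_2"](m["drss_1"](primary))
    # First MGSP-Unit stage: mid-level semantic transition features.
    mid = m["mgsp_mid"](shallow)
    # Second MGSP-Unit stage: deep sparse semantic features.
    deep = m["mgsp_deep"](mid)
    # SDF-AIFI: frequency domain global enhancement features.
    freq = m["sdf_aifi"](deep)
    # CCFM reconstructs one fused feature from three levels.
    fused = m["ccfm"](freq, shallow, mid)
    return m["decoder"](fused)


def stub(name):
    """Hypothetical placeholder stage that returns a textual trace of its inputs."""
    return lambda *inputs: f"{name}({', '.join(inputs)})"


stage_names = ("conv", "drss_1", "drss_2", "mgsp_mid", "mgsp_deep",
               "sdf_aifi", "ccfm", "decoder")
stages = {name: stub(name) for name in stage_names}
trace = detect_small_targets("image", stages)
print(trace)
```

The printed trace shows that the shallow and mid-level features are consumed twice: once by the next stage in the cascade and once by the CCFM module.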

2. The method according to claim 1, characterized in that:
the three-branch convolutional fusion unit includes a 3×3 convolution branch, a 1×1 convolution branch, and an identity mapping branch, each followed by a batch normalization layer to improve the convergence performance of deep networks;
the texture branch unit includes a single-layer convolutional structure, a batch normalization layer, and an activation function, and uses the local perception capability of small convolutional kernels to preserve the texture details of tiny objects;
the edge branch unit includes a horizontal strip convolution branch and a vertical strip convolution branch, which are used to sharpen the contour boundaries of small targets and suppress background texture noise that is inconsistent with the edge direction; and
the attention fusion unit includes a spatial attention branch and a channel attention branch, which are used to dynamically weight and fuse the texture and edge features of small targets.
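As a rough illustration of the horizontal and vertical strip convolution branches in claim 2, the sketch below applies a 1×3 and a 3×1 kernel to a small single-channel map. The function `strip_conv`, the fixed difference kernel `[-1, 0, 1]`, and the replicate padding at the borders are assumptions chosen for the example; in a trained network the kernel weights would be learned.

```python
def strip_conv(image, kernel, horizontal=True):
    """Apply a 1-D strip kernel along rows (horizontal) or columns (vertical).

    Borders use replicate padding (indices are clamped), an assumption made
    here so that flat regions produce no spurious response at the edges.
    """
    height, width = len(image), len(image[0])
    radius = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i, weight in enumerate(kernel):
                offset = i - radius
                yy, xx = (y, x + offset) if horizontal else (y + offset, x)
                yy = min(max(yy, 0), height - 1)
                xx = min(max(xx, 0), width - 1)
                acc += weight * image[yy][xx]
            out[y][x] = acc
    return out


# A 4x4 map with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [-1, 0, 1]  # hypothetical fixed difference kernel
horizontal_response = strip_conv(image, kernel, horizontal=True)
vertical_response = strip_conv(image, kernel, horizontal=False)
print(horizontal_response[0])  # responds around the step edge
print(vertical_response[0])    # no response: rows are identical
```

The horizontal strip responds only near the step, while the vertical strip stays at zero, which is the directional selectivity that the edge branch relies on.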

3. The method according to claim 1, characterized in that the global perception unit includes:
depthwise separable convolution parallel branches with different receptive fields, which are used to decompose large convolutional kernels into a combination of local convolutions and strip convolutions, expanding the receptive field in the horizontal and vertical directions and enhancing the ability to capture contextual information of small targets;
a feature identity mapping branch, which is used to preserve the original feature information of small targets; and
an attention fusion unit, which is used to fuse the output features of the depthwise separable convolution parallel branches with different receptive fields and the output features of the feature identity mapping branch, thereby enhancing the expressive power of the fused features and improving the global perception accuracy of small target details.

4. The method according to claim 3, characterized in that the saliency reconstruction unit is used for:
processing the output features of the global perception unit by layer normalization to obtain normalized perception features;
passing the normalized perception features through a baseline signal path and a context importance path, respectively, to obtain basic gating weight features and context importance weight features;
multiplying the basic gating weight features, the context importance weight features, and the normalized perception features element-wise to obtain enhanced features; and
fusing the enhanced features with the normalized perception features via a residual connection to obtain enhanced output features;
wherein the baseline signal path includes a feature extraction channel and an activation function, and the context importance path includes a pooling layer, a convolutional layer, an activation function, and bilinear upsampling.
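The gating arithmetic of claim 4 can be sketched on a one-dimensional feature vector. This is a simplified illustration under several assumptions: the function names are hypothetical, the sigmoid is assumed as the activation function, learned layers (the feature extraction channel and the convolutional layer) are omitted, and nearest-neighbour repetition stands in for bilinear upsampling.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def saliency_reconstruct(x, pool=2):
    """Return (enhanced output, normalized features) for a 1-D feature vector."""
    normed = layer_norm(x)
    # Baseline signal path: activation only (learned projection omitted).
    base_gate = [sigmoid(v) for v in normed]
    # Context importance path: average pooling, activation, then upsampling
    # by repetition (a stand-in for bilinear upsampling).
    chunks = [normed[i:i + pool] for i in range(0, len(normed), pool)]
    pooled = [sum(chunk) / len(chunk) for chunk in chunks]
    context_gate = [sigmoid(pooled[i // pool]) for i in range(len(normed))]
    # Element-wise product of both gates and the normalized features.
    enhanced = [b * c * n for b, c, n in zip(base_gate, context_gate, normed)]
    # Residual connection with the normalized features.
    output = [e + n for e, n in zip(enhanced, normed)]
    return output, normed


output, normed = saliency_reconstruct([1.0, 2.0, 3.0, 4.0])
print(output)
```

Because both gates lie strictly between 0 and 1, each output element keeps the sign of its normalized input and is scaled by a factor between 1 and 2, so the residual path guarantees that no feature is suppressed below its normalized value.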

5. The method according to claim 1, characterized in that:
the multi-head attention unit includes a window partitioning projection branch and an expanded mask branch; the window partitioning projection branch divides the deep sparse semantic features into non-overlapping local windows and generates a query vector, a key vector, and a value vector through linear projection within each window; the expanded mask branch introduces a preset expanded mask when calculating the attention scores of the query vector and the key vector, which is used to preserve the correlation between pixels at a specific spatial interval within the window and simulate the wide receptive field of dilated convolution;
the self-attention unit includes a global attention weight generation branch, which redistributes attention weights over the output features of the multi-head attention unit through linear projection, scaled dot product, and activation function processing, and a feature sparse transformation and channel splitting branch, which sparsely encodes the output features of the multi-head attention unit and splits their channels through linear projection, scaling, and activation function processing; and
the feedforward network includes parallel convolution and depthwise separable convolution branches, which are used to extract local window features and global spatial context features of small targets in parallel and provide rich multi-scale representations through feature fusion, and a frequency domain transform branch, which is used to transform the features of small targets to the frequency domain through a fast Fourier transform, adaptively filter the frequency components using learnable spectrum gating, and then inversely transform them back to the spatial domain to enhance the representation of global structural features.
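The frequency domain transform branch of claim 5 follows a transform, gate, inverse-transform pattern, which the sketch below illustrates on a one-dimensional signal. It is not the claimed implementation: a naive O(n²) discrete Fourier transform is used in place of a fast Fourier transform so that the example needs only the standard library, the names `dft` and `spectral_gate` are hypothetical, and the gate values are fixed here whereas the claim describes them as learnable.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [
        sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, v in enumerate(values))
        for k in range(n)
    ]
    return [v / n for v in out] if inverse else out


def spectral_gate(signal, gate):
    """Scale each frequency component by its gate, then return to the spatial domain."""
    spectrum = dft(signal)
    filtered = [s * g for s, g in zip(spectrum, gate)]
    # The gates used below are real and symmetric, so the result is real
    # up to rounding; only the real part is kept.
    return [v.real for v in dft(filtered, inverse=True)]


signal = [1.0, 2.0, 3.0, 4.0]
identity = spectral_gate(signal, [1.0, 1.0, 1.0, 1.0])  # pass every frequency
dc_only = spectral_gate(signal, [1.0, 0.0, 0.0, 0.0])   # keep only the mean
print(identity)
print(dc_only)
```

An all-ones gate reproduces the input, while keeping only the zero-frequency component leaves the signal mean at every position, which shows how a single gate value affects the whole spatial extent at once.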

6. The method according to claim 1, characterized in that the lossless spatial downsampling unit includes:
a space-to-depth transformation branch, which preserves the high-frequency details and original pixel information of small targets by splitting adjacent pixels in the spatial dimensions of the features according to a scaling factor and rearranging them into the channel dimension;
a dynamic reweighting branch, which, through global average pooling, convolutional layers, and an operator, adaptively weights the sub-pixel channels of the stacked features to suppress background noise and enhance target information; and
a feature correction branch, which achieves dynamic correction and amplitude compensation of features through global average pooling, convolutional layers, and the Tanh activation function.
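The space-to-depth rearrangement in claim 6 can be illustrated with nested Python lists. The function name `space_to_depth` and the channel ordering are assumptions made for this sketch, and the dynamic reweighting and feature correction branches are not modelled.

```python
def space_to_depth(feature, scale=2):
    """Rearrange a C x H x W nested list into (C*scale*scale) x (H/scale) x (W/scale).

    Every input value appears exactly once in the output, so the spatial
    resolution drops without discarding any pixel.
    """
    height, width = len(feature[0]), len(feature[0][0])
    if height % scale or width % scale:
        raise ValueError("height and width must be divisible by scale")
    out = []
    for channel in feature:
        for dy in range(scale):
            for dx in range(scale):
                # Each (dy, dx) offset inside a scale x scale block
                # becomes its own output channel.
                out.append([
                    [channel[y][x] for x in range(dx, width, scale)]
                    for y in range(dy, height, scale)
                ])
    return out


# One 4x4 channel holding the values 0..15 in row-major order.
feature = [[[4 * y + x for x in range(4)] for y in range(4)]]
stacked = space_to_depth(feature, scale=2)
for channel in stacked:
    print(channel)
```

With a scaling factor of 2, the single 4×4 channel becomes four 2×2 channels, in contrast to strided convolution or pooling, which would merge or drop three of every four values.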

7. The method according to claim 1, characterized in that the dynamic aggregation unit includes:
a moving gated branch, which enables pixels to perceive neighborhood information through grouped cyclic shifting and then generates a spatial gating weight map through convolutional fusion and an activation function; and
a content interaction branch, which generates keys and queries from contextual priors and extended features, calculates an affinity matrix, and predicts dynamic convolution kernel parameters for adaptive filtering to extract morphological features of small targets.
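A minimal sketch of the grouped cyclic shifting described in claim 7 is given below. The function names, the four shift directions, the contiguous channel grouping, the uniform channel average used in place of a learned convolutional fusion, and the sigmoid activation are all assumptions for illustration.

```python
import math


def cyclic_shift(channel, dy, dx):
    """Shift a 2-D list by (dy, dx); values wrap around the borders."""
    height, width = len(channel), len(channel[0])
    return [
        [channel[(y - dy) % height][(x - dx) % width] for x in range(width)]
        for y in range(height)
    ]


def grouped_cyclic_shift(feature, offsets=((0, 1), (0, -1), (1, 0), (-1, 0))):
    """Split channels into contiguous groups and shift each group differently."""
    group_size = -(-len(feature) // len(offsets))  # ceiling division
    return [
        cyclic_shift(channel, *offsets[i // group_size])
        for i, channel in enumerate(feature)
    ]


def spatial_gate(feature):
    """Average over channels (stand-in for a learned fusion), then apply a sigmoid."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [1.0 / (1.0 + math.exp(-sum(feature[c][y][x] for c in range(channels)) / channels))
         for x in range(width)]
        for y in range(height)
    ]


# Four channels, each with a single impulse at the centre of a 3x3 map.
impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
shifted = grouped_cyclic_shift([impulse] * 4)
gate = spatial_gate(shifted)
print(gate)
```

After shifting, the impulse from each group lands on a different neighbour of the centre, so the fused gate at each of those four positions is raised by information that originated at the centre pixel.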

8. A small target detection device based on feature reconstruction, characterized in that it includes:
a shallow edge enhancement feature acquisition module, which is used to input the primary features obtained by convolution of the image into a two-layer DRSS-Unit module for processing, which is used to extract the set of small target edges and spatial texture details, to obtain shallow edge enhancement features, wherein the DRSS-Unit module includes a three-branch convolutional fusion unit, a texture branch unit, an edge branch unit, and an attention fusion unit;
a mid-level semantic transition feature acquisition module, which is used to input the shallow edge enhancement features into the MGSP-Unit module for processing, which takes into account both the target spatial location and the preliminary semantic abstraction information, to obtain mid-level semantic transition features;
a deep sparse semantic feature acquisition module, which is used to input the mid-level semantic transition features into the MGSP-Unit module for processing, which is used to capture the target global library context and deep category abstraction information, to obtain deep sparse semantic features;
a frequency domain global enhancement module, which is used to input the deep sparse semantic features into the SDF-AIFI module for processing to obtain frequency domain global enhancement features, wherein the SDF-AIFI module includes a multi-head attention unit, a self-attention unit, and a feedforward network;
a feature reconstruction and fusion module, which is used to input the frequency domain global enhancement features, the shallow edge enhancement features, and the mid-level semantic transition features into the CCFM module for feature reconstruction to obtain reconstructed fusion features, wherein the CCFM module includes an upsampling unit, a lossless spatial downsampling unit, and a dynamic aggregation unit, the lossless spatial downsampling unit is used to retain the spatial information of small targets during resolution reduction, and the dynamic aggregation unit is used for adaptive fusion of cross-scale features; and
a target detection module, which is used to input the reconstructed fusion features into the decoder to obtain the small target detection result.

9. A server, characterized in that the server includes: one or more processors; and a storage device for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the feature reconstruction-based small target detection method as described in any one of claims 1-7.

10. A medium containing computer-executable instructions, which, when executed by a computer processor, are used to perform the feature reconstruction-based small target detection method as described in any one of claims 1-7.

Citation Information

Patent Citations

Improved target detection method in automatic driving scene based on RT-DETR
CN118644824A
Infrared weak and small target detection method based on hybrid perception state space model
CN119478372A
