SAR image weakly supervised target detection based on frequency domain enhancement and hybrid candidate box generation
By enhancing the feature discrimination capability of SAR images through a contourlet frequency domain feature convolution module and an adaptive residual frequency domain attention module, and by using a multi-class token transformer to generate high-quality pseudo bounding boxes, the method addresses the insufficient candidate box quality and feature discrimination in weakly supervised target detection of SAR images and achieves more efficient target detection.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Application (China)
- Current Assignee / Owner: NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date: 2026-03-31
- Publication Date: 2026-06-19
AI Technical Summary
Existing weakly supervised target detection methods for SAR images suffer from low candidate box quality and insufficient feature discrimination ability, resulting in poor detection performance.
A contourlet frequency domain feature convolution module performs multi-directional decomposition, and an adaptive residual frequency domain attention module further improves feature discrimination capability. A multi-class token-driven transformer generates high-quality pseudo bounding boxes, which are fused with selective search candidate boxes to form a hybrid candidate box set.
The method effectively suppresses speckle noise, improves candidate box quality and feature discrimination ability, enhances detection accuracy and consistency, and alleviates the problem of error accumulation in weakly supervised training.
Figure CN122244543A_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer vision and remote sensing image processing, and specifically relates to SAR image target detection, in particular to a weakly supervised target detection method for SAR images based on frequency domain enhancement and hybrid candidate box generation.
Background Technology
[0002] Synthetic Aperture Radar (SAR), with its active microwave imaging mechanism, possesses all-weather, all-day Earth observation capabilities, playing an irreplaceable role in disaster monitoring, marine target identification, and other fields. With the development of deep learning technology, detection methods based on convolutional neural networks have been widely introduced into SAR target detection tasks. However, the coherent imaging mechanism of SAR images results in a large amount of inherent speckle noise and background clutter, blurring the boundary between targets and the background, posing a significant challenge to automatic detection.
[0003] Most existing SAR target detection methods rely on supervised training with a large number of accurate bounding box annotations. However, SAR image annotation requires specialized interpretation capabilities, which is extremely costly and severely restricts the construction of large-scale detection datasets. Weakly supervised target detection relies solely on image-level labels for training, effectively alleviating the annotation burden, but its performance is highly dependent on the quality of candidate regions and the discriminative power of feature representations. Existing weakly supervised SAR image detection methods suffer from the following main problems: First, the candidate bounding boxes are of low quality. Existing weakly supervised detection methods typically employ candidate bounding box generation methods based on low-level visual features, such as selective search. In SAR images, speckle noise and strong scattering backgrounds are easily misclassified as potential targets, causing a large number of candidate bounding boxes to shift to areas with strong background response. This significantly reduces the proportion of effective candidates and accumulates a large number of false positive samples during the aggregation and refinement process of multi-instance learning.
[0004] Second, the feature discrimination capability is insufficient. When performing feature extraction in the spatial domain, convolutional neural networks struggle to effectively distinguish between the target's structural scattering response and speckle noise components. Existing wavelet transform-based frequency domain enhancement methods can only provide decomposition in three fixed directions: horizontal, vertical, and diagonal. This limits their ability to capture the rich directional edge structures of synthetic aperture radar targets, resulting in insufficient stability in feature discrimination.
[0005] Therefore, designing a weakly supervised target detection method for synthetic aperture radar images that can effectively suppress speckle noise, improve candidate box quality, and enhance feature discrimination capability using only image-level labels is a key technical problem that urgently needs to be solved in this field.
Summary of the Invention
[0006] To address the aforementioned technical problems, this invention provides a weakly supervised target detection method for SAR images based on frequency domain enhancement and hybrid candidate box generation. The method constructs a contourlet frequency domain feature convolution module, which uses the multi-scale, multi-directional decomposition of the contourlet transform to suppress speckle noise and enhance the directional edge structure of targets. It further improves the stability of feature discrimination with an adaptive residual frequency domain attention module. In parallel, a multi-class token-driven transformer architecture generates class-discriminative pseudo bounding boxes, which are fused with the complementary selective search candidate boxes, mitigating the error accumulation problem of weakly supervised training at the candidate box level.
[0007] The technical solution adopted in this invention is a weakly supervised target detection method for SAR images based on frequency domain enhancement and hybrid candidate box generation, characterized by the following steps:
Step 1: Obtain the SAR image dataset and perform preprocessing;
Step 101: Obtain the SAR image dataset, in which only image-level labels are provided for the SAR images; the labels are indexed by target category number up to the total number of categories, and the images have a given size;
Step 102: Apply multi-scale augmentation to the SAR images, combined with random horizontal flipping, to obtain the expanded dataset, and divide the dataset into a training set and a test set;
Step 2: Construct a multi-class token transformer module to generate class-discriminative pseudo bounding boxes;
Step 201: Evenly divide the input SAR image into non-overlapping image patches of a fixed pixel size, and map each patch by linear projection to an embedding vector, yielding the image patch token matrix;
Step 202: Initialize the learnable category tokens, concatenate them with the image patch tokens, and add a learnable positional encoding to obtain the complete input sequence; extract features from the input sequence with a 12-layer transformer, each layer containing multi-head self-attention and a feedforward network; average-pool the category tokens output by the last layer to generate image-level predictions, and supervise them with the image-level labels through a multi-label binary cross-entropy loss computed over the per-class prediction probabilities;
Step 203: Average the attention weights of all attention heads in the last 4 layers of the transformer to obtain the attention weight matrix from category tokens to image patch tokens; reshape it into two-dimensional spatial form and upsample it to the original resolution by bilinear interpolation to obtain a class-specific spatial response map;
Step 204: Binarize the response map using its 80th percentile as the adaptive threshold, perform connected component analysis, and filter out noisy regions; for each retained region, take the minimum bounding rectangle as a pseudo bounding box and the mean of the response map within the box as its confidence, giving the set of pseudo bounding boxes;
Step 3: Merge the pseudo bounding boxes with the selective search candidate boxes to generate a hybrid candidate box set;
Step 301: Run the selective search algorithm on the input SAR image to generate a set of candidate boxes based on grayscale, texture, and spatial similarity, and assign each candidate box a default confidence of 0.5;
Step 302: Merge the set of pseudo bounding boxes with the selective search candidate box set, remove redundant boxes by non-maximum suppression, and retain the highest-confidence boxes as the hybrid candidate box set;
Step 4: Construct a contourlet frequency domain feature convolution module to enhance the input features in multiple directions in the frequency domain;
Step 401: Perform a three-layer contourlet decomposition on the input feature map, each layer being a cascade of a Laplacian pyramid decomposition and a directional filter bank; in the layer-wise decomposition, the low-pass filter and the interpolation filter are each formed as the tensor outer product of a row-direction filter and a column-direction filter and are used together with 2x downsampling and 2x upsampling; after three layers of decomposition, three bandpass residuals and the final low-frequency subband are obtained;
Step 402: Apply a 4-direction filter bank to the bandpass residual of each layer to decompose it into 4 directional subbands, where each directional filter corresponds to one direction angle and is evaluated at each coordinate position, so that edge structures at 0°, 45°, 90°, and 135° are captured; the complete 3-layer contourlet decomposition yields 13 subbands in total, namely 3 × 4 directional subbands plus 1 low-frequency subband;
Step 403: Perform independent learnable convolution operations on each subband: for each directional subband of each layer, define an independent convolution kernel with a set number of output channels; for the final low-frequency subband, likewise define an independent convolution kernel; in both cases the convolution is combined with batch normalization and a rectified linear unit (ReLU) activation function;
Step 404: Reconstruct the subbands back to the spatial domain by the inverse contourlet transform: first, combine the four directional subbands of each layer into a bandpass residual with the inverse directional filter bank, using the corresponding synthesis filters; then, taking the low-frequency subband as the initial value, reconstruct layer by layer with the inverse Laplacian pyramid; the reconstructed features keep the same spatial dimensions as the input;
Step 5: Construct an adaptive residual frequency domain attention module to further improve the stability of feature discrimination;
Step 501: From the output features of the contourlet frequency domain feature convolution module, generate query, key, and value features through three parallel branches, each of which applies layer normalization, a mapping that reduces the number of channels, and a depthwise separable convolution;
Step 502: Transform the query and key features to the frequency domain by the fast Fourier transform (FFT), compute their frequency domain correlation by element-wise complex multiplication with the complex conjugate, and return to the spatial domain by the inverse FFT to obtain the attention map; normalize the attention map using its mean and variance over the spatial dimensions together with a numerical stability term;
Step 503: Restore the refined features to the original number of channels, then fuse them with the input through a parameterized weighted residual connection, in which a learnable gating parameter matrix, initialized to all ones and optimized automatically during training, weights the fusion; the final output is the feature map after dual frequency domain enhancement;
Step 6: Construct a multi-instance learning detection network and complete weakly supervised training using the hybrid candidate boxes and the enhanced features;
Step 601: Feed the output features and the hybrid candidate box set into the multi-instance learning detection network; region-of-interest pooling extracts fixed-size region features, which are fed into the classification branch and the detection branch; after normalization, the two branch outputs are multiplied element-wise to obtain the fused region scores;
Step 602: Sum the region scores to aggregate them into image-level predictions and calculate the multi-instance learning loss; supervise the iterative optimization of three cascaded refinement branches using the fused region scores as soft pseudo-labels and calculate the refinement loss; use the average output of the three refinement branches as the soft pseudo-label supervising the distillation branch and calculate the distillation loss; the total loss function combines these losses using a balance coefficient;
Step 603: Train with a stochastic gradient descent optimizer, a cosine annealing learning rate schedule, and a batch size of 2 for 12 epochs to obtain the trained weakly supervised object detection network;
Step 7: Use the trained model to predict on the test set to obtain the target detection results;
Step 701: Input the test set images into the trained network, apply 10 multi-scale augmentations to each image, and average the detection outputs over all augmented versions to obtain the final target detection result.
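Step 204 (percentile thresholding, connected component analysis, bounding rectangles, mean-response confidence) can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the function names, the 4-connectivity, the strict `>` comparison, and the `min_area` noise filter are assumptions made for illustration, since the text does not state how noisy regions are identified.

```python
from collections import deque


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def pseudo_boxes(response, q=80.0, min_area=2):
    """Return [(x1, y1, x2, y2, confidence), ...] from a 2-D response map.

    min_area is a hypothetical noise filter (assumption, not from the source).
    """
    h, w = len(response), len(response[0])
    thr = percentile([v for row in response for v in row], q)
    mask = [[response[y][x] > thr for x in range(w)] for y in range(h)]
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over 4-connected foreground pixels.
            queue, comp = deque([(y0, x0)]), []
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) < min_area:
                continue  # treat tiny components as noise
            ys = [p[0] for p in comp]
            xs = [p[1] for p in comp]
            x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
            # Confidence is the mean response inside the bounding rectangle.
            inside = [response[y][x] for y in range(y1, y2 + 1) for x in range(x1, x2 + 1)]
            boxes.append((x1, y1, x2, y2, sum(inside) / len(inside)))
    return boxes
```

In practice the response map would come from the upsampled class-token attention of Step 203, and a library routine for connected component labelling would replace the hand-written flood fill.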
[0008] Compared with the prior art, the present invention has the following advantages: 1. This invention proposes a contourlet frequency domain feature convolution module, which maps input features to 13 subbands through a 3-layer contourlet decomposition with 4 directions per layer (12 directional subbands plus 1 low-frequency subband) and applies an independent learnable convolution to each subband. Existing frequency domain enhancement methods based on the Haar wavelet transform provide decomposition in only 3 fixed directions. By cascading Laplacian pyramids with directional filter banks, the contourlet transform models target edge structures separately in four directions (0°, 45°, 90°, and 135°) and has stronger directional feature representation capability, effectively reducing the interference of background scattering with instance selection in the multi-instance learning stage.
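The Laplacian pyramid stage and the subband count can be illustrated with a simplified sketch. It works on a 1-D signal, omits the directional filter bank entirely, and uses a 3-tap low-pass kernel and linear interpolation that are illustrative assumptions rather than the filters of the invention. It shows that three bandpass residuals plus one low-frequency band reconstruct the input, and that 3 layers with 4 directions give 13 subbands.

```python
def downsample(x, lowpass=(0.25, 0.5, 0.25)):
    """Low-pass filter with edge replication, then keep every second sample."""
    n = len(x)
    k = len(lowpass) // 2
    smooth = [
        sum(c * x[min(max(i + j - k, 0), n - 1)] for j, c in enumerate(lowpass))
        for i in range(n)
    ]
    return smooth[::2]


def upsample(x, n):
    """Upsample to length n by linear interpolation (a simple interpolation filter)."""
    out = []
    for i in range(n):
        pos = i / 2.0
        lo = min(int(pos), len(x) - 1)
        hi = min(lo + 1, len(x) - 1)
        out.append(x[lo] + (x[hi] - x[lo]) * (pos - lo))
    return out


def lp_decompose(x, levels=3):
    """Return (bandpass residuals, final low-frequency band)."""
    residuals, low = [], list(x)
    for _ in range(levels):
        coarse = downsample(low)
        predicted = upsample(coarse, len(low))
        residuals.append([a - b for a, b in zip(low, predicted)])
        low = coarse
    return residuals, low


def lp_reconstruct(residuals, low):
    """Inverse pyramid: start from the low band and add residuals layer by layer."""
    for res in reversed(residuals):
        predicted = upsample(low, len(res))
        low = [r + p for r, p in zip(res, predicted)]
    return low


def subband_count(levels=3, directions=4):
    """Directional subbands per layer plus one low-frequency subband."""
    return levels * directions + 1
```

Because each residual stores exactly what the upsampled coarse band fails to predict, reconstruction is exact up to floating-point rounding regardless of the filters chosen.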
[0009] 2. This invention designs an adaptive residual frequency domain attention module, which calculates the global correlation between features in the frequency domain through the fast Fourier transform and introduces a learnable gating parameter matrix to adaptively adjust the contribution of the enhanced features. The computational complexity of this module is far lower than that of standard spatial domain self-attention, so it captures long-range dependencies efficiently on high-resolution feature maps. It complements the contourlet frequency domain feature convolution module: the contourlet module provides local multi-directional enhancement, and the attention module provides global frequency domain modeling.
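The frequency domain correlation can be checked numerically with a small sketch. It is a 1-D illustration using a naive discrete Fourier transform as a stand-in for an FFT, and it assumes the key spectrum is the one conjugated, which the text does not specify. The sketch shows that the inverse transform of the conjugate product equals the circular cross-correlation of query and key, and then applies a mean and variance normalization with a stability term `eps` (an assumed value).

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; a stand-in for an FFT."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def frequency_attention(q, k, eps=1e-6):
    """Return (raw correlation, normalized attention) for real 1-D inputs."""
    fq, fk = dft(q), dft(k)
    # Element-wise product of the query spectrum with the conjugated key spectrum.
    product = [a * b.conjugate() for a, b in zip(fq, fk)]
    corr = [v.real for v in dft(product, inverse=True)]
    mean = sum(corr) / len(corr)
    var = sum((c - mean) ** 2 for c in corr) / len(corr)
    return corr, [(c - mean) / (var + eps) ** 0.5 for c in corr]


def circular_correlation(q, k):
    """Direct O(n^2) reference: c[s] = sum_t q[(t + s) mod n] * k[t]."""
    n = len(q)
    return [sum(q[(t + s) % n] * k[t] for t in range(n)) for s in range(n)]
```

With a true FFT the correlation over N positions costs on the order of N log N operations, whereas comparing every position with every other position directly costs on the order of N squared, which is the source of the complexity advantage described above.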
[0010] 3. This invention introduces a hybrid candidate box generation module, which learns class-discriminative spatial response maps and generates high-quality pseudo bounding boxes through a multi-class token-driven transformer architecture. After complementary fusion with the selective search candidate boxes, the merged boxes are filtered by non-maximum suppression, which effectively improves the localization accuracy and class consistency of the candidate regions. This provides a more reliable positive sample basis for the subsequent cascaded refinement branches and alleviates chained error accumulation in weakly supervised training at the candidate box level.
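The fusion of pseudo bounding boxes with selective search boxes can be sketched as a greedy non-maximum suppression over the merged pool. The default confidence of 0.5 comes from the text; the function names, the IoU threshold `iou_thr`, and the optional `top_n` cut-off are hypothetical parameters introduced for illustration only.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hybrid_candidates(pseudo, selective, iou_thr=0.7, top_n=None, default_conf=0.5):
    """Merge boxes, suppress overlaps greedily, keep the most confident ones.

    pseudo:    [(x1, y1, x2, y2, confidence), ...] from the transformer branch
    selective: [(x1, y1, x2, y2), ...] from selective search
    iou_thr and top_n are assumed values, not taken from the source.
    """
    pool = list(pseudo) + [tuple(b) + (default_conf,) for b in selective]
    pool.sort(key=lambda b: b[4], reverse=True)  # stable sort, highest confidence first
    kept = []
    for box in pool:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(box, other) <= iou_thr for other in kept):
            kept.append(box)
    return kept if top_n is None else kept[:top_n]
```

Because pseudo bounding boxes usually carry a confidence above the 0.5 default, a pseudo box tends to win over a heavily overlapping selective search box, while selective search boxes in regions the transformer missed are still retained.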
[0011] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
Description of the Drawings
Figure 1 is a general framework diagram of the present invention.
Detailed Description
The method of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0012] It should be noted that, unless otherwise specified, the embodiments and attributes described in this application can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0013] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to this application. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0014] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0015] For ease of description, spatial relative terms such as "above," "on top of," "on the upper surface of," "above," etc., are used herein to describe the spatial positional relationship of a device or feature as shown in the figures to other devices or features. It should be understood that spatial relative terms are intended to encompass different orientations in use or operation beyond the orientation of the device as described in the figures. For example, if the device in the figures were inverted, a device described as "above" or "on top of" other devices or structures would subsequently be positioned as "below" or "under" other devices or structures. Thus, the exemplary term "above" can include both "above" and "below." The device may also be positioned in other different ways (rotated 90 degrees or in other orientations), and the spatial relative descriptions used herein will be interpreted accordingly.
[0016] As shown in Figure 1, the present invention includes Steps 1 to 7, with sub-steps 101 to 701, which are carried out exactly as set out in paragraph [0007] above.
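The region scoring and image-level aggregation of Steps 601 and 602 can be sketched as follows. The text only says the two branch outputs are normalized and multiplied; this sketch assumes a softmax over classes for the classification branch and a softmax over regions for the detection branch, a common choice in multi-instance detection networks. All function names are hypothetical, and the refinement and distillation branches are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_forward(cls_logits, det_logits):
    """cls_logits and det_logits are R x C lists (regions x classes)."""
    n_regions, n_classes = len(cls_logits), len(cls_logits[0])
    # Classification branch: normalize over classes within each region (assumption).
    cls = [softmax(row) for row in cls_logits]
    # Detection branch: normalize over regions within each class (assumption).
    det = [softmax([det_logits[r][c] for r in range(n_regions)]) for c in range(n_classes)]
    # Fused region score is the element-wise product of the two branches.
    scores = [[cls[r][c] * det[c][r] for c in range(n_classes)] for r in range(n_regions)]
    # Image-level prediction is the sum of region scores per class.
    image_pred = [sum(scores[r][c] for r in range(n_regions)) for c in range(n_classes)]
    return scores, image_pred


def mil_loss(image_pred, labels, eps=1e-7):
    """Binary cross-entropy between image-level predictions and image-level labels."""
    total = 0.0
    for p, y in zip(image_pred, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

Under these assumptions each image-level prediction lies between 0 and 1, so it can be compared directly with the image-level labels, which are the only supervision available.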
[0017] The above description is merely an embodiment of the present invention and is not intended to limit the present invention in any way. Any simple modifications, alterations, or equivalent structural changes made to the above embodiments based on the technical essence of the present invention shall still fall within the protection scope of the present invention.
Claims
1. A method for weakly supervised target detection in SAR images based on frequency domain enhancement and hybrid candidate box generation, characterized in that it includes the following steps:
Step 1: Obtain the SAR image dataset and perform preprocessing;
Step 101: Obtain the SAR image dataset, in which only image-level labels are provided for the SAR images; the labels are indexed by target category number up to the total number of categories, and the images have a given size;
Step 102: Apply multi-scale augmentation to the SAR images, combined with random horizontal flipping, to obtain the expanded dataset, and divide the dataset into a training set and a test set;
Step 2: Construct a multi-class token transformer module to generate class-discriminative pseudo bounding boxes;
Step 201: Evenly divide the input SAR image into non-overlapping image patches of a fixed pixel size, and map each patch by linear projection to an embedding vector, yielding the image patch token matrix;
Step 202: Initialize the learnable category tokens, concatenate them with the image patch tokens, and add a learnable positional encoding to obtain the complete input sequence; extract features from the input sequence with a 12-layer transformer, each layer containing multi-head self-attention and a feedforward network; average-pool the category tokens output by the last layer to generate image-level predictions, and supervise them with the image-level labels through a multi-label binary cross-entropy loss computed over the per-class prediction probabilities;
Step 203: Average the attention weights of all attention heads in the last 4 layers of the transformer to obtain the attention weight matrix from category tokens to image patch tokens; reshape it into two-dimensional spatial form and upsample it to the original resolution by bilinear interpolation to obtain a class-specific spatial response map;
Step 204: Binarize the response map using its 80th percentile as the adaptive threshold, perform connected component analysis, and filter out noisy regions; for each retained region, take the minimum bounding rectangle as a pseudo bounding box and the mean of the response map within the box as its confidence, giving the set of pseudo bounding boxes;
Step 3: Merge the pseudo bounding boxes with the selective search candidate boxes to generate a hybrid candidate box set;
Step 301: Run the selective search algorithm on the input SAR image to generate a set of candidate boxes based on grayscale, texture, and spatial similarity, and assign each candidate box a default confidence of 0.5;
Step 302: Merge the set of pseudo bounding boxes with the selective search candidate box set, remove redundant boxes by non-maximum suppression, and retain the highest-confidence boxes as the hybrid candidate box set;
Step 4: Construct a contourlet frequency domain feature convolution module to enhance the input features in multiple directions in the frequency domain;
Step 401: Perform a three-layer contourlet decomposition on the input feature map, each layer being a cascade of a Laplacian pyramid decomposition and a directional filter bank; in the layer-wise decomposition, the low-pass filter and the interpolation filter are each formed as the tensor outer product of a row-direction filter and a column-direction filter and are used together with 2x downsampling and 2x upsampling; after three layers of decomposition, three bandpass residuals and the final low-frequency subband are obtained;
Step 402: Apply a 4-direction filter bank to the bandpass residual of each layer to decompose it into 4 directional subbands, where each directional filter corresponds to one direction angle and is evaluated at each coordinate position, so that edge structures at 0°, 45°, 90°, and 135° are captured; the complete 3-layer contourlet decomposition yields 13 subbands in total, namely 3 × 4 directional subbands plus 1 low-frequency subband;
Step 403: Perform independent learnable convolution operations on each subband: for each directional subband of each layer, define an independent convolution kernel with a set number of output channels; for the final low-frequency subband, likewise define an independent convolution kernel; in both cases the convolution is combined with batch normalization and a rectified linear unit (ReLU) activation function;
Step 404: Reconstruct the subbands back to the spatial domain by the inverse contourlet transform: first, combine the four directional subbands of each layer into a bandpass residual with the inverse directional filter bank, using the corresponding synthesis filters; then, taking the low-frequency subband as the initial value, reconstruct layer by layer with the inverse Laplacian pyramid; the reconstructed features keep the same spatial dimensions as the input;
Step 5: Construct an adaptive residual frequency domain attention module to further improve the stability of feature discrimination;
Step 501: From the output features of the contourlet frequency domain feature convolution module, generate query, key, and value features through three parallel branches, each of which applies layer normalization, a mapping that reduces the number of channels, and a depthwise separable convolution;
Step 502: Transform the query and key features to the frequency domain by the fast Fourier transform (FFT), compute their frequency domain correlation by element-wise complex multiplication with the complex conjugate, and return to the spatial domain by the inverse FFT to obtain the attention map; normalize the attention map using its mean and variance over the spatial dimensions together with a numerical stability term;
Step 503: Restore the refined features to the original number of channels, then fuse them with the input through a parameterized weighted residual connection, in which a learnable gating parameter matrix, initialized to all ones and optimized automatically during training, weights the fusion; the final output is the feature map after dual frequency domain enhancement;
Step 6: Construct a multi-instance learning detection network and complete weakly supervised training using the hybrid candidate boxes and the enhanced features;
Step 601: Feed the output features and the hybrid candidate box set into the multi-instance learning detection network; region-of-interest pooling extracts fixed-size region features, which are fed into the classification branch and the detection branch; after normalization, the two branch outputs are multiplied element-wise to obtain the fused region scores;
Step 602: Sum the region scores to aggregate them into image-level predictions and calculate the multi-instance learning loss; supervise the iterative optimization of three cascaded refinement branches using the fused region scores as soft pseudo-labels and calculate the refinement loss; use the average output of the three refinement branches as the soft pseudo-label supervising the distillation branch and calculate the distillation loss; the total loss function combines these losses using a balance coefficient;
Step 603: Train with a stochastic gradient descent optimizer, a cosine annealing learning rate schedule, and a batch size of 2 for 12 epochs to obtain the trained weakly supervised object detection network;
Step 7: Use the trained model to predict on the test set to obtain the target detection results;
Step 701: Input the test set images into the trained network, apply 10 multi-scale augmentations to each image, and average the detection outputs over all augmented versions to obtain the final target detection result.