A citrus detection method using a frequency domain aggregated attention mechanism and a multi-scale encoder

CN120375087B · Active · Publication Date: 2026-06-26 · TIANJIN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TIANJIN UNIV
Filing Date
2025-04-24
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

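However, existing methods still have limitations when handling small and occluded targets, mainly insufficient feature extraction capability, inadequate use of contextual information, and insufficient preservation of high-frequency detail information.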

Benefits of technology

This invention constructs a self-collected citrus detection dataset containing occluded and small fruits, providing a crucial data foundation for model training. A Frequency Aggregation Attention Network (FAN) is designed to enhance the model's sensitivity to the frequency features of small and occluded targets, effectively capturing subtle features in complex scenes and improving its ability to extract blurred and complex features.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure relates to the technical field of citrus detection in agricultural automation, and particularly relates to a citrus detection method based on a frequency domain aggregation attention mechanism and a multi-scale encoder. The method proposes an innovative detection framework for small target and occluded target detection problems, including a frequency aggregation attention network (FAN) and a multi-scale Transformer encoder. The frequency aggregation attention network decomposes the feature map through two-dimensional discrete wavelet transform to enhance the frequency domain features of small targets and occluded targets; the multi-scale Transformer encoder combines convolution feature pyramid operation and retains high-frequency detail information through a wavelet fusion module. In addition, an IoU-aware query selection and an optimized loss function are used to significantly improve detection accuracy. Through frequency domain feature enhancement and multi-scale feature fusion, the present disclosure realizes high-precision detection of citrus fruits, especially small targets and occluded targets, and is suitable for fruit grading, yield statistics and other scenes in the field of intelligent agriculture.

Description

Technical Field

[0001] This invention relates to the field of citrus detection technology in agricultural automation, and in particular to a citrus detection method based on a frequency domain aggregation attention mechanism and a multi-scale encoder using a Transformer architecture.

Background Technology

[0002] Citrus fruit detection, a crucial component of intelligent agricultural management and automated robotics, aims to achieve efficient and accurate identification and localization of citrus fruits using advanced image processing and machine learning technologies. In intelligent orchard management and agricultural automation, citrus detection technology is increasingly widely applied, encompassing fruit counting, pest and disease prediction, and robotic harvesting. However, the complexity of the natural environment and the diversity of fruit growth patterns make the detection of small and occluded citrus fruits a significant technical challenge.

[0003] Traditional citrus detection methods primarily rely on handcrafted feature-based object detection algorithms, such as template matching methods based on color, shape, and texture features. While these methods achieve certain detection results under specific conditions, their accuracy and robustness significantly decline when faced with complex backgrounds, fruit occlusion, and changes in lighting. In recent years, the rise of deep learning technology has brought new breakthroughs to citrus detection. Object detection algorithms based on convolutional neural networks (CNNs), such as YOLO and Faster R-CNN, can effectively improve detection accuracy and speed by learning feature patterns in the data. However, existing methods still have limitations when dealing with small and occluded targets, mainly in terms of insufficient feature extraction capabilities, inadequate utilization of contextual information, and insufficient preservation of high-frequency detail information.

[0004] Furthermore, the scarcity of existing citrus detection datasets further limits the training performance of the models. Most publicly available datasets contain only a small number of labeled images with low annotation accuracy, making it difficult to meet the detection needs in complex scenarios. Dedicated datasets for citrus harvesting tasks are even more lacking, which limits the generalization ability of the models in practical applications. Therefore, developing a citrus detection method that can effectively handle small and occluded targets, and constructing a high-quality citrus detection dataset, is of great significance for promoting the intelligent development of the citrus industry.

Summary of the Invention

[0005] This invention discloses a citrus detection method using a frequency-domain aggregation attention mechanism and a multi-scale encoder, aiming to solve the challenge of accurately detecting small and occluded citrus fruits. First, a self-collected citrus detection dataset containing occluded and small fruits is constructed, providing a crucial data foundation for model training. Second, a Frequency Aggregation Attention Network (FAN) is designed to enhance the model's sensitivity to the frequency features of small and occluded targets, effectively capturing subtle features in complex scenes. In addition, an efficient multi-scale Transformer encoder is developed, and a Haar wavelet fusion module is introduced to eliminate aliasing effects during upsampling and preserve high-frequency details of small and occluded targets. Furthermore, a multi-level supervision strategy is employed, combining IoU-aware query selection with a loss function that is dynamically optimized through the decoder, significantly improving model convergence speed and detection accuracy. Through these combined innovations, this invention achieves high-precision detection of citrus fruits, especially small and occluded targets, and is applicable to scenarios such as fruit grading and yield statistics in smart agriculture.

[0006] To achieve the above objectives, the present invention adopts the following technical solution:

[0007] Step 1: Construct a self-collected citrus detection dataset, which includes citrus images of occluded and small fruits. The image resolution is 3456×3456, and 47819 citrus targets are labeled.

[0008] Step 2: Preprocess the image data obtained in Step 1 to obtain the preprocessed citrus image;

[0009] Step 3: Design a Frequency Aggregation Attention Network (FAN) to perform a two-dimensional discrete wavelet transform (2DDWT) on the input feature map, decompose it into different frequency sub-bands, and compress information in the channel dimension and the frequency dimension respectively to enhance the frequency features of small targets and occluded targets.

[0010] Step 4: Develop a multi-scale Transformer encoder, combining self-attention mechanism and convolutional feature pyramid operation, and perform cross-scale feature fusion through Haar wavelet fusion module to preserve high-frequency details;

[0011] Step 5: Use IoU-aware query selection together with the decoder, and optimize the loss function, to improve the model's convergence speed and detection accuracy.

[0012] Step 6: Import the trained network model and input the test set into the network to test its performance.

[0013] The invention is further characterized by:

[0014] Furthermore, the specific process of step 3 is as follows:

[0015] The input feature map is decomposed by two-dimensional discrete wavelet transform (2DDWT) to obtain low-frequency and high-frequency sub-bands.

[0016] For each frequency sub-band, channel-dimensional compression is performed, and a large convolutional kernel is used to expand the receptive field and enhance the contextual semantics. Global average pooling is then performed on the frequency sub-bands, and weights are calculated through a dimensionality-reducing fully connected layer and an activation layer. Finally, a scaling operation is performed to obtain the enhanced feature map.

[0017] Step 4 is as follows:

[0018] First, a top-down path is used to align multi-scale features through 1×1 convolutions, ensuring consistency of feature maps at different scales in the channel dimension, thereby providing a unified feature representation for subsequent feature fusion.

[0019] Then, following a bottom-up approach: the feature map is deepened through 3×3 convolution, expanding the receptive field and enhancing the expressive power of the features, thereby extracting more discriminative feature information and improving the model's detection accuracy of the target.

[0020] In the dual-path feature fusion process, a novel fusion module, the Haar wavelet fusion module, is proposed, as follows:

[0021] The Haar wavelet fusion module uses Haar wavelet transform to decompose large-scale features, concatenates them with small-scale features to enhance low-frequency information, and obtains the final fused output through residual operations and inverse wavelet transform.

[0022] Furthermore, the loss function described in step 5 includes:

[0023] A bounding box regression loss $\mathcal{L}_{box}$ and a classification loss $\mathcal{L}_{cls}$. The classification loss incorporates IoU information to minimize the inconsistency between the confidence score and the localization quality, thereby promoting model convergence.

[0024] $$\mathcal{L} = \mathcal{L}_{box}(\hat{b}, b) + \mathcal{L}_{cls}(\hat{C}, C, \mathrm{IoU})$$;

[0025] where $b$ represents the ground-truth bounding box and $\hat{b}$ represents the predicted bounding box, $C$ represents the true class and $\hat{C}$ represents the predicted class, and IoU represents the intersection over union between the predicted bounding box and the ground-truth bounding box.

[0026] The beneficial effects of this invention are:

[0027] 1. This invention constructs a self-collected citrus detection dataset containing occluded and small fruits, providing a crucial data foundation for model training. A Frequency Aggregation Attention Network (FAN) is designed to enhance the model's sensitivity to the frequency features of small and occluded targets, effectively capturing subtle features in complex scenes and improving its ability to extract blurred and complex features.

[0028] 2. This invention develops a high-efficiency multi-scale Transformer encoder and introduces a Haar wavelet fusion module to eliminate aliasing during upsampling and preserve high-frequency details of small and occluded targets. A multi-level supervision strategy, combined with IoU-aware query selection and a loss function dynamically optimized through the decoder, significantly improves model convergence speed and detection accuracy. Through these combined innovations, this invention achieves high-precision detection of citrus fruits, especially small and occluded targets, and is applicable to scenarios such as fruit grading and yield statistics in smart agriculture.

Attached Figure Description

[0029] Figure 1 shows the overall network framework of this invention, which significantly improves the detection performance for small and occluded citrus fruits;

[0030] Figure 2 is a schematic diagram of the Frequency Aggregation Attention Network (FAN) structure, which enhances the model's frequency-domain sensitivity;

[0031] Figure 3 is a schematic diagram of the Haar wavelet fusion module structure, which reduces frequency-domain aliasing and efficiently fuses multi-scale features;

[0032] Figure 4 is a schematic diagram of the object detection results on the self-collected dataset. The model's false detection rate and missed detection rate are significantly reduced.

Detailed Implementation

[0033] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without creative effort are within the scope of protection of this disclosure. Constructing a citrus target detector includes dataset construction, attention mechanism design, encoder network structure reconstruction, loss function optimization, etc.

[0034] Dataset Preparation: A self-collected citrus dataset was constructed, containing 1388 images of citrus canopy layers, covering various citrus varieties. The image resolution is 3456×3456, and 47819 citrus targets are labeled. The dataset is divided into training, validation, and test sets in a ratio of 3:1:1.
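The 3:1:1 division of the 1388 images can be illustrated with a short Python sketch. The patent does not state the exact per-subset counts or how remainders were handled, so the largest-remainder rounding used here and the helper name `split_counts` are assumptions made only for illustration.

```python
def split_counts(total, ratios):
    """Split `total` items by integer `ratios` using largest-remainder rounding."""
    weight = sum(ratios)
    exact = [total * r / weight for r in ratios]
    counts = [int(x) for x in exact]
    # Hand the leftover items to the subsets with the largest fractional parts.
    leftover = total - sum(counts)
    order = sorted(range(len(ratios)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Figures taken from paragraph [0034]; the rounding rule is an assumption.
train, val, test = split_counts(1388, (3, 1, 1))
targets_per_image = 47819 / 1388
print(train, val, test, round(targets_per_image, 2))
```

With roughly 34 labeled targets per image on average, each image is a dense scene, which is consistent with the emphasis on small and occluded fruits.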

[0035] To address the shortcomings of existing technologies, this invention discloses a citrus detection method using a frequency-domain aggregation attention mechanism and a multi-scale encoder. As shown in Figure 1, the method includes:

[0036] Step 1: Construct an attention mechanism to extract subtle and occluded citrus frequency domain features;

[0037] Step 2: Use top-down and bottom-up feature fusion paths in the encoder;

[0038] Step 3: Construct a new feature fusion module in a top-down approach.

[0039] In a specific example of this disclosure, step 1 proceeds as shown in Figure 2: a two-dimensional discrete wavelet transform is performed on the input feature map, decomposing it into a low-frequency sub-band (LL) and three high-frequency sub-bands (LH, HL, HH). The low-frequency sub-band preserves the contour information of the image, while the high-frequency sub-bands preserve its detail information.

[0040] $$X_{LL} = \downarrow_c\big(f_L * \downarrow_r(f_L * X)\big), \quad X_{LH} = \downarrow_c\big(f_H * \downarrow_r(f_L * X)\big), \quad X_{HL} = \downarrow_c\big(f_L * \downarrow_r(f_H * X)\big), \quad X_{HH} = \downarrow_c\big(f_H * \downarrow_r(f_H * X)\big)$$;

[0041] where $f_L$ and $f_H$ are the low-pass and high-pass filters of the 2DDWT, respectively; $X$ denotes the input feature map; $\downarrow_r$ and $\downarrow_c$ denote downsampling along rows and columns, respectively; and $X_{LL}$, $X_{LH}$, $X_{HL}$, $X_{HH}$ are the four sub-bands after the wavelet transform;
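As an illustration of the decomposition described above, the following is a minimal pure-Python sketch of a single-level 2D Haar transform on one channel. The choice of the orthonormal Haar filters, the row-then-column order, and the LH/HL naming are assumptions (conventions differ between libraries); the patent's actual filters and implementation are not reproduced in the text.

```python
import math

S = 1.0 / math.sqrt(2.0)  # orthonormal Haar scaling factor


def haar_1d(seq):
    """One Haar analysis step: low/high-pass filter, then downsample by 2."""
    low = [(seq[i] + seq[i + 1]) * S for i in range(0, len(seq), 2)]
    high = [(seq[i] - seq[i + 1]) * S for i in range(0, len(seq), 2)]
    return low, high


def transpose(m):
    return [list(col) for col in zip(*m)]


def filter_columns(m):
    """Apply haar_1d down every column and return (low, high) as row-major 2D lists."""
    lows, highs = [], []
    for col in transpose(m):
        lo, hi = haar_1d(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def haar_dwt2(x):
    """Single-level 2D Haar DWT of a 2D list with even height and width."""
    row_low, row_high = [], []
    for row in x:  # filter and downsample within each row first
        lo, hi = haar_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = filter_columns(row_low)   # low-pass rows, then low/high along columns
    hl, hh = filter_columns(row_high)  # high-pass rows, then low/high along columns
    return ll, lh, hl, hh


ll, lh, hl, hh = haar_dwt2([[1.0, 2.0], [3.0, 4.0]])
print(ll, lh, hl, hh)
```

Each sub-band has half the height and width of the input, and with orthonormal filters the total energy (sum of squares) of the four sub-bands equals that of the input.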

[0042] Channel-dimensional compression is performed on each frequency sub-band. Large convolutional kernels (e.g., 17×17) are used to expand the receptive field, enhancing the contextual semantic understanding of small and occluded targets.

[0043] ;

[0044] In this formula, the operators denote the fully connected operation and global average pooling, and X_i is a sub-band after wavelet decomposition.

[0045] Y and S_i denote intermediate features. Global average pooling is performed on the frequency sub-bands, weights are calculated through a dimensionality-reducing fully connected layer and an activation layer, and a final scaling operation yields the enhanced feature map.

[0046] ;

[0047] In this formula, the excitation layer computes the weights, C denotes the concatenation operation, a further operator denotes the scaling operation, and the result is the enhanced feature map;
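The pooling, fully connected, activation, and scaling sequence described above resembles squeeze-and-excitation style channel attention. The sketch below illustrates that general pattern on one sub-band in pure Python; it is not the patent's formula, which is not reproduced in the text. The two-layer layout, the ReLU and sigmoid choices, and the weight matrices `w_reduce` and `w_expand` are all assumptions.

```python
import math


def global_avg_pool(channels):
    """Average each channel (a 2D list) down to one number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in channels]


def linear(vec, weight):
    """Fully connected layer without bias; each row of `weight` gives one output."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_attention(channels, w_reduce, w_expand):
    """Pool, reduce, activate, expand, gate, then rescale every channel."""
    squeezed = global_avg_pool(channels)
    hidden = [max(0.0, h) for h in linear(squeezed, w_reduce)]  # ReLU (assumed)
    gates = [sigmoid(g) for g in linear(hidden, w_expand)]      # sigmoid (assumed)
    scaled = [[[gate * v for v in row] for row in ch]
              for ch, gate in zip(channels, gates)]
    return scaled, gates


# Hypothetical 2-channel sub-band and hand-picked weights, for illustration only.
sub_band = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]
w_reduce = [[0.5, 0.5]]       # 2 channels -> 1 hidden unit
w_expand = [[1.0], [-1.0]]    # 1 hidden unit -> 2 channel gates
scaled, gates = channel_attention(sub_band, w_reduce, w_expand)
print(gates)
```

Each gate lies between 0 and 1, so the scaling step can only attenuate a channel relative to the others; in a trained network the weights would be learned rather than fixed.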

[0048] For step 2, the specific operation is as follows: reconstruct the multi-scale Transformer encoder, and through dual-path multi-level feature fusion, enable the model to fully extract target location information and semantic details;

[0049] As shown in Figure 2, the multi-scale Transformer encoder performs multi-scale feature fusion through two paths: top-down and bottom-up.

[0050] Top-down approach: Align multi-scale features through 1×1 convolution to ensure consistency of features at different scales;

[0051] Bottom-up approach: Deepen the feature map using 3×3 convolutions to enhance the expressive power of the features;

[0052] In the multi-scale Transformer encoder, the Haar wavelet fusion module further eliminates the frequency domain aliasing problem caused by upsampling and continuously extracts the frequency domain information of the features;

[0053] For step 3, as shown in Figure 3, the specific operation is as follows: large-scale features are decomposed using the Haar wavelet transform and concatenated with small-scale features to enhance low-frequency information. The final fused output is obtained through residual operations and the two-dimensional inverse wavelet transform (2DIWT), preserving high-frequency details.

[0054] ;

[0055] In this formula, one symbol represents the large-scale feature map at a lower level, E represents the small-scale feature map at a higher level, n is the number of enhancement modules, and the remaining symbols denote the intermediate features and the final fused output;
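The idea of fusing in the wavelet domain instead of upsampling can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's module: the large-scale map is decomposed with a 2×2-block Haar transform, the small-scale map is added to the LL band (a plain residual addition standing in for the concatenation and learned enhancement modules), and the inverse transform restores the original resolution. The function names are hypothetical.

```python
def haar_dwt2(x):
    """Single-level 2D Haar DWT computed per 2x2 block (orthonormal scaling)."""
    h, w = len(x) // 2, len(x[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2
            lh[i][j] = (a + b - c - d) / 2
            hl[i][j] = (a - b + c - d) / 2
            hh[i][j] = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds every 2x2 block from the four sub-bands."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            p, q, r, s = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (p + q + r + s) / 2
            x[2 * i][2 * j + 1] = (p + q - r - s) / 2
            x[2 * i + 1][2 * j] = (p - q + r - s) / 2
            x[2 * i + 1][2 * j + 1] = (p - q - r + s) / 2
    return x


def haar_fuse(large, small):
    """Add the small-scale map to the LL band only, then invert the transform."""
    ll, lh, hl, hh = haar_dwt2(large)
    fused_ll = [[l + s for l, s in zip(l_row, s_row)] for l_row, s_row in zip(ll, small)]
    return haar_idwt2(fused_ll, lh, hl, hh)


large = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical lower-level (large-scale) map
print(haar_fuse(large, [[0.0]]))    # zero small-scale input leaves the map unchanged
print(haar_fuse(large, [[2.0]]))
```

Because the high-frequency bands pass through untouched, the fine detail of the large-scale map survives the fusion exactly, which is the property the module is described as preserving.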

[0056] Finally, the IoU-aware decoder and the loss function are optimized.

[0057] $$\mathcal{L} = \mathcal{L}_{box}(\hat{b}, b) + \mathcal{L}_{cls}(\hat{C}, C, \mathrm{IoU})$$;

[0058] where $\mathcal{L}_{box}$ measures the difference between the predicted bounding box and the ground-truth bounding box; an IoU loss function is used to directly optimize the localization accuracy of the bounding box;

[0059] $\mathcal{L}_{cls}$ uses the cross-entropy loss function to optimize the classification results. IoU information is incorporated into the classification loss to minimize the inconsistency between the confidence score and the localization quality, thus promoting model convergence.

[0060] $b$ represents the ground-truth bounding box and $\hat{b}$ represents the predicted bounding box, $C$ represents the true category and $\hat{C}$ represents the predicted category, and IoU represents the intersection over union between the predicted bounding box and the ground-truth bounding box.
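To make the IoU-aware loss concrete, here is a minimal Python sketch for a single matched prediction. The exact loss terms and weights are not given in the text, so the specific choices below are assumptions: the box term is `1 - IoU`, and the classification term is a binary cross-entropy whose target is the IoU itself, so that a confident score is only rewarded when the box is well localized.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_aware_cls_loss(score, iou, eps=1e-12):
    """Binary cross-entropy of the predicted score against the IoU as a soft target."""
    return -(iou * math.log(score + eps) + (1.0 - iou) * math.log(1.0 - score + eps))


def detection_loss(pred_box, true_box, score):
    """Sum of an IoU box loss and an IoU-aware classification loss (unweighted)."""
    iou = box_iou(pred_box, true_box)
    return (1.0 - iou) + iou_aware_cls_loss(score, iou)


# Hypothetical boxes: the prediction is shifted by one unit in x and y.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(detection_loss((0, 0, 2, 2), (1, 1, 3, 3), score=0.9))
print(detection_loss((0, 0, 2, 2), (0, 0, 2, 2), score=0.9))
```

In this sketch a high score on a poorly overlapping box is penalized more than the same score on a perfectly aligned box, which is one way to couple confidence and localization.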

[0061] Simultaneously, a multi-level supervision strategy is employed, utilizing IoU-aware query selection and decoder to optimize the loss function. During training, features at different scales are supervised to ensure the model can accurately detect targets at various scales.

[0062] Evaluation indicators:

[0063] AP50: the average precision (AP) at an IoU (Intersection over Union) threshold of 0.50. IoU is the ratio of the overlap between the predicted bounding box and the ground-truth bounding box to the area of their union. AP50 measures the model's detection precision when a prediction counts as correct at an IoU of at least 0.50.

[0064] AP75: the average precision at an IoU threshold of 0.75. This is a more stringent evaluation criterion because a higher IoU threshold requires a larger overlap between the predicted bounding box and the ground-truth bounding box.

[0065] AP50-95: the average precision averaged over IoU thresholds ranging from 0.50 to 0.95. It is a comprehensive metric used to evaluate the overall performance of the model across different IoU thresholds.

[0066] AP_S, AP_M, AP_L: the average precision for small, medium, and large targets, respectively. Small targets are typically defined as objects with an area less than 32² pixels, medium targets as objects with an area between 32² and 96² pixels, and large targets as objects with an area greater than 96² pixels.
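The size buckets and the threshold range behind these metrics can be written down in a few lines of Python. The 0.05 threshold step follows the common COCO convention and is an assumption here, as is the handling of areas that fall exactly on a boundary; the function names are hypothetical.

```python
def size_bucket(width, height):
    """Classify a box by pixel area using the 32² and 96² cut-offs."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:  # boundary handling is an assumption
        return "medium"
    return "large"


# IoU thresholds 0.50, 0.55, ..., 0.95 (step of 0.05 assumed, as in COCO).
iou_thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def ap_50_95(ap_per_threshold):
    """Average the per-threshold AP values into a single AP50-95 score."""
    return sum(ap_per_threshold) / len(ap_per_threshold)


print(size_bucket(20, 20), size_bucket(50, 50), size_bucket(100, 100))
print(iou_thresholds)
```

On 3456×3456 images, a fruit that spans less than 1% of the image width already falls in the small bucket, which is why AP_S is the most relevant column for this task.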

[0067] FPS: frames per second, the number of frames the model processes per second during the inference phase, which is a metric for the model's real-time performance. A higher FPS indicates faster inference.

[0068] GFLOPs: the number of floating-point operations required by the model, in billions (10⁹), used to measure the computational complexity and hardware requirements of a model. The higher the GFLOPs, the greater the computational load of the model and the higher the hardware requirements.
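As a rough illustration of how these two figures are obtained, the sketch below estimates the cost of a single convolution layer and computes FPS from a timing. Counting one multiply-accumulate as two floating-point operations is a common convention but still an assumption, and the layer sizes are hypothetical values chosen only to compare a 3×3 kernel with the 17×17 kernel mentioned in paragraph [0042].

```python
def conv2d_gflops(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Approximate GFLOPs of one convolution layer (bias and activation ignored)."""
    macs = h_out * w_out * c_out * (c_in // groups) * kernel * kernel
    return 2 * macs / 1e9  # 1 multiply-accumulate counted as 2 operations


def fps(num_frames, elapsed_seconds):
    """Frames processed per second of wall-clock inference time."""
    return num_frames / elapsed_seconds


# Hypothetical 80x80 feature map with 256 input and output channels.
small_kernel = conv2d_gflops(80, 80, 256, 256, kernel=3)
large_kernel = conv2d_gflops(80, 80, 256, 256, kernel=17)
depthwise_large = conv2d_gflops(80, 80, 256, 256, kernel=17, groups=256)
print(small_kernel, large_kernel, depthwise_large)
print(fps(100, 4.0))
```

The cost of a dense convolution grows with the square of the kernel size, so large kernels are usually paired with channel compression or grouped convolutions to stay affordable.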

[0069] Experiment setup: The network architecture was implemented using PyTorch; Adam was used as the optimizer, with a total of 150 training epochs.

[0070] The learning rate of the backbone network is set to 10⁻⁴, and the learning rate for both the encoder and decoder is set to 10⁻⁵. Training was performed on an NVIDIA GeForce RTX 4090 with 24 GB of memory, and the training time was approximately 4 hours.
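Using different learning rates for the backbone and for the encoder and decoder is typically expressed as per-parameter groups. The helper below is a sketch that builds such groups as a list of dictionaries, which is the per-parameter option format accepted by PyTorch optimizers such as `torch.optim.Adam`. The `backbone.` name prefix and the function name are assumptions about how the model's parameters might be named, not details from the patent.

```python
def build_param_groups(named_params, backbone_lr=1e-4, other_lr=1e-5,
                       backbone_prefix="backbone."):
    """Split (name, parameter) pairs into two groups with separate learning rates."""
    backbone, other = [], []
    for name, param in named_params:
        if name.startswith(backbone_prefix):
            backbone.append(param)
        else:
            other.append(param)  # encoder and decoder parameters
    return [
        {"params": backbone, "lr": backbone_lr},
        {"params": other, "lr": other_lr},
    ]


# Stand-in parameter names; a real model would supply model.named_parameters().
demo = [("backbone.conv1.weight", "p0"),
        ("encoder.layer0.weight", "p1"),
        ("decoder.layer0.weight", "p2")]
groups = build_param_groups(demo)
print([(len(g["params"]), g["lr"]) for g in groups])
```

With a real model, the result could be passed directly as `torch.optim.Adam(build_param_groups(model.named_parameters()))`.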

[0071] Comparative Experiments: Our proposed method was compared with several baseline methods and state-of-the-art object detection methods on a self-collected dataset. Table 1 shows the quantitative results on the dataset, where our method achieved the best performance in all cases. The best and second-best results are marked in bold in the table.

[0072] Table 1:

[0073] ;

[0074] Complexity analysis: This invention designs models with different complexities to provide more diverse choices. The experimental results are shown in Table 2.

[0075] Table 2 Comparison of the complexity of different models. * indicates the results measured in the experiment with a batch size of 1.

[0076] .

[0077] Furthermore, this invention visualizes the model detection results, as shown in Figure 4. Ground Truth represents the ground-truth bounding box of the citrus target, and Prediction represents the model's predicted bounding box.

Claims

1. A method for citrus detection using a frequency-domain aggregation attention mechanism and a multi-scale encoder, characterized in that the specific steps are as follows:
Step 1: Construct a self-collected citrus detection dataset, including citrus images with occluded and small fruits, as well as the locations of the citrus bounding boxes;
Step 2: Design a Frequency Aggregation Attention Network (FAN) to perform a two-dimensional discrete wavelet transform (2DDWT) on the input feature map, decomposing it into different frequency sub-bands; information compression is then performed in both the channel and frequency dimensions to enhance the frequency features of small and occluded targets. Specifically, this includes:
Step 2.1: Perform 2DDWT decomposition on the input feature map to obtain low-frequency and high-frequency sub-bands, calculated using the following formula: $$X_{LL} = \downarrow_c\big(f_L * \downarrow_r(f_L * X)\big), \quad X_{LH} = \downarrow_c\big(f_H * \downarrow_r(f_L * X)\big), \quad X_{HL} = \downarrow_c\big(f_L * \downarrow_r(f_H * X)\big), \quad X_{HH} = \downarrow_c\big(f_H * \downarrow_r(f_H * X)\big)$$ where $f_L$ and $f_H$ are the low-pass and high-pass filters of the 2DDWT, respectively; $X$ denotes the input feature map; $\downarrow_r$ and $\downarrow_c$ denote downsampling along rows and columns, respectively; and $X_{LL}$, $X_{LH}$, $X_{HL}$, $X_{HH}$ are the four sub-bands after the wavelet transform;
Step 2.2: Compress the channel dimension of each frequency sub-band separately, and use large convolutional kernels to expand the receptive field and enhance the contextual semantics;
Step 2.3: Perform global average pooling on the frequency sub-bands, calculate the weights through the dimensionality-reducing fully connected layer and the activation layer, and perform the final scaling operation to obtain the enhanced feature map; in the corresponding formula, the operators denote the two-dimensional inverse wavelet transform, the large-kernel convolution operation, the fully connected operation, global average pooling, the activation operation in channel attention, the concatenation operation, and element-wise multiplication, and the remaining symbols denote the intermediate features and the enhanced feature map;
Step 3: Develop a multi-scale Transformer encoder, combining the self-attention mechanism with convolutional feature pyramid operations, and perform cross-scale feature fusion through the Haar wavelet fusion module to preserve high-frequency details. Specifically, this includes:
Step 3.1: Top-down and bottom-up paths are used to align and deepen multi-scale features using 1×1 and 3×3 convolutions, respectively;
Step 3.2: The Haar wavelet fusion module uses the Haar wavelet transform to decompose large-scale features, concatenates them with small-scale features to enhance low-frequency information, and obtains the final fused output through residual operations and the inverse wavelet transform; in the corresponding formula, the symbols denote the large-scale feature map at a lower level, the small-scale feature map at a higher level, the n enhancement modules, the intermediate features, and the final fused output;
Step 4: Use IoU-aware (Intersection over Union) query selection and the decoder, and optimize the loss function;
Step 5: Import the trained network model and input the test set into the network to test its performance;
Step 6: Use evaluation indicators to evaluate and analyze the test results.

2. The citrus detection method using a frequency-domain aggregation attention mechanism and a multi-scale encoder according to claim 1, characterized in that the loss function includes: Step 4.1: a bounding box regression loss ($\mathcal{L}_{box}$) and a classification loss ($\mathcal{L}_{cls}$). The classification loss incorporates IoU information to minimize the inconsistency between the confidence score and the localization quality, thus promoting model convergence: $$\mathcal{L} = \mathcal{L}_{box}(\hat{b}, b) + \mathcal{L}_{cls}(\hat{C}, C, \mathrm{IoU})$$ where $b$ represents the ground-truth bounding box, $\hat{b}$ represents the predicted bounding box, $C$ represents the true category, $\hat{C}$ represents the predicted category, and IoU represents the intersection over union between the predicted bounding box and the ground-truth bounding box.