A target detection method and device

By introducing a target detection method that combines color prior attention and rotation-aware position coding, the problems of inaccurate positioning, sensitivity to color interference, and weak detail capture in existing fruit detection technologies are solved. This achieves high-precision fruit detection and a lightweight network structure, making it suitable for automated agricultural harvesting systems.

CN122244852A · Pending · Publication Date: 2026-06-19 · GUILIN UNIV OF ELECTRONIC TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GUILIN UNIV OF ELECTRONIC TECH
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing fruit detection technologies suffer from problems such as inaccurate positioning, sensitivity to color interference, weak ability to capture details, and overly general network structure when facing complex agricultural environments, making it difficult to meet the high-precision automatic harvesting needs of fruits and vegetables such as tomatoes.

Method used

We adopt a target detection method based on color prior attention and rotation-aware position coding. By using the StarNet backbone network, color prior spatial-channel attention module (CPSCA) and multi-level attention fusion strategy, combined with rotation-aware position coding module (RPE), we can improve feature representation ability and robustness.

Benefits of technology

It improves the positioning accuracy of rotating targets, enhances color discrimination capabilities, improves multi-scale fusion and occlusion detection capabilities, and realizes a lightweight and efficient network structure, which is suitable for resource-constrained agricultural application environments.



Abstract

This invention discloses a target detection method and device based on color prior attention and rotation-aware position encoding. The core steps of the method are: 1. Receiving RGB images as the basic data for feature extraction; 2. Extracting rich visual features from the image using a StarNet backbone network (containing depthwise separable convolutions and multi-stage modules) to enhance target representation in complex backgrounds; 3. Enhancing target region response and suppressing background noise by combining the CPSCA module with the tomato red channel prior; 4. Employing a multi-level attention fusion strategy to integrate deep and shallow features to improve expressive power; 5. Improving the accuracy of rotating target localization through a rotation-aware position encoding module; 6. Outputting a rotated bounding box to achieve high-precision detection of tomatoes in any direction.

Description

Technical Field

[0001] This invention relates to the field of crop fruit detection technology, and in particular to a target detection method and device based on color prior attention and rotation-aware position encoding.

Background Technology

[0002] With the continuous advancement of modern agriculture towards large-scale and intelligent operations, the reliance on automated identification and positioning technologies in fruit and vegetable harvesting is increasing. Tomatoes, as a vegetable crop with a short ripening cycle, high yield, and wide application, have their harvesting efficiency directly impacting economic benefits. Currently, manual harvesting remains the mainstream method, facing problems such as high labor intensity, low efficiency, and high costs, making it difficult to meet the needs of large-scale planting. Therefore, automatic fruit detection methods based on computer vision have become one of the key technologies for the development of intelligent agriculture.

[0003] In recent years, deep learning technology has achieved remarkable results in the field of fruit detection. By constructing end-to-end target detection models, it has effectively improved the recognition accuracy of fruits and vegetables such as tomatoes in complex environments. However, agricultural scenarios commonly present problems such as changes in lighting, occlusion by branches and leaves, overlapping fruits, and background interference, posing significant challenges to existing detection methods. Furthermore, tomato fruits often exhibit arbitrary distributions in their natural state, possessing geometric features such as density, tilt, and elongation. Existing detection frameworks mostly employ horizontal bounding boxes (HBB), which are prone to redundant enclosing when dealing with such complex targets, leading to inaccurate localization and affecting the performance of downstream tasks such as pose estimation and grasping path planning.

[0004] Furthermore, although existing research has attempted to introduce attention mechanisms to improve the model's ability to focus on key regions, it generally neglects the most discriminative prior information in agricultural images—color features. For example, tomato fruits have a typical red channel advantage, but existing network structures fail to effectively model color priors, leading to a significant drop in detection accuracy in cluttered backgrounds or environments with strong light interference. At the same time, most current methods only embed attention modules at specific network layers, lacking sufficient integration of shallow details and deep semantic features, thus limiting the recognition performance of small targets and occluded areas.

[0005] In terms of backbone network design, existing methods mostly use general networks (such as MobileNet, DarkNet, etc.) or lightweight structures as feature extraction skeletons. Although they have certain speed advantages, they do not fully consider the detection needs of dense targets, small scale and multiple interferences in agricultural scenarios, and show problems such as poor structural adaptability and insufficient detection accuracy.

[0006] In summary, current fruit and vegetable detection technologies still have the following prominent problems when applied to fruit harvesting scenarios:

[0007] (1) Most current target detection methods use horizontal bounding boxes for target localization, which is difficult to adapt to the arbitrary angles, postures and arrangements of fruits during natural growth. Especially in typical scenarios such as clustered tomatoes, the horizontal bounding boxes often suffer from redundant coverage and background interference, resulting in a decrease in localization accuracy;

[0008] (2) Although some detection networks have introduced attention mechanisms such as CBAM and GAM to improve feature extraction, color information, a key discriminative feature in agricultural images (especially the red channel distribution of the fruit), has not been explicitly modeled or enhanced. This makes the model prone to false detections under color interference, resulting in insufficient robustness;

[0009] (3) Existing methods mostly embed attention modules only in specific single layers, lacking an effective cross-layer fusion mechanism, and fail to fully integrate shallow edge details and deep semantic features, thus limiting the model's ability to express small targets, occluded areas and edge areas, affecting detection integrity and accuracy;

[0010] (4) Currently widely used backbone networks (such as Darknet, MobileNet, etc.) are mostly general-purpose or lightweight designs. Although they have certain advantages in detection speed, they perform poorly in actual agricultural scenarios, such as complex lighting, overlapping fruits, and shading of branches and leaves. They lack structural adaptation and accuracy optimization for agricultural target characteristics.

[0011] In summary, existing fruit detection technologies still suffer from problems such as inaccurate localization, sensitivity to color interference, weak detail capture ability, and overly general network structures when detecting typical agricultural targets such as tomatoes. There is an urgent need for an improved detection method with stronger feature expression, high robustness, and accurate localization to meet the application needs of automated harvesting systems in actual agricultural production.

Summary of the Invention

[0012] The present invention aims to solve at least one of the technical problems existing in the prior art or related art.

[0013] Therefore, the purpose of this invention is to provide a target detection method and device based on color prior attention and rotation-aware position encoding, which can improve the detection capability for fruit targets and promote the application of computer vision technology in agricultural production.

[0014] To achieve the above objectives, the first aspect of the present invention provides a target detection method based on color prior attention and rotation-aware position encoding, comprising:

[0015] Receiving an RGB image $I_{RGB} \in \mathbb{R}^{H \times W \times 3}$ as the basic data for feature extraction in target fruit detection;

[0016] The RGB image is input into the optimized StarNet backbone network. Through the Stem layer, N feature extraction stages, and iterative processing of multiple blocks within each stage, a set of multi-scale features $\{F_1, F_2, \ldots, F_N\}$ is output. StarNet adopts a block structure with adjustable width, residual branches, and depthwise separable convolutions. Each block achieves feature transformation through a depthwise separable convolution and two 1×1 convolution branches, and combines DropPath regularization and residual connections to output updated features;

[0017] For the multi-scale features, the channel attention weights $M_c$ are first calculated through global average pooling, max pooling, and a shared fully connected layer, and the features are channel-weighted to obtain X1. Then, based on the color prior, the color channel features corresponding to the target are extracted and combined with the global average pooling feature $M_{avg}$ and the max pooling feature $M_{max}$ of X1 to generate the spatial attention weights $M_s$; spatially weighting X1 with $M_s$ yields the enhanced feature $X_{out}$;

[0018] A multi-level attention fusion strategy is adopted: CPSCA modules with shared parameters are embedded in the shallow features $F_{low}$ and the deep features $F_{deep}$ of the StarNet backbone network, performing background noise suppression and local texture enhancement on the shallow features and semantic information fusion on the deep features. The enhanced shallow attention features $F^{att}_{low}$ and deep attention features $F^{att}_{deep}$ are then merged by a subsequent fusion module to obtain multi-scale fused features;

[0019] First, the feature spatial location (x, y) in the multi-scale fused features is converted into polar coordinate form $(r, \theta)$, where $r$ is the radial distance and $\theta$ is the angle. Then, position encoding vectors $PE_r$ and $PE_\theta$ are generated for $r$ and $\theta$ respectively using sine and cosine encoding functions. After $PE_r$ and $PE_\theta$ are fused, the result is added to the input feature to obtain the rotation-aware enhanced feature $F_{rpe}$;

[0020] The rotation-aware enhanced feature $F_{rpe}$ is input into the detection head to generate rotated bounding box predictions, achieving high-precision detection of targets in any pose.

[0021] In the above technical solution, preferably, the StarNet backbone network includes N=4 feature extraction stages, and the number of blocks in each stage is set as follows:

[0022] Stage 1: Contains 2 blocks, used to extract shallow edge and color texture features;

[0023] Stage 2: Contains 3 blocks to enhance the expression of mid-to-low-level structural features;

[0024] Stage 3: Contains 5 blocks for extracting high-level semantic features;

[0025] Stage 4: Contains 2 blocks, used to form the final deep semantic representation;

[0026] In each block, the convolution kernels W1 and W2 of the two 1×1 convolution branches are both 1×1 in size. The depthwise separable convolution DSConv uses a 7×7 convolution kernel for its depthwise convolution part and a 1×1 convolution kernel for its pointwise convolution part. The convolution parameters of W1, W2 and DSConv are all initialized using the He initialization method. The probability p of the DropPath increases linearly with the network depth. The p of shallow blocks is 0 or close to 0, and the p of deep blocks does not exceed 0.1-0.2.
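As a non-limiting illustration, the linearly increasing DropPath probability described above can be sketched as follows. The function name `drop_path_schedule` and the default maximum rate of 0.1 are illustrative assumptions (the text only bounds the deep-block rate at 0.1-0.2); the block counts (2, 3, 5, 2) are taken from the stage configuration above.

```python
def drop_path_schedule(blocks_per_stage=(2, 3, 5, 2), max_rate=0.1):
    """Return one DropPath probability per block, rising linearly with depth.

    Hypothetical helper: max_rate is an assumed value within the stated
    0.1-0.2 upper bound for deep blocks.
    """
    total = sum(blocks_per_stage)
    denom = max(total - 1, 1)  # avoid division by zero for a single block
    # Block 0 gets p = 0; the last block gets p = max_rate.
    rates = [max_rate * i / denom for i in range(total)]
    # Regroup the flat list of rates by stage.
    schedule, start = [], 0
    for n in blocks_per_stage:
        schedule.append(rates[start:start + n])
        start += n
    return schedule
```

With the default arguments, the first block of Stage 1 receives p = 0 and the last block of Stage 4 receives approximately p = 0.1.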

[0027] In any of the above technical solutions, preferably, in the CPSCA module, the shared fully connected layer for channel attention adopts a two-layer structure. The number of neurons in the first fully connected layer is C / r, where C is the number of input feature channels and r is the channel compression ratio. The number of neurons in the second fully connected layer is C, and the channel compression ratio r is 8 or 16. The ReLU activation function is used after the first fully connected layer, and the Sigmoid activation function is used after the second fully connected layer. When calculating the color prior weight Mp, a 3×3 convolution kernel is used for the extracted target color channel features, with a stride of 1 and a padding method of "same".

[0028] In any of the above technical solutions, preferably, the subsequent fusion module of the multi-level attention fusion strategy has the following specific structure: first, the shallow attention feature $F^{att}_{low}$ is upsampled so that its spatial size is consistent with that of the deep attention feature $F^{att}_{deep}$; then the aligned $F^{att}_{low}$ and $F^{att}_{deep}$ are concatenated along the channel dimension to form the fused feature $F_{cat}$; finally, a 1×1 convolution is applied to $F_{cat}$ for channel compression and information recombination, yielding the final fused feature $F_{fusion}$. The weight parameters of the 1×1 convolution are learned adaptively through end-to-end training.
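The fusion structure above (spatial alignment, channel concatenation, 1×1 convolution) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the resampling operator is not specified in the text, so a nearest-neighbour resize stands in for it, and the names `resize_nearest`, `fuse`, and `w_1x1` are hypothetical. In practice the 1×1 weights would be learned during training rather than supplied by hand.

```python
import numpy as np

def resize_nearest(x, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) array to (C, out_h, out_w)."""
    _, h, w = x.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return x[:, rows[:, None], cols[None, :]]

def fuse(f_att_low, f_att_deep, w_1x1):
    """Align spatial sizes, concatenate on channels, then apply a 1x1 convolution.

    f_att_low:  (C_low, H_low, W_low) shallow attention feature
    f_att_deep: (C_deep, H, W) deep attention feature
    w_1x1:      (C_out, C_low + C_deep) weights of the 1x1 convolution
    """
    _, h, w = f_att_deep.shape
    low_aligned = resize_nearest(f_att_low, h, w)
    f_cat = np.concatenate([low_aligned, f_att_deep], axis=0)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", w_1x1, f_cat)
```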

[0029] In any of the above technical solutions, preferably, the specific expression of the sine and cosine coding functions is as follows:

[0030] The position encoding for radial distance r is:

[0031] , ;

[0032] The position code for angle θ is:

[0033] , ;

[0034] Where k is the encoding dimension index and d is the dimension of the position encoding vector. $PE_r$ and $PE_\theta$ are fused by concatenation along the channel dimension, and the concatenated position encoding feature PE is fused with the input feature by addition, i.e., $F_{rpe} = F_{in} + PE$, where PE represents the position encoding feature generated by the radial distance and angle encodings.
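One possible realization of the rotation-aware position encoding is sketched below. Several details are assumptions made only for illustration and are not confirmed by the text: the polar origin is placed at the feature-map center, the conversion uses $r = \sqrt{x^2 + y^2}$ and $\theta = \operatorname{atan2}(y, x)$, and the sine/cosine frequencies follow the common Transformer-style form $1 / 10000^{2k/d}$. The function names are hypothetical. Each of $PE_r$ and $PE_\theta$ takes half of the channels so that their concatenation can be added to $F_{in}$.

```python
import numpy as np

def sincos_encode(v, d, base=10000.0):
    """Encode a scalar map v of shape (H, W) into d channels (even: sin, odd: cos)."""
    k = np.arange(d // 2)
    freq = 1.0 / base ** (2 * k / d)            # assumed Transformer-style frequencies
    angles = v[None, :, :] * freq[:, None, None]
    pe = np.empty((d,) + v.shape)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

def rotation_aware_pe(f_in):
    """Return F_rpe = F_in + PE for a (C, H, W) feature map, with C divisible by 4."""
    c, h, w = f_in.shape
    assert c % 4 == 0, "C must split into two even-sized halves"
    # Assumption: (x, y) are measured from the feature-map center.
    ys, xs = np.meshgrid(np.arange(h) - (h - 1) / 2,
                         np.arange(w) - (w - 1) / 2, indexing="ij")
    r = np.sqrt(xs ** 2 + ys ** 2)              # radial distance
    theta = np.arctan2(ys, xs)                  # angle in (-pi, pi]
    pe = np.concatenate([sincos_encode(r, c // 2),
                         sincos_encode(theta, c // 2)], axis=0)
    return f_in + pe
```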

[0035] In any of the above technical solutions, preferably, it further includes optimizing the network using a multi-task joint loss function, the overall loss function being expressed as:

[0036] $L = \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg} + \lambda_{ang} L_{ang}$;

[0037] Where: $L_{cls}$ is the classification loss, for which cross-entropy loss or FocalLoss is used; $L_{reg}$ is the rotated bounding box position regression loss, for which an IoU-based loss function is used, including but not limited to RotatedIoULoss or GIoULoss; $L_{ang}$ is the angle regression loss, for which SmoothL1Loss or a periodicity-aware angle loss function is used to reduce the angle prediction error; the weighting coefficients $\lambda_{cls}$, $\lambda_{reg}$, and $\lambda_{ang}$ are set empirically through experiments or determined by automatic adjustment methods.
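The weighted combination of the three loss terms, together with one possible periodicity-aware angle loss, can be sketched as follows. The sketch assumes a box orientation period of π (a rectangle is unchanged by a 180° rotation) and default weights of 1.0; neither value is given in the text, and the function names are hypothetical.

```python
import math

def smooth_l1(x, beta=1.0):
    """Scalar Smooth L1: quadratic near zero, linear beyond beta."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta

def periodic_angle_loss(pred, target, period=math.pi):
    """Smooth L1 on the angle difference wrapped into [-period/2, period/2).

    Assumption: orientations that differ by a multiple of `period` are equivalent.
    """
    diff = (pred - target + period / 2) % period - period / 2
    return smooth_l1(diff)

def total_loss(l_cls, l_reg, l_ang, lam_cls=1.0, lam_reg=1.0, lam_ang=1.0):
    """L = lam_cls * L_cls + lam_reg * L_reg + lam_ang * L_ang."""
    return lam_cls * l_cls + lam_reg * l_reg + lam_ang * l_ang
```

Wrapping the difference before applying Smooth L1 avoids penalizing predictions that differ from the label by nearly a full period but describe almost the same box.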

[0038] In any of the above technical solutions, preferably, the data augmentation methods used during the training process include: geometric augmentation (random rotation within ±45°, random scaling with a ratio of 0.8-1.2, and random translation); color augmentation (random adjustment of brightness, contrast, and saturation within ±20% of the original value); random horizontal flipping; and Mosaic or MixUp blending augmentation. For every augmentation operation, the rotated bounding box annotations are adjusted at the same time to maintain consistency.
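To illustrate how a rotated bounding box annotation can be kept consistent with a random image rotation, a minimal sketch is given below. It assumes boxes are stored as (cx, cy, w, h, θ) with θ in radians, that the image is rotated about its center, and that the box angle is measured in the same rotational sense as the image rotation. These conventions and the function name `rotate_obb` are assumptions, since the annotation format is not specified in the text.

```python
import math

def rotate_obb(box, angle_deg, img_w, img_h):
    """Rotate an oriented box (cx, cy, w, h, theta_rad) about the image center."""
    cx, cy, w, h, theta = box
    a = math.radians(angle_deg)
    ox, oy = img_w / 2.0, img_h / 2.0
    dx, dy = cx - ox, cy - oy
    # Standard 2-D rotation of the box center around the image center.
    new_cx = ox + dx * math.cos(a) - dy * math.sin(a)
    new_cy = oy + dx * math.sin(a) + dy * math.cos(a)
    # Width and height are unchanged; the box angle shifts by the same amount.
    return (new_cx, new_cy, w, h, theta + a)
```

During training, `angle_deg` would be drawn from the ±45° range, for example with `random.uniform(-45.0, 45.0)`, and the same angle would be applied to the image.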

[0039] In any of the above technical solutions, preferably, each block applies a depthwise separable convolution and two 1×1 convolution branches to its input features to generate the feature transformation y. The specific calculation formula is as follows:

[0040] Where * denotes a convolution operation, DSConv denotes depthwise separable convolution, W1 and W2 are the convolution kernel parameters, and the input is the feature of the (j-1)-th block within the i-th feature extraction stage;

[0041] During the training phase, DropPath regularization is either applied to the feature transformation result y or skipped;

[0042] Finally, the residual connection output is used to update the features.

[0043] The second aspect of the present invention provides a target detection system based on color prior attention and rotation-aware position encoding, comprising:

[0044] The input initialization module is configured to receive an RGB image $I_{RGB} \in \mathbb{R}^{H \times W \times 3}$ as the basic data for feature extraction in target fruit detection;

[0045] The SNet'-YOLO backbone network module is configured to input the RGB image into the optimized StarNet' backbone network. Through Stem layers, N feature extraction stages, and cyclic processing of multiple blocks within each stage, it outputs a set of multi-scale features {F1, F2, ..., FN}. The StarNet' adopts a block structure with adjustable width, residual branches, and depthwise separable convolutions. Each block achieves feature transformation through depthwise separable convolutions and two 1×1 convolution branches, and combines DropPath regularization and residual connections to output updated features.

[0046] The color prior space-channel attention module is configured to first calculate the channel attention weights $M_c$ of the multi-scale features through global average pooling, max pooling, and a shared fully connected layer, and to channel-weight the features to obtain X1; then, based on the color prior, to extract the color channel features corresponding to the target and combine them with the global average pooling feature $M_{avg}$ and the max pooling feature $M_{max}$ of X1 to generate the spatial attention weights $M_s$, with which X1 is spatially weighted to yield the enhanced feature $X_{out}$;

[0047] The multi-level attention fusion strategy module is configured to employ a multi-level attention fusion strategy in which CPSCA modules with shared parameters are embedded in the shallow features $F_{low}$ and the deep features $F_{deep}$ of the StarNet backbone network, performing background noise suppression and local texture enhancement on the shallow features and semantic information fusion on the deep features. The enhanced shallow attention features $F^{att}_{low}$ and deep attention features $F^{att}_{deep}$ are then merged by a subsequent fusion module to obtain multi-scale fused features;

[0048] The rotation-aware position encoding module is configured to first convert the feature spatial position (x, y) in the multi-scale fused features into polar coordinates $(r, \theta)$, where $r$ is the radial distance and $\theta$ is the angle. Then, position encoding vectors $PE_r$ and $PE_\theta$ are generated for $r$ and $\theta$ respectively using sine and cosine encoding functions. After $PE_r$ and $PE_\theta$ are fused, the result is added to the input feature to obtain the rotation-aware enhanced feature $F_{rpe}$. The rotation-aware enhanced feature $F_{rpe}$ is input into the detection head to generate rotated bounding box predictions, achieving high-precision detection of targets in any pose.

[0049] The third aspect of the present invention provides a computer device, including a storage medium and a processor; the storage medium is used to store a computer program; the processor is used to execute the computer program to implement the steps of the target detection method based on color prior attention and rotation-aware position encoding provided in the first aspect of the present invention.

[0050] Compared with existing technologies, the target detection method and device based on color prior attention and rotation-aware position encoding provided by this invention have the following advantages:

[0051] 1. Improve the accuracy of rotating target localization: This invention introduces a rotation-aware position coding module (RPE), which enhances the model's ability to model direction and angle information through polar coordinate coding, effectively improving the fitting accuracy of the rotating bounding box, and is suitable for fruit detection tasks with varied postures in natural conditions.

[0052] 2. Enhanced color discrimination ability and robustness: The proposed Color Prior Space-Channel Attention Module (CPSCA) explicitly introduces the red channel prior, guiding the model to focus on the salient area of the fruit, effectively suppressing background interference, and improving the detection accuracy under complex lighting and background conditions.

[0053] 3. Enhance multi-scale fusion and occlusion detection capabilities: By introducing the CPSCA module into shallow and deep features and adopting a parameter sharing approach, the model effectively integrates local details with global semantics, significantly improving the detection performance of the model in small target and occlusion scenarios.

[0054] 4. Lightweight and efficient network structure: The optimized StarNet backbone network adopts a block structure and depthwise separable convolution, which significantly reduces computational overhead while ensuring feature representation capabilities, making it suitable for resource-constrained agricultural application environments.

[0055] 5. Strong module versatility, easy integration and expansion: The CPSCA and RPE modules have good compatibility and can be flexibly embedded into various rotated target detection frameworks. They are suitable for other detection tasks with distinctive colors or varied postures and have high application value.

Description of the Drawings

[0056] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the description of the embodiments taken in conjunction with the following drawings, in which:

[0057] Figure 1 shows a flowchart of the target detection method according to an embodiment of the present invention;

[0058] Figure 2 shows a flowchart of the feature acquisition process of StarNet according to an embodiment of the present invention;

[0059] Figure 3 shows a flowchart of the feature acquisition process of the CPSCA module according to an embodiment of the present invention.

Detailed Implementation

[0060] To better understand the above-mentioned objectives, features, and advantages of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.

[0061] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and therefore the scope of protection of the invention is not limited to the specific embodiments disclosed below.

[0062] The purpose of this invention is to provide an improved target detection method based on color prior attention and rotation-aware position encoding, thereby enhancing the detection capability of fruit targets and promoting the application of computer vision technology in agricultural production.

[0063] The flowchart is shown in Figure 1:

1. Input RGB Image: The system first receives the original three-channel color image as the basis for subsequent feature extraction.

2. Backbone Network: The input image is processed by an optimized StarNet backbone network, which uses depthwise separable convolutions and multi-stage modules to extract rich visual features, effectively enhancing the representation of targets in complex backgrounds.

3. Color Prior Space-Channel Attention Module (CPSCA): By combining color prior knowledge, in particular a spatial and channel attention mechanism for the red channel of tomato fruit, this module enhances the model's response to the target region while suppressing background noise and improving the discriminative power of features.

4. Multi-level Attention Fusion Strategy: Attention mechanisms are embedded in the shallow and deep features respectively to fuse local details with global semantic information, enhance the expressive power of high-level features, and refine target detection.

5. Rotation-Aware Position Encoding Module: Rotation-aware position encoding is introduced, enabling the model to perceive and represent the spatial position of rotated targets more accurately and improving the localization accuracy of rotated bounding boxes.

6. Output Rotated Bounding Box Prediction Results: The model generates accurate rotated bounding boxes, achieving high-precision detection of tomato fruits in any orientation.

[0064] To achieve the above objectives, this invention proposes a target detection method based on color prior attention and rotation-aware position encoding, the details of which are as follows:

[0065] As shown in Figures 1 to 3, a target detection method based on color prior attention and rotation-aware position encoding according to an embodiment of the present invention includes:

[0066] Receiving an RGB image $I_{RGB} \in \mathbb{R}^{H \times W \times 3}$ as the basic data for feature extraction in target fruit detection;

[0067] The RGB image is input into the optimized StarNet backbone network. Through the Stem layer, N feature extraction stages, and iterative processing of multiple blocks within each stage, a set of multi-scale features $\{F_1, F_2, \ldots, F_N\}$ is output. StarNet adopts a block structure with adjustable width, residual branches, and depthwise separable convolutions. Each block achieves feature transformation through a depthwise separable convolution and two 1×1 convolution branches, and combines DropPath regularization and residual connections to output updated features.

[0068] In this step, to improve the performance of the object detection network in complex backgrounds and on non-rigid objects, the SNet'-YOLO framework is proposed. This structure is built on the YOLOv8-OBB architecture and introduces an optimized lightweight backbone network, StarNet'. StarNet' employs a block structure with adjustable width, residual branches, and depthwise separable convolutions, effectively controlling the number of parameters and computational complexity while improving the model's representational capabilities. Its structure is particularly suitable for target fruit detection in complex scenes. Specifically, the feature acquisition process of StarNet' is shown in Figure 2:

[0069] Step 1-1: Input Initialization

[0070] The input image is represented as $I_{RGB} \in \mathbb{R}^{H \times W \times 3}$. The initial feature F0 is obtained after the image passes through the Stem layer, which consists of a convolution and a ReLU6 activation function to ensure basic feature representation and high computational efficiency.

[0071] Step 1-2: Stage Iteration

[0072] The network consists of N stages. Each stage takes the input features $F_{i-1}$ and transforms them into higher-level features $F_i$ through downsampling and convolution operations, enabling multi-scale feature extraction.

[0073] Step 1-3: Intra-stage block loop

[0074] Each stage contains multiple blocks. Each block applies a depthwise separable convolution and two 1×1 convolution branches to its input features to generate the feature transformation y. The specific calculation formula is as follows:

[0075] Where * denotes a convolution operation, DSConv denotes depthwise separable convolution, W1 and W2 are the convolution kernel parameters, and the input is the feature of the (j-1)-th block within the i-th feature extraction stage.

[0076] Step 1-4: DropPath selection and residual connection

[0077] During the training phase, DropPath regularization is either applied to the feature transformation result y or skipped;

[0078] Finally, the residual connection output is used to update the features.
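Because the block formulas are not reproduced in the text above, the following NumPy sketch shows only one plausible reading of Steps 1-3 and 1-4. It assumes that the two 1×1 branches are combined by element-wise multiplication with ReLU6 on the first branch (the "star operation" of the original StarNet design), that DropPath drops the whole transformation with probability p and rescales it by 1/(1-p) otherwise, and that channel counts are kept equal so the residual addition is valid. All function and parameter names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv(x, k_dw):
    """Depthwise 'same' convolution: x is (C, H, W), k_dw is (C, k, k) with odd k."""
    pad = k_dw.shape[-1] // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    win = sliding_window_view(xp, k_dw.shape[-2:], axis=(1, 2))  # (C, H, W, k, k)
    return np.einsum("chwij,cij->chw", win, k_dw)

def pointwise_conv(x, w):
    """1x1 convolution: w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def star_block(x, k_dw, w_pw, w1, w2, drop_p=0.0, training=False, rng=None):
    """One block: DSConv, two 1x1 branches, optional DropPath, residual add."""
    z = pointwise_conv(depthwise_conv(x, k_dw), w_pw)  # DSConv = depthwise + pointwise
    a = pointwise_conv(z, w1)                          # branch with W1
    b = pointwise_conv(z, w2)                          # branch with W2
    y = np.clip(a, 0.0, 6.0) * b                       # assumed: ReLU6(a) times b, element-wise
    if training and drop_p > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random() >= drop_p
        # Assumed DropPath form: drop the whole branch or rescale it.
        y = y / (1.0 - drop_p) if keep else np.zeros_like(y)
    return x + y                                       # residual connection
```

At inference time (`training=False`) DropPath is skipped and the block reduces to `x + y`.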

[0079] Step 1-5: Stage Feature Output

[0080] After all blocks have been processed, the number of input channels is updated and the next stage begins. After all stage iterations are complete, the feature set $\{F_1, F_2, \ldots, F_N\}$ of the stages is output for use by subsequent modules.

[0081] For the multi-scale features, the channel attention weights $M_c$ are first calculated through global average pooling, max pooling, and a shared fully connected layer, and the features are channel-weighted to obtain X1. Then, based on the color prior, the color channel features corresponding to the target are extracted and combined with the global average pooling feature $M_{avg}$ and the max pooling feature $M_{max}$ of X1 to generate the spatial attention weights $M_s$; spatially weighting X1 with $M_s$ yields the enhanced feature $X_{out}$;

[0082] In this step, to address the fact that tomato fruit has distinct color features in images but is affected by lighting and background interference, a color-guided spatial-channel joint attention module (CPSCA) is proposed. This module explicitly introduces the distribution of the target color channel as a prior, guiding the network to focus more on the fruit region during channel selection and spatial focusing. Unlike traditional attention modules that passively respond to salient regions, CPSCA actively enhances the weights of color-related regions. To avoid overfitting and reduced generalization, the color prior is fused with the original features through learnable guiding weights, realizing a data-driven color-guided perception mechanism. This module is also applicable to other target detection tasks with high color discrimination and has good scalability. The feature acquisition process of the CPSCA module is shown in Figure 3:

[0083] Step 2-1: Input feature map X

[0084] The input feature map, denoted as X(B,C,H,W), with batch size B, number of channels C, height H, and width W, serves as the input to this module.

[0085] Step 2-2: Channel Attention Calculation

[0086] Determine whether channel attention is enabled. If enabled, first perform global average pooling and max pooling on X to obtain two channel description vectors, then feed these vectors into a shared fully connected layer (fc) to calculate the channel attention weights $M_c$ (with dimensions 1×C×1×1). $M_c$ is applied to X by broadcast multiplication to obtain the weighted feature map $X1 = X \times M_c$. If not enabled, simply set X1 = X.
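The channel attention calculation of Step 2-2 can be sketched in NumPy as follows, using the two-layer shared fully connected structure (C → C/r → C, ReLU then Sigmoid) described earlier. Summing the outputs for the two pooled vectors before the Sigmoid follows the CBAM convention and is an assumption, as are the function and weight names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w_fc1, w_fc2):
    """Return (X1, M_c) for x of shape (B, C, H, W).

    w_fc1: (C // r, C) first shared FC layer, followed by ReLU
    w_fc2: (C, C // r) second shared FC layer, followed by Sigmoid
    """
    avg = x.mean(axis=(2, 3))          # global average pooling -> (B, C)
    mx = x.max(axis=(2, 3))            # global max pooling     -> (B, C)

    def shared_fc(v):
        return np.maximum(v @ w_fc1.T, 0.0) @ w_fc2.T

    # Assumption: the two branches are summed before the Sigmoid (CBAM-style).
    m_c = sigmoid(shared_fc(avg) + shared_fc(mx))[:, :, None, None]  # (B, C, 1, 1)
    return x * m_c, m_c                # broadcast multiplication gives X1
```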

[0087] Step 2-3: Spatial and color prior attention judgment

[0088] Determine whether the fusion mechanism of spatial attention and color prior is enabled. If not enabled, skip this step and output $X_{out}$ = X1.

[0089] Step 2-4: Extraction of color prior weights

[0090] If spatial + color prior attention is enabled, further determine whether to use the color prior weights. If used, extract the 0th channel (red channel) R=X1[:,0,:,:] from X1 and convolve it to calculate the color prior weights $M_p$. If the color prior is not used, $M_p$ is set to a zero tensor.

[0091] Step 2-5: Pooling Feature Calculation and Fusion

[0092] Calculate the global average pooling feature $M_{avg}$ and the max pooling feature $M_{max}$ of X1 (with the same dimensions as $M_c$). $M_{avg}$, $M_{max}$, and the color prior weights $M_p$ (if applicable) are concatenated along the channel dimension to obtain the fused feature $M_{cat}$.

[0093] Step 2-6: Spatial attention weight calculation and weighting

[0094] The spatial attention weights $M_s$ are calculated by convolving the fused feature $M_{cat}$; their dimensions match the spatial size of the input features. Finally, X1 is weighted to obtain the output feature map $X_{out} = X1 \times M_s$, whose size is the same as that of the input X.
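Steps 2-4 to 2-6 can be sketched in NumPy as follows. This is a minimal sketch under several assumptions not confirmed by the text: the average and max pooling are taken across the channel axis (CBAM-style) so that the resulting maps have shape B×1×H×W and can be concatenated with $M_p$; a Sigmoid is applied after the final convolution; and the spatial kernel size is chosen freely by the caller. The function and kernel names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, k):
    """Stride-1 'same' convolution. x: (B, C_in, H, W); k: (C_out, C_in, kh, kw), odd sizes."""
    ph, pw = k.shape[2] // 2, k.shape[3] // 2
    xp = np.pad(x, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    win = sliding_window_view(xp, k.shape[2:], axis=(2, 3))  # (B, C_in, H, W, kh, kw)
    return np.einsum("bchwij,ocij->bohw", win, k)

def spatial_color_attention(x1, k_prior, k_spatial, use_prior=True):
    """Return X_out = X1 * M_s for x1 of shape (B, C, H, W).

    k_prior:   (1, 1, 3, 3) kernel producing the color prior weights M_p
    k_spatial: (1, 3, k, k) kernel producing the spatial attention weights M_s
    """
    r = x1[:, 0:1]                                    # channel 0, treated as the red channel
    m_p = conv2d_same(r, k_prior) if use_prior else np.zeros_like(r)
    m_avg = x1.mean(axis=1, keepdims=True)            # assumed: pooled across channels
    m_max = x1.max(axis=1, keepdims=True)
    m_cat = np.concatenate([m_avg, m_max, m_p], axis=1)           # (B, 3, H, W)
    m_s = 1.0 / (1.0 + np.exp(-conv2d_same(m_cat, k_spatial)))    # assumed Sigmoid
    return x1 * m_s                                   # broadcast over channels
```

Setting `use_prior=False` reproduces the zero-tensor branch of Step 2-4, in which only the two pooled maps influence $M_s$.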

[0095] A multi-level attention fusion strategy is adopted: CPSCA modules with shared parameters are embedded in the shallow features F_low and the deep features F_deep of the StarNet backbone network, performing background noise suppression and local texture enhancement on the shallow features and semantic information fusion on the deep features. The enhanced shallow attention features F_low^att and deep attention features F_deep^att are then merged by a subsequent fusion module to obtain multi-scale fusion features;

[0096] In this step, to fully leverage the role of the attention mechanism in features at different levels, this paper designs a multi-level attention fusion strategy. This strategy embeds CPSCA modules into the shallow and deep feature maps of the backbone, respectively. Shallow attention is mainly used to suppress background interference and enhance the recognition of local edges and color textures; while deep attention emphasizes semantic information fusion, improving the robustness of multi-scale object detection. Unlike the full-layer insertion approach, this strategy only enables the attention module at key levels and adopts a shared parameter strategy to ensure controllable computational overhead. Experiments show that this strategy significantly improves detection accuracy while maintaining a lightweight model, especially performing better in occlusion and distant fruit detection tasks. The steps of the multi-level attention fusion strategy are as follows:

[0097] Step 3-1: Shallow Feature Selection

[0098] Select the shallow feature F_low of the backbone as input and apply the CPSCA module to suppress background noise and enhance local texture and edge features.

[0099] Step 3-2: Deep Feature Selection

[0100] Select the deep semantic feature F_deep and apply the CPSCA module with shared weights again to promote cross-scale semantic fusion and strengthen high-level discrimination of complex targets.

[0101] Step 3-3: Sharing Attention Module Parameters

[0102] The shallow and deep CPSCA modules adopt a shared parameter strategy to reduce parameter redundancy and improve the model's generalization ability.

[0103] Step 3-4: Multi-layer fusion output

[0104] The CPSCA-enhanced features of the shallow and deep layers are output as F_low^att and F_deep^att, respectively, and a more robust multi-scale representation is obtained through the subsequent fusion module.

[0105] Step 3-5: Computational overhead control

[0106] By selectively embedding attention modules and sharing parameters, the computational load remains controllable, balancing performance and efficiency.

[0107] First, the spatial location (x, y) of each feature in the multi-scale fused features is converted into polar coordinates (r, θ), where r is the radial distance and θ is the angle. Then, sine and cosine encoding functions are applied to r and θ respectively to generate the position encoding vectors PE_r and PE_θ. After PE_r and PE_θ are fused, the result is added to the input feature to obtain the rotation-aware enhanced feature F_rpe;

[0108] The rotation-aware enhanced feature F_rpe is input to the detection head to generate rotated bounding box predictions, achieving high-precision detection of targets in any pose;

[0109] In this step, the position encoding method significantly impacts the model's orientation sensitivity in the rotating target detection task. This paper proposes a rotation-aware position encoding module (RPE) that encodes angle and radius information in polar coordinates into the spatial feature representation, enhancing the model's ability to perceive target orientation and arrangement. This module introduces angle embedding vectors and relative orientation position information, fusing them with the feature map to generate spatially enhanced features with rotational equivariance, thereby achieving higher localization accuracy in angle prediction. The RPE module features a pluggable design, making it suitable for any detection framework containing angle regression, such as YOLOv8-OBB, exhibiting good versatility and model compatibility. The steps of the rotation-aware position encoding module are as follows:

[0110] Step 4-1: Polar coordinate transformation

[0111] The input feature spatial location (x, y) is converted to polar coordinates (r, θ), where r is the radial distance and θ is the angle, which captures spatial information related to rotation.

[0112] Step 4-2: Position encoding generation

[0113] Applying sine and cosine encoding functions to the radial distance r and the angle θ in polar coordinates respectively generates the position encoding vectors PE_r and PE_θ.

[0114] Step 4-3: Encoding Fusion

[0115] PE_r and PE_θ are fused with the input feature F_in to form the rotation-aware enhanced feature F_rpe. The fusion is performed by addition or concatenation.

[0116] Step 4-4: Angle Prediction Assistance

[0117] The rotation-aware feature F_rpe is input to the rotated bounding box prediction head to improve the accuracy of angle regression.
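The polar coordinate transformation of Step 4-1 can be illustrated with a short sketch. The defining formulas are not reproduced in this text, so the sketch uses the standard definitions r = sqrt(dx² + dy²) and θ = atan2(dy, dx) and assumes that the offsets dx, dy are measured from the center of the feature map; the function name `polar_grid` is hypothetical.

```python
import math

def polar_grid(height, width):
    """Return (r, theta) maps for an H x W grid, measured from the grid center.

    The choice of origin (the feature-map center) is an assumption.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    r = [[math.hypot(x - cx, y - cy) for x in range(width)]
         for y in range(height)]
    theta = [[math.atan2(y - cy, x - cx) for x in range(width)]
             for y in range(height)]
    return r, theta

# A 3 x 3 grid: the center has radius 0, its right-hand neighbor has radius 1
# and angle 0, and the neighbor below it has angle pi/2 (y grows downward).
r, theta = polar_grid(3, 3)
```

Because θ changes by the rotation angle when the grid content rotates about the origin while r stays fixed, the pair (r, θ) separates rotation from distance, which is the property the encoding step relies on.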

[0118] In the above embodiments, preferably, the StarNet backbone network includes N=4 feature extraction stages, and the number of blocks in each stage is set as follows:

[0119] Stage 1: Contains 2 blocks, used to extract shallow edge and color texture features;

[0120] Stage 2: Contains 3 blocks to enhance the expression of mid-to-low-level structural features;

[0121] Stage 3: Contains 5 blocks for extracting high-level semantic features;

[0122] Stage 4: Contains 2 blocks, used to form the final deep semantic representation;

[0123] In each block, the convolution kernels W1 and W2 of the two 1×1 convolution branches are both 1×1 in size. The depthwise separable convolution DSConv uses a 7×7 convolution kernel for its depthwise convolution part and a 1×1 convolution kernel for its pointwise convolution part. The convolution parameters of W1, W2 and DSConv are all initialized using the He initialization method. The probability p of the DropPath increases linearly with the network depth. The p of shallow blocks is 0 or close to 0, and the p of deep blocks does not exceed 0.1-0.2.
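The linearly increasing DropPath probability can be sketched as a per-block schedule over the 2 + 3 + 5 + 2 = 12 blocks. The function name and the endpoint `max_rate=0.1` are illustrative assumptions chosen within the range stated above.

```python
def drop_path_schedule(depths=(2, 3, 5, 2), max_rate=0.1):
    """Return per-stage lists of DropPath probabilities, linear in block index."""
    total = sum(depths)
    # Block 0 gets probability 0; the last block gets max_rate.
    rates = [max_rate * i / (total - 1) for i in range(total)]
    stages, start = [], 0
    for depth in depths:
        stages.append(rates[start:start + depth])
        start += depth
    return stages

stages = drop_path_schedule()
flat = [p for stage in stages for p in stage]
```

The first block is never dropped, which keeps shallow edge and color features stable, while the deepest blocks receive the strongest regularization.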

[0124] In this embodiment, a step-by-step feature extraction from shallow details to deep semantics is achieved through the combination of four feature extraction stages and a varying number of differential blocks. This ensures the complete capture of basic features such as edges and textures while enhancing the expressive power of high-level semantic features, thus adapting to the detection needs of dense targets and varying scales in agricultural scenarios. The combination of 1×1 convolutional kernels and 7×7 depth convolutions expands the receptive field and improves feature modeling capabilities while effectively reducing computational overhead. The He initialization method ensures gradient stability in the early stages of deep network training, accelerating model convergence. The DropPath probability design, which increases with network depth, protects the stable learning of shallow basic features while avoiding model overfitting through appropriate deep regularization, thus balancing the completeness of feature extraction with the model's generalization ability.

[0125] In any of the above embodiments, preferably, in the CPSCA module, the shared fully connected layer for channel attention adopts a two-layer structure. The number of neurons in the first fully connected layer is C / r, where C is the number of input feature channels and r is the channel compression ratio. The number of neurons in the second fully connected layer is C, and the channel compression ratio r is 8 or 16. The ReLU activation function is used after the first fully connected layer, and the Sigmoid activation function is used after the second fully connected layer. When calculating the color prior weight Mp, a 3×3 convolution kernel is used for the extracted target color channel features, with a stride of 1 and a padding method of "same".

[0126] In this embodiment, the two fully connected layers and the reasonable channel compression ratio design effectively control the parameter scale and avoid computational redundancy while realizing nonlinear modeling of channel features. The combination of ReLU and Sigmoid activation functions not only ensures the nonlinear expressive power of channel features, but also generates normalized attention weights in the 0-1 interval, accurately strengthening effective channels and suppressing ineffective channels. The combination of 3×3 convolutional kernels and same padding fully captures local color distribution features without changing the feature map spatial resolution. Combined with the explicit introduction of color prior channels, the model responds more accurately to fruit color features, significantly improving the target discrimination ability under complex lighting and background interference, and enhancing detection robustness.

[0127] In any of the above embodiments, preferably, the subsequent fusion module of the multi-level attention fusion strategy has the following specific structure: first, the shallow attention feature F_low^att is upsampled so that its spatial size is consistent with that of the deep attention feature F_deep^att; then the aligned F_low^att and F_deep^att are concatenated along the channel dimension to form the fused feature F_cat; finally, a 1×1 convolution is applied to F_cat to perform channel compression and information recombination, yielding the final fused feature F_fusion. The weight parameters of the 1×1 convolution are learned adaptively through end-to-end training.
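The fusion module (spatial alignment, channel concatenation, 1×1 convolution) can be sketched on nested lists. The nearest-neighbor resize and the fixed weight matrix are assumptions for illustration; in the described method the interpolation mode is not specified and the 1×1 weights are learned.

```python
def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbor resize of a C x H x W nested list to out_h x out_w."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[ch[i * h // out_h][j * w // out_w] for j in range(out_w)]
             for i in range(out_h)] for ch in feat]

def fuse(f_att_low, f_att_deep, weight):
    """weight: C_out x (C_low + C_deep) matrix acting as a 1x1 convolution."""
    h, w = len(f_att_deep[0]), len(f_att_deep[0][0])
    # Align spatial sizes, then concatenate along the channel dimension.
    f_cat = resize_nearest(f_att_low, h, w) + f_att_deep
    # A 1x1 convolution is a per-location linear map over channels.
    return [[[sum(wk * f_cat[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weight]

# Hypothetical shapes: one 2 x 2 shallow channel, one 4 x 4 deep channel,
# compressed to a single output channel with weights (1, 1).
low = [[[1, 2], [3, 4]]]
deep = [[[0] * 4 for _ in range(4)]]
f_fusion = fuse(low, deep, weight=[[1.0, 1.0]])
```

Concatenation doubles the channel count here, and the 1×1 convolution brings it back down while letting training decide how much each level contributes.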

[0128] In this embodiment, the upsampling operation aligns the spatial sizes of the shallow and deep features, laying the foundation for cross-level information fusion. Concatenation along the channel dimension directly integrates shallow local details (such as edges and textures) with deep global semantics (such as overall contours and category features), avoiding the information limitations of single-level features. The 1×1 convolution with channel compression and adaptively learned weights not only eliminates the dimensional redundancy introduced by feature concatenation but also automatically adjusts the contributions of shallow and deep features through training, so that the fused features better match the needs of the detection task, significantly improving the model's detection accuracy for small targets and occluded areas and strengthening its adaptability to multi-scale targets.

[0129] In any of the above embodiments, preferably, the specific expression of the sine and cosine coding functions is:

[0130] The position encoding for radial distance r is:

[0131] [Formula not reproduced in the source text];

[0132] The position encoding for angle θ is:

[0133] [Formula not reproduced in the source text];

[0134] Where k is the encoding dimension index and d is the dimension of the position encoding vector. PE_r and PE_θ are fused by concatenation along the channel dimension, and the concatenated position encoding feature PE is fused with the input feature by addition, i.e., F_rpe = F_in + PE, where PE denotes the position encoding feature generated from the radial distance and angle encodings.
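One plausible form of the encoding and the additive fusion is sketched below. Because the exact expressions are not reproduced in this text, the sketch assumes the widely used Transformer-style form, sin(v / base^(2k/d)) at index 2k and cos(v / base^(2k/d)) at index 2k+1, with `base = 10000` as an assumed constant; the function names are hypothetical.

```python
import math

def sincos_encode(value, d, base=10000.0):
    """Encode a scalar into d values: sin at index 2k, cos at index 2k + 1."""
    pe = []
    for k in range(d // 2):
        freq = 1.0 / (base ** (2 * k / d))
        pe.extend([math.sin(value * freq), math.cos(value * freq)])
    return pe

def rotation_aware_feature(f_in, r, theta):
    """f_in: feature vector at one location; its length must be divisible by 4.

    PE_r and PE_theta each fill half of the channels, are concatenated along
    the channel dimension, and are added to the input: F_rpe = F_in + PE.
    """
    half = len(f_in) // 2
    pe = sincos_encode(r, half) + sincos_encode(theta, half)
    return [f + p for f, p in zip(f_in, pe)]

# At the origin (r = 0, theta = 0) every sine term is 0 and every cosine is 1.
f_rpe = rotation_aware_feature([0.0] * 8, r=0.0, theta=0.0)
```

Since the addition leaves the channel count unchanged, the module can be placed in front of an existing angle regression branch without altering downstream layer shapes.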

[0135] In this embodiment, the sine and cosine encoding functions model radial distance and angle information using trigonometric functions of different frequencies, accurately capturing the spatial position and rotational pose of the target. Furthermore, the encoding results exhibit periodicity and translation invariance, adapting to scenarios with arbitrary fruit pose distributions. Concatenating PE_r and PE_θ along the channel dimension fully preserves the spatial information of radial distance (the distance between the target and the center) and angle (the direction of target rotation). The additive fusion method efficiently injects the rotation-aware position encoding without destroying the original semantic information of the input features, which significantly improves the model's sensitivity to changes in target rotation, greatly reduces the localization error of the rotated bounding box, and improves the detection accuracy for fruits in any pose.

[0136] In any of the above embodiments, preferably, the network is further optimized using a multi-task joint loss function, the overall loss function being expressed as:

[0137] $L = \lambda_{cls} L_{cls} + \lambda_{reg} L_{reg} + \lambda_{ang} L_{ang}$;

[0138] Where: L_cls is the classification loss, for which cross-entropy loss or FocalLoss is used; L_reg is the rotated bounding box position regression loss, for which an IoU-based loss function is used, including but not limited to RotatedIoULoss or GIoULoss; L_ang is the angle regression loss, for which SmoothL1Loss or a periodicity-aware angle loss function is used to reduce angle prediction error; the weighting coefficients λ_cls, λ_reg, and λ_ang are set empirically through experiments or determined by automatic adjustment methods.
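The weighted sum and one possible periodicity-aware angle loss can be sketched as follows. The wrap period of π (a rectangle coincides with itself after a 180° turn), the Smooth L1 threshold `beta`, and all function names and numeric values are assumptions for illustration; the classification and IoU-based terms are passed in as precomputed numbers.

```python
import math

def smooth_l1(diff, beta=1.0):
    """Quadratic near zero, linear for |diff| >= beta."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def periodic_angle_loss(pred, target, period=math.pi):
    """Smooth L1 on the angle difference wrapped into [-period/2, period/2)."""
    diff = (pred - target + period / 2) % period - period / 2
    return smooth_l1(diff)

def total_loss(l_cls, l_reg, l_ang, lam_cls=1.0, lam_reg=1.0, lam_ang=1.0):
    """L = lam_cls * L_cls + lam_reg * L_reg + lam_ang * L_ang."""
    return lam_cls * l_cls + lam_reg * l_reg + lam_ang * l_ang

# A prediction that is off by exactly pi describes the same box orientation,
# so the wrapped angle loss is (numerically) zero.
l_ang = periodic_angle_loss(0.1, 0.1 + math.pi)
loss = total_loss(l_cls=0.7, l_reg=0.4, l_ang=l_ang, lam_ang=0.5)
```

Wrapping the difference before applying Smooth L1 avoids the large spurious penalty that a plain difference would assign at the boundary of the angle range.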

[0139] In this embodiment, the joint optimization of classification loss, rotation bounding box regression loss, and angle regression loss achieves synergistic improvement in target category judgment, location localization, and rotation angle prediction, avoiding performance imbalance caused by single-task optimization. FocalLoss effectively solves the class imbalance problem, IoU-based regression loss is more sensitive to bounding box position errors, and SmoothL1Loss or periodic-aware angle loss can reduce the impact of outliers in angle prediction. The targeted design of different loss functions improves the optimization effect of each task. Adjustable weight coefficients allow the model to flexibly adapt to the needs of actual scenarios (such as focusing on classification accuracy or localization accuracy), further improving the overall detection performance of the model in complex agricultural scenarios.

[0140] In any of the above embodiments, preferably, the data augmentation methods used during training include: geometric augmentation: random rotation of ±45°, random scaling by a factor of 0.8 to 1.2, and random translation; color augmentation: random adjustment of brightness, contrast, and saturation by ±20% of the original value; random horizontal flipping; and Mosaic or MixUp blending augmentation. For every augmentation operation, the rotated bounding box labels are adjusted synchronously to keep them consistent with the transformed image.
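The synchronized label adjustment can be sketched for two of the geometric operations. The box parameterization (cx, cy, w, h, θ in radians), rotation about the image center, and the function names are assumptions; the sketch presumes the image itself is transformed with the same rotation matrix in the same pixel coordinate system.

```python
import math

def rotate_obb(box, angle_deg, img_w, img_h):
    """Rotate an oriented box (cx, cy, w, h, theta) about the image center."""
    cx, cy, w, h, theta = box
    a = math.radians(angle_deg)
    ox, oy = img_w / 2.0, img_h / 2.0
    dx, dy = cx - ox, cy - oy
    new_cx = ox + dx * math.cos(a) - dy * math.sin(a)
    new_cy = oy + dx * math.sin(a) + dy * math.cos(a)
    # Width and height are unchanged; the box angle shifts by the same amount.
    return (new_cx, new_cy, w, h, theta + a)

def hflip_obb(box, img_w):
    """Mirror an oriented box horizontally; the angle changes sign."""
    cx, cy, w, h, theta = box
    return (img_w - cx, cy, w, h, -theta)

# Hypothetical 200 x 200 image with one box to the right of the center.
rotated = rotate_obb((150.0, 100.0, 20.0, 10.0, 0.0), 90.0, 200, 200)
flipped = hflip_obb((150.0, 100.0, 20.0, 10.0, 0.3), 200)
```

A random rotation within ±45° would call `rotate_obb` with an angle drawn from that range and apply the identical angle to the image.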

[0141] In this embodiment, geometric augmentation effectively expands the diversity of target pose, scale, and position in the training data, allowing the model to adapt to arbitrary arrangements and pose changes of fruits in natural scenes; color augmentation improves the model's adaptability to complex lighting conditions and reduces the interference of environmental factors such as strong light and shadow on color feature discrimination; Mosaic or MixUp augmentation further enriches the data distribution and strengthens the model's ability to detect small and dense targets; synchronous adjustment of the rotating bounding box annotations ensures the accuracy of the labels after data augmentation, ensuring the effectiveness of model training, and ultimately significantly improves the model's generalization ability and robustness, enabling it to stably perform detection performance in real agricultural scenarios.

[0142] Another embodiment of the target detection system based on color prior attention and rotation-aware position encoding according to the present invention includes:

[0143] The input initialization module is configured to receive an RGB image I_RGB ∈ R^{H×W×3} as the basic data for feature extraction in target fruit detection.

[0144] The SNet'-YOLO backbone network module is configured to input the RGB image into the optimized StarNet' backbone network. Through Stem layers, N feature extraction stages, and cyclic processing of multiple blocks within each stage, it outputs a set of multi-scale features {F1, F2, ..., FN}. The StarNet' adopts a block structure with adjustable width, residual branches, and depthwise separable convolutions. Each block achieves feature transformation through depthwise separable convolutions and two 1×1 convolution branches, and combines DropPath regularization and residual connections to output updated features.

[0145] In this step, RPE is integrated with the YOLOv8-OBB framework in a modular manner. The RPE module is inserted before the angle regression branch in the detection head to enhance the ability of features to express rotational information.

[0146] In the YOLOv8-OBB framework, feature maps are output from the backbone network and feature fusion network, and then fed into the detection head for classification, position regression, and angle prediction. This invention introduces an RPE module at the input feature Fin of the angle prediction branch to perform rotation-aware enhancement on this feature, obtaining the enhanced feature Frpe, which is then input into the angle regression layer. This integration method does not require changing the structure of the original classification branch and center position regression branch; it only locally enhances the angle prediction path, ensuring the compatibility and stability of the overall model structure.

[0147] The color prior spatial-channel attention module is configured to first compute the channel attention weights M_c of the multi-scale features through global average pooling, max pooling, and a shared fully connected layer, and to channel-weight the features to obtain X1. Then, based on the color prior, the color channel features corresponding to the target are extracted and combined with the global average pooling feature M_avg and the max pooling feature M_max of X1 to generate the spatial attention weights M_s; spatially weighting X1 yields the enhanced feature X_out;

[0148] In this step, the main adjustments made when adapting to different target categories include:

[0149] Color channel selection: For red apples or ripe peppers, the red channel in the RGB color space is preferred; for green apples or unripe fruits, the green channel can be used; under complex lighting conditions, the image can also be converted to the HSV or Lab color space, and the H channel or a* channel can be selected as the color prior input.

[0150] Color prior convolution parameter adjustment: Based on the difference between the target scale and color distribution, the color prior convolution kernel size is adjusted within the range of 1×1 to 5×5.

[0151] With the above adjustments, effective detection of prominent targets of different colors can be achieved without changing the overall structure of the CPSCA module.
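The color channel selection can be sketched with the standard library. The function name and the `mode` labels are hypothetical; only the R, G, and HSV hue options are covered, and the Lab a* option is omitted because it requires a longer conversion.

```python
import colorsys

def color_prior_channel(pixels, mode="R"):
    """pixels: H x W list of (r, g, b) floats in [0, 1]; returns one H x W map."""
    if mode in ("R", "G"):
        idx = 0 if mode == "R" else 1
        return [[p[idx] for p in row] for row in pixels]
    if mode == "H":
        # colorsys returns hue in [0, 1): 0 is red, 1/3 is green.
        return [[colorsys.rgb_to_hsv(*p)[0] for p in row] for row in pixels]
    raise ValueError("mode must be 'R', 'G', or 'H'")

# One red and one green pixel.
image = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]]
red_map = color_prior_channel(image, "R")
hue_map = color_prior_channel(image, "H")
```

Only the map fed to the prior convolution changes with the target category; the rest of the attention module stays the same.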

[0152] The multi-level attention fusion strategy module is configured to embed CPSCA modules with shared parameters in the shallow features F_low and the deep features F_deep of the StarNet backbone network, performing background noise suppression and local texture enhancement on the shallow features and semantic information fusion on the deep features. The enhanced shallow attention features F_low^att and deep attention features F_deep^att are then merged by a subsequent fusion module to obtain multi-scale fusion features;

[0153] The rotation-aware position encoding module is configured to first convert the spatial location (x, y) of each feature in the multi-scale fused features into polar coordinates (r, θ), where r is the radial distance and θ is the angle. Then, sine and cosine encoding functions are applied to r and θ respectively to generate the position encoding vectors PE_r and PE_θ. After PE_r and PE_θ are fused, the result is added to the input feature to obtain the rotation-aware enhanced feature F_rpe. The rotation-aware enhanced feature F_rpe is input to the detection head to generate rotated bounding box predictions, achieving high-precision detection of targets in any pose.

[0154] Performance data of existing technologies in specific indicators such as detection accuracy, positioning error, and computational load.

[0155] The performance of existing mainstream rotating target detection networks on agricultural fruit datasets is typically as follows: detection accuracy (mAP@0.5) is approximately 82%–86%; the average localization error of the rotating bounding box angle is approximately 6°–12°, and the error increases further under conditions of drastic changes in target pose or partial occlusion; the number of model parameters is typically between 3 and 4M, and the inference computation is approximately 14–20 GFLOPs, which limits their deployment on embedded or edge devices.

[0156] After preprocessing and filtering, a total of 1508 image samples were obtained. To facilitate model training and evaluation, the dataset was divided into training set, validation set and test set in a ratio of 7:2:1.
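A 7:2:1 split of the 1508 samples can be sketched as below. The exact subset sizes are not stated in the source; flooring the training and validation counts and assigning the remainder to the test set is an assumption, as are the function name and the fixed seed.

```python
import random

def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle reproducibly and split into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test

# Under this rounding rule, 1508 samples give 1055 / 301 / 152 images.
train, val, test = split_dataset(range(1508))
```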

[0157] This dataset, collected from a research greenhouse, contains cherry tomato fruits of varying shapes and ripeness. The detection task focuses on identifying tomatoes at different ripeness levels and under different environmental conditions, without considering specific genetic or varietal differences. The collected images cover a variety of challenging environmental conditions, such as light intensity, time of day, background clutter, and ripeness, providing diverse scenarios to evaluate the robustness of the model.

[0158] Image annotations were completed using the roLabelImg tool, with annotations presented as oriented bounding boxes. Each annotation file follows the YOLO format, is saved in txt format, and includes the class label and the coordinates of the four bounding box corners.
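Reading one annotation line can be sketched as follows. The token order `class x1 y1 x2 y2 x3 y3 x4 y4` and the assumption that corners are listed consecutively around the box are not confirmed by the source, which states only that each file holds class labels and four corner coordinates; the function names are hypothetical.

```python
import math

def parse_obb_line(line):
    """Parse 'class x1 y1 x2 y2 x3 y3 x4 y4' into (class_id, corner list)."""
    parts = line.split()
    class_id = int(parts[0])
    values = [float(v) for v in parts[1:9]]
    corners = list(zip(values[0::2], values[1::2]))
    return class_id, corners

def corners_to_xywha(corners):
    """Convert four consecutive corners to (cx, cy, w, h, theta in radians)."""
    (x1, y1), (x2, y2), (x3, y3), _ = corners
    cx = sum(x for x, _ in corners) / 4.0
    cy = sum(y for _, y in corners) / 4.0
    w = math.hypot(x2 - x1, y2 - y1)      # length of the first edge
    h = math.hypot(x3 - x2, y3 - y2)      # length of the adjacent edge
    theta = math.atan2(y2 - y1, x2 - x1)  # direction of the first edge
    return cx, cy, w, h, theta

class_id, corners = parse_obb_line("0 0 0 4 0 4 2 0 2")
box = corners_to_xywha(corners)
```

The center-size-angle form is convenient for angle regression, while the corner form is what the annotation files store.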

[0159] In the process of designing the technical solution of this invention, other potential technical paths were analyzed and compared, including different types of attention mechanisms and position encoding methods.

[0160] Regarding attention mechanisms, although self-attention and non-local attention can model long-range dependencies, their computational complexity increases with the square of the feature size, making them unsuitable for real-time detection tasks in high-resolution agricultural scenarios. Furthermore, methods that only employ channel attention or spatial attention cannot simultaneously balance color saliency and spatial focusing capabilities.

[0161] In contrast, the CPSCA module proposed in this invention, by introducing color prior information, explicitly enhances the discrimination ability of fruit regions while maintaining low computational complexity, making it more suitable for agricultural target detection tasks with obvious color features.

[0162] Regarding position encoding methods, traditional absolute position encoding based on Cartesian coordinates is difficult to effectively model the rotation characteristics of a target, while implicit modeling methods based on angle regression are unstable when the attitude changes significantly. This invention adopts a rotation-aware position encoding method based on polar coordinates, explicitly introducing angle and radial information into the feature representation. This improves the orientation awareness capability while avoiding the computational burden caused by complex group equivariant networks.

[0163] Therefore, considering factors such as overall detection accuracy, computational efficiency, and engineering feasibility, the technical approach adopted in this invention is more practical and has comprehensive advantages.

[0164] The specific implementation parameters for tomato detection include hyperparameters such as input image size, learning rate, and batch size. The input image size is 1280×720, and the hyperparameters are shown in the table below:

[0165]
Parameter | Configuration
Number of training epochs | 300
Image size | 640×640
Batch size | 64
Optimizer | AdamW
Learning rate | 0.002
Momentum | 0.9
Weight decay | 0.0005
Number of warm-up epochs | 3.0
Warm-up momentum | 0.8

[0166] Based on the method shown in Figure 1 and Figure 2, this application accordingly also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the target detection method based on color prior attention and rotation-aware position encoding of any of the above embodiments.

[0167] Based on this understanding, the technical solution of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as CD-ROM, USB flash drive, mobile hard drive, etc.) and includes several instructions to cause a computer device (such as personal computer, server, or network device, etc.) to execute the methods of various implementation scenarios of this application.

[0168] Based on the method shown in Figure 1 and Figure 2 and the virtual device embodiment shown in Figure 3, and to achieve the above objectives, this application also provides a computer device comprising a storage medium and a processor; the storage medium is used to store a computer program; the processor is used to execute the computer program to implement the steps of the target detection method based on color prior attention and rotation-aware position encoding of any of the above embodiments.

[0169] Optionally, the computer device may also include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a Wi-Fi module, etc. The user interface may include a display screen, input units such as a keyboard, etc., and optional user interfaces may also include USB ports, card reader ports, etc. The network interface may optionally include standard wired interfaces, wireless interfaces (such as Bluetooth interfaces, Wi-Fi interfaces), etc.

[0170] Those skilled in the art will understand that the computer device structure provided in this embodiment does not constitute a limitation on the computer device, and may include more or fewer components, or combine certain components, or have different component arrangements.

[0171] The storage medium may also include an operating system and a network communication module. The operating system is a program that manages and stores the hardware and software resources of a computer device, supporting the operation of information processing programs and other software and / or programs. The network communication module is used to enable communication between the various components within the storage medium, as well as communication with other hardware and software within the physical device.

[0172] In this invention, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance; the term "multiple" refers to two or more unless otherwise explicitly defined. The terms "install," "connect," "link," and "fix" should be interpreted broadly. For example, "connect" can be a fixed connection, a detachable connection, or an integral connection; "link" can be a direct connection or an indirect connection through an intermediate medium. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0173] In the description of this invention, it should be understood that the terms "upper," "lower," "left," "right," "front," "rear," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are only for the convenience of describing this invention and simplifying the description, and do not indicate or imply that the device or unit referred to must have a specific orientation or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on this invention.

[0174] In the description of this specification, the terms "one embodiment," "some embodiments," "specific embodiment," etc., refer to a specific feature, structure, material, or characteristic described in connection with that embodiment or example, which is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0175] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A target detection method based on color prior attention and rotation-aware position encoding, characterized in that it comprises: receiving an RGB image I_RGB ∈ R^{H×W×3} as the basic data for feature extraction in target fruit detection; inputting the RGB image into an optimized StarNet' backbone network and outputting a set of multi-scale features {F1, F2, …, FN} through a Stem layer, N feature extraction stages, and loop processing of multiple blocks in each stage, wherein the StarNet' backbone network adopts a block structure with adjustable width, residual branches, and depthwise separable convolution, and each block performs feature transformation through a depthwise separable convolution and two 1×1 convolution branches and outputs updated features in combination with DropPath regularization and a residual connection; for the multi-scale features, first computing the channel attention weights M_c through global average pooling, max pooling, and a shared fully connected layer and channel-weighting the features to obtain X1, then extracting the color channel features corresponding to the target according to the color prior and combining them with the global average pooling feature M_avg and the max pooling feature M_max of X1 to generate the spatial attention weights M_s, and spatially weighting X1 to obtain the enhanced features X_out; adopting a multi-level attention fusion strategy in which CPSCA modules with shared parameters are embedded in the shallow features F_low and the deep features F_deep respectively, suppressing background noise and enhancing local texture for the shallow features and fusing semantic information for the deep features, and then merging the enhanced shallow attention features F_low^att and deep attention features F_deep^att through a subsequent fusion module to obtain multi-scale fusion features; converting the spatial location (x, y) of each feature in the multi-scale fused features into polar coordinates (r, θ), where r is the radial distance and θ is the angle, then generating position encoding vectors PE_r and PE_θ for r and θ respectively using sine and cosine encoding functions, fusing PE_r and PE_θ, and adding the result to the input feature to obtain the rotation-aware enhanced feature F_rpe; and inputting the rotation-aware enhanced feature F_rpe into the detection head to generate rotated bounding box predictions, achieving high-precision detection of targets in any pose.

2. The target detection method according to claim 1, characterized in that, The StarNet backbone network comprises N=4 feature extraction stages, with the number of blocks in each stage set as follows: Stage 1: Contains 2 blocks, used to extract shallow edge and color texture features; Stage 2: Contains 3 blocks to enhance the expression of mid-to-low-level structural features; Stage 3: Contains 5 blocks for extracting high-level semantic features; Stage 4: Contains 2 blocks, used to form the final deep semantic representation; In each block, the convolution kernels W1 and W2 of the two 1×1 convolution branches are both 1×1 in size. The depthwise separable convolution DSConv uses a 7×7 convolution kernel for its depthwise convolution part and a 1×1 convolution kernel for its pointwise convolution part. The convolution parameters of W1, W2 and DSConv are all initialized using the He initialization method. The probability p of the DropPath increases linearly with the network depth.

3. The target detection method according to claim 1, characterized in that, in the CPSCA module, the shared fully connected layer for channel attention adopts a two-layer structure. The first fully connected layer has C / r neurons, where C is the number of input feature channels and r is the channel compression ratio; the second fully connected layer has C neurons; and the channel compression ratio r is either 8 or 16. A ReLU activation function is used after the first fully connected layer, and a Sigmoid activation function is used after the second fully connected layer. When calculating the color prior weight M_p, a 3×3 convolution with a stride of 1 and "same" padding is applied to the extracted target color channel features.
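The channel attention branch described in claim 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the two pooled paths are summed before the Sigmoid (a CBAM-style choice the claim does not spell out), the weights are random placeholders rather than trained values, and the function names are hypothetical.

```python
import math
import random

def shared_mlp(v, w1, w2):
    """Shared two-layer bottleneck C -> C/r (ReLU) -> C; returns pre-Sigmoid values."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feat, w1, w2):
    """feat: list of C channels, each an H x W nested list. Returns (M_c, X1)."""
    flat = [[p for row in ch for p in row] for ch in feat]
    avg = [sum(c) / len(c) for c in flat]          # global average pooling
    mx = [max(c) for c in flat]                    # global max pooling
    a, m = shared_mlp(avg, w1, w2), shared_mlp(mx, w1, w2)
    # Assumption: the two paths are summed, then passed through the Sigmoid.
    m_c = [1.0 / (1.0 + math.exp(-(x + y))) for x, y in zip(a, m)]
    x1 = [[[p * g for p in row] for row in ch] for ch, g in zip(feat, m_c)]
    return m_c, x1

random.seed(0)
C, r, H, W = 16, 8, 4, 4                           # r = 8 is one of the claimed ratios
w1 = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.gauss(0, 0.1) for _ in range(C // r)] for _ in range(C)]
feat = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
m_c, x1 = channel_attention(feat, w1, w2)
```

Here the first layer has C / r = 2 neurons and the second has C = 16, matching the layer widths stated in the claim.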

4. The target detection method according to claim 1, characterized in that the specific structure of the subsequent fusion module of the multi-level attention fusion strategy is as follows: first, the shallow attention features F_att^low are upsampled so that their spatial size is consistent with that of the deep attention features F_att^deep; then the aligned F_att^low and F_att^deep are concatenated along the channel dimension to form the fused feature F_cat; finally, a 1×1 convolution is applied to F_cat to perform channel compression and information recombination, yielding the final fused feature F_fusion. The weight parameters of the 1×1 convolution are adaptively learned through end-to-end training.
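The three steps of claim 4 (resize, channel concatenation, 1×1 convolution) can be sketched with nested lists. Nearest-neighbour interpolation, the toy tensor sizes, and the fixed weights are assumptions for illustration only; the claim does not name the interpolation mode, and in the claimed method the 1×1 weights are learned.

```python
def resize_nearest(feat, out_h, out_w):
    """Nearest-neighbour resize of a C x H x W nested list to C x out_h x out_w."""
    in_h, in_w = len(feat[0]), len(feat[0][0])
    return [[[ch[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
             for i in range(out_h)] for ch in feat]

def conv1x1(feat, weight):
    """1x1 convolution: weight is C_out x C_in, applied independently per pixel."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[[sum(wk * feat[k][i][j] for k, wk in enumerate(row))
              for j in range(w)] for i in range(h)] for row in weight]

def fuse(f_att_low, f_att_deep, weight):
    """Align F_att^low to F_att^deep, concatenate on channels, compress with 1x1 conv."""
    h, w = len(f_att_deep[0]), len(f_att_deep[0][0])
    f_cat = resize_nearest(f_att_low, h, w) + f_att_deep   # channel concatenation
    return conv1x1(f_cat, weight)

f_low = [[[1.0, 2.0], [3.0, 4.0]]]                 # toy 1 x 2 x 2 feature
f_deep = [[[10.0] * 4 for _ in range(4)]]          # toy 1 x 4 x 4 feature
weight = [[1.0, 0.5]]                              # 2 input channels -> 1 output channel
f_fusion = fuse(f_low, f_deep, weight)
```

The helper only looks at the target size, so the same sketch applies whichever of the two feature maps happens to be spatially larger in a given network.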

5. The target detection method according to claim 1, characterized in that the sine and cosine encoding functions define, for the radial distance r and for the angle θ respectively, a sine term and a cosine term per encoding dimension, where k is the encoding dimension index and d is the dimension of the position encoding vector; PE_r and PE_θ are fused by concatenation along the channel dimension, and the concatenated position encoding feature PE is fused with the input features by addition, i.e., F_rpe = F_in + PE, where PE denotes the position encoding feature generated from the radial distance and angle encodings.
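A minimal sketch of the rotation-aware position encoding in claim 5 follows. Several details are assumptions, not taken from the claim text: the frequency form sin(v / 10000^(2k/d)), cos(v / 10000^(2k/d)) is borrowed from the standard Transformer encoding, the polar origin is placed at the feature-map centre, θ is computed with `atan2`, and PE_r and PE_θ each take half of the channels so that the concatenated PE matches F_in.

```python
import math

def sincos(value, dim, base=10000.0):
    """Interleaved sin/cos code of even length dim for a scalar value (assumed form)."""
    code = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        code += [math.sin(value * freq), math.cos(value * freq)]
    return code

def rotation_aware_pe(f_in):
    """f_in: C x H x W nested list, C divisible by 4. Returns F_rpe = F_in + PE."""
    c, h, w = len(f_in), len(f_in[0]), len(f_in[0][0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0          # assumed origin: map centre
    out = [[[0.0] * w for _ in range(h)] for _ in range(c)]
    for i in range(h):
        for j in range(w):
            x, y = j - cx, i - cy
            r, theta = math.hypot(x, y), math.atan2(y, x)
            pe = sincos(r, c // 2) + sincos(theta, c // 2)   # channel concatenation
            for k in range(c):
                out[k][i][j] = f_in[k][i][j] + pe[k]          # additive fusion
    return out

f_in = [[[0.0] * 3 for _ in range(3)] for _ in range(4)]     # toy 4 x 3 x 3 input
f_rpe = rotation_aware_pe(f_in)
```

Because θ enters the encoding directly, two positions at the same radius but different angles receive different codes in the PE_θ channels and identical codes in the PE_r channels.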

6. The target detection method according to claim 1, characterized in that it further comprises optimizing the network using a multi-task joint loss function, the overall loss function being expressed as: L = λ_cls·L_cls + λ_reg·L_reg + λ_ang·L_ang; where L_cls is the classification loss, for which cross-entropy loss or Focal Loss is used; L_reg is the rotated bounding box position regression loss, for which an IoU-based loss function is used, including but not limited to Rotated IoU Loss or GIoU Loss; L_ang is the angle regression loss, for which Smooth L1 Loss or a periodicity-aware angle loss function is used to reduce angle prediction error; and the weighting coefficients λ_cls, λ_reg, and λ_ang are set empirically through experiments or determined by an automatic adjustment method.
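The weighted sum in claim 6 and one possible periodicity-aware angle term can be sketched as below. The wrap period of π, the Smooth L1 threshold `beta`, and the default weights of 1.0 are assumptions for illustration; the claim leaves the exact angle loss and the λ values open.

```python
import math

def smooth_l1(d, beta=1.0):
    """Smooth L1 on a scalar difference: quadratic below beta, linear above."""
    d = abs(d)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def angle_loss(pred, target, period=math.pi):
    """Smooth L1 on the angle difference wrapped into [-period/2, period/2)."""
    diff = (pred - target + period / 2) % period - period / 2
    return smooth_l1(diff)

def total_loss(l_cls, l_reg, l_ang, lam_cls=1.0, lam_reg=1.0, lam_ang=1.0):
    """L = lam_cls * L_cls + lam_reg * L_reg + lam_ang * L_ang."""
    return lam_cls * l_cls + lam_reg * l_reg + lam_ang * l_ang

# A prediction just below pi and a target of 0 differ by only 0.05 rad
# once the assumed pi-periodicity of a rotated box is taken into account.
l_ang = angle_loss(math.pi - 0.05, 0.0)
loss = total_loss(l_cls=0.7, l_reg=0.4, l_ang=l_ang)
```

Wrapping the difference before applying Smooth L1 avoids penalizing predictions that sit on the other side of the angular boundary but describe nearly the same box.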

7. The target detection method according to claim 1, characterized in that the data augmentation techniques employed during training include: geometric augmentation, comprising random rotation within ±45°, random scaling by a ratio of 0.8 to 1.2, and random translation; color augmentation, comprising random adjustment of brightness, contrast, and saturation within ±20% of the original value; random horizontal flipping; and Mosaic or MixUp blending augmentation. Every augmentation operation is applied synchronously to the rotated bounding box annotations so that images and annotations remain consistent.
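Keeping rotated box annotations consistent with a random image rotation, as required by claim 7, can be sketched as follows. The box format (cx, cy, w, h, angle in degrees), rotation about the image centre, and wrapping the angle into [-90, 90) are assumptions for illustration; the image itself must be rotated with the same transform and sign convention.

```python
import math
import random

def rotate_box(box, delta_deg, img_w, img_h):
    """Rotate a (cx, cy, w, h, angle_deg) box about the image centre by delta_deg."""
    cx, cy, w, h, angle = box
    a = math.radians(delta_deg)
    ox, oy = img_w / 2.0, img_h / 2.0
    dx, dy = cx - ox, cy - oy
    new_cx = ox + math.cos(a) * dx - math.sin(a) * dy
    new_cy = oy + math.sin(a) * dx + math.cos(a) * dy
    new_angle = (angle + delta_deg + 90.0) % 180.0 - 90.0   # assumed range [-90, 90)
    return (new_cx, new_cy, w, h, new_angle)

def sample_geometric_params(rng):
    """Draw rotation and scale from the ranges stated in claim 7."""
    return {"rotation_deg": rng.uniform(-45.0, 45.0),
            "scale": rng.uniform(0.8, 1.2)}

rng = random.Random(0)
params = sample_geometric_params(rng)
box = (150.0, 100.0, 40.0, 20.0, 10.0)
rotated = rotate_box(box, params["rotation_deg"], img_w=200, img_h=200)
```

Width and height are unchanged by a pure rotation; a scaling step would multiply cx, cy, w, and h by the sampled scale in the same synchronized way.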

8. The target detection method according to claim 1, characterized in that, in each Block, the input features, namely the features of the (j-1)-th Block within the i-th feature extraction stage, are passed through a depthwise separable convolution and two 1×1 convolution branches to generate the feature transformation y, where * denotes the convolution operation, DSConv denotes depthwise separable convolution, and W1 and W2 are the convolution kernel parameters; during the training phase, DropPath regularization is either applied to or skipped for the feature transformation result y; finally, the features are updated through the residual connection output.
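The DropPath and residual update at the end of each Block in claim 8 can be sketched as below. The branch output y is taken as given, since its formula is not reproduced in the claim text; the rescaling by 1 / (1 - p) when the branch is kept is the common stochastic-depth convention and is an assumption here, as are the function names.

```python
import random

def drop_path(y, p, training, rng):
    """Stochastic depth for one sample's branch output y (flat list of floats)."""
    if not training or p == 0.0:
        return list(y)                     # inference: branch always kept
    if rng.random() < p:
        return [0.0] * len(y)              # branch skipped for this sample
    return [v / (1.0 - p) for v in y]      # assumed rescale to keep the expectation

def block_update(x, y, p, training, rng):
    """Residual update: x_out = x + DropPath(y)."""
    return [a + b for a, b in zip(x, drop_path(y, p, training, rng))]

rng = random.Random(0)
x = [1.0, 2.0]                             # input features of the Block
y = [0.5, 0.5]                             # feature transformation (placeholder values)
x_eval = block_update(x, y, p=0.3, training=False, rng=rng)
x_train = block_update(x, y, p=0.3, training=True, rng=rng)
```

When the branch is dropped, the Block reduces to the identity on x, which is what allows deeper Blocks to use larger p without blocking the signal path.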

9. A target detection system based on color prior attention and rotation-aware position encoding, characterized in that it comprises: an input initialization module, configured to receive an RGB image I_RGB ∈ R^(H×W×3) as the base data for feature extraction in target fruit detection; an SNet'-YOLO backbone network module, configured to input the RGB image into the optimized StarNet' backbone network and to output a set of multi-scale features {F1, F2, …, FN} through a Stem layer, N feature extraction stages, and iterative processing by multiple Blocks within each stage, wherein the StarNet' backbone network adopts a block structure with adjustable width, residual branches, and depthwise separable convolution, and each Block performs feature transformation through a depthwise separable convolution and two 1×1 convolution branches and outputs updated features by combining DropPath regularization with a residual connection; a color prior spatial-channel attention module, configured to first compute a channel attention weight M_c for the multi-scale features through global average pooling, max pooling, and a shared fully connected layer, and to channel-weight the features to obtain X1, and then to extract the color channel features corresponding to the target according to the color prior, combine them with the global average pooling feature M_avg and the max pooling feature M_max of X1 to generate a spatial attention weight M_s, and spatially weight X1 to obtain the enhanced feature X_out; a multi-level attention fusion strategy module, configured to embed a CPSCA module with shared parameters at both the shallow features F_low and the deep features F_deep of the StarNet' backbone network, so that background noise is suppressed and local texture is enhanced for the shallow features while semantic information is fused for the deep features, and then to merge the enhanced shallow attention features F_att^low and deep attention features F_att^deep through a subsequent fusion module to obtain multi-scale fused features; and a rotation-aware position encoding module, configured to convert each feature spatial position (x, y) in the multi-scale fused features into polar coordinates (r, θ), where r is the radial distance and θ is the angle, to generate position encoding vectors PE_r and PE_θ for r and θ respectively using sine and cosine encoding functions, to fuse PE_r and PE_θ and add the result to the input features to obtain the rotation-aware enhanced feature F_rpe, and to input the rotation-aware enhanced feature F_rpe into a detection head to generate rotated bounding box predictions, achieving high-precision detection of targets in any pose.

10. A computer device, characterized in that it comprises a storage medium and a processor; the storage medium is configured to store a computer program; and the processor is configured to execute the computer program to implement the steps of the target detection method based on color prior attention and rotation-aware position encoding according to any one of claims 1 to 8.