An RFEF-YOLO target detection method
This RFEF-YOLO-based target detection method constructs special preliminary effective feature layers and a feature inference fusion network, and combines them with grayscale preprocessing and sample association loss optimization. It addresses the accuracy and speed problems of target detection on embedded platforms and achieves efficient target recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGSU UNIV OF SCI & TECH
- Filing Date
- 2022-09-06
- Publication Date
- 2026-06-16
AI Technical Summary
On embedded platforms, existing target detection algorithms suffer from low accuracy and high complexity, resulting in low detection speed and inability to meet real-time requirements. Furthermore, lightweight networks are prone to false detections and missed detections.
We adopt an object detection method based on RFEF-YOLO. By constructing a special preliminary effective layer and feature inference fusion network, we combine grayscale preprocessing, backbone feature extraction and feature inference fusion, introduce sample association loss and optimize regression loss to improve detection accuracy and speed.
It improves the accuracy and speed of target detection: compared with YOLOv3, the average detection accuracy increases by nearly 5%, the network model size decreases by approximately 25%, and the detection speed increases by approximately 40%.
Smart Images

Figure CN115588112B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of target detection technology, and more specifically to a target detection method based on RFEF-YOLO.
Background Technology
[0002] In recent years, efficient and accurate detection has played a crucial role in industrial production. The purpose of object detection is to classify and locate targets of interest in a given image. However, autonomous intelligent agent platforms typically lack high computing power, so high-precision object detection on them faces challenges such as massive data volumes and low detection speed, which limits research in this area. High-precision deep learning object detection algorithms usually have excessively deep networks, large parameter counts, and high computational demands, making them difficult to deploy on most embedded platforms with limited computing power and unable to meet real-time requirements. Conversely, overly lightweight networks are prone to false positives and false negatives. Current mainstream object detection algorithms typically do not consider possible relationships between the detected objects and process each image region separately. As a result, the model relies solely on high-quality convolutional features for detection and cannot fully utilize global contextual features. In some special scenarios, such as when the detection device is moving at high speed, high-precision recognition therefore becomes virtually impossible. Furthermore, excessively deep networks introduce complexity and a large number of parameters, making them unsuitable for real-time detection on intelligent platforms with limited computing power.
Summary of the Invention
[0003] To address the shortcomings of existing technologies, this invention provides an RFEF-YOLO-based target detection method to solve the problems of low accuracy and high complexity in target detection on embedded platforms.
[0004] This invention provides a target detection method based on RFEF-YOLO, the steps of which are as follows:
[0005] Step 1: Acquire an image containing the target to be detected, and perform grayscale preprocessing on the acquired image;
[0006] Step 2: Extract the backbone features from the preprocessed image using a backbone feature extraction network to obtain three preliminary effective feature layers;
[0007] Step 3: Use the three preliminary effective feature layers as input to the feature inference fusion network to extract three effective features;
[0008] Step 4: Use the three effective features as input to the YOLO Head to obtain the object detection results.
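The four steps above can be read as a simple composition of stages. The following is a minimal sketch of that data flow only; the function name `detect` and the four callables passed to it are hypothetical placeholders for illustration, not components named in the patent.

```python
def detect(image, preprocess, backbone, fusion, head):
    """Chain the four steps; each argument after `image` is a caller-supplied callable."""
    gray = preprocess(image)            # Step 1: grayscale preprocessing
    b1, b2, b3 = backbone(gray)         # Step 2: three preliminary effective feature layers
    e1, e2, e3 = fusion(b1, b2, b3)     # Step 3: three effective features
    return head((e1, e2, e3))           # Step 4: YOLO Head produces the detections


if __name__ == "__main__":
    # Stub stages that only record the order in which they were applied.
    result = detect(
        "img",
        preprocess=lambda x: x + ">gray",
        backbone=lambda x: (x + ">B1", x + ">B2", x + ">B3"),
        fusion=lambda b1, b2, b3: (b1 + ">E1", b2 + ">E2", b3 + ">E3"),
        head=lambda feats: list(feats),
    )
    print(result)
```

Passing the stages in as callables keeps the sketch independent of any particular framework.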
[0009] Furthermore, in step 2, the features extracted in the last three stages of the backbone feature extraction network are used as three preliminary effective feature layers.
[0010] Furthermore, in step 3, the feature reasoning fusion network extracts three effective features using the following formula:
[0011]
[0012]
[0013]
[0014]
[0015]
[0016] where $B_1$, $B_2$, and $B_3$ are the three preliminary effective feature layers; $B_{1\mathrm{Upsample}}$, $B_{2\mathrm{Upsample}}$, and $B_{3\mathrm{Upsample}}$ are obtained by upsampling the corresponding preliminary effective feature layer; $B_{1\mathrm{Downsample}}$, $B_{2\mathrm{Downsample}}$, and $B_{3\mathrm{Downsample}}$ are obtained by downsampling the corresponding preliminary effective feature layer; τ*τConv represents a convolution of size τ*τ; $\mu_1$ is the weight of $B_{1\mathrm{Upsample}}$; $\mu_2$ is the weight of $B_2$; $\mu_3$ is the weight of $B_{3\mathrm{Downsample}}$; $\varepsilon_1$ is the weight of $C$; $\varepsilon_2$ is the weight of $B_{2\mathrm{shortcut}}$; $B_{2\mathrm{shortcut}}$ is obtained from $B_2$ through a cross-connection shortcut operation; $\alpha_1$ is the weight of $B_1$; $\alpha_2$ is the weight of $D_{\mathrm{Downsample}}$; $\alpha_3$ is the weight of $E_{2\mathrm{Downsample}}$; $D_{\mathrm{Downsample}}$ is obtained by downsampling $D$; $E_{2\mathrm{Downsample}}$ is obtained by downsampling $E_2$; $\beta_1$ is the weight of $D$; $\beta_2$ is the weight of $E_{3\mathrm{Downsample}}$; $E_{3\mathrm{Downsample}}$ is obtained by downsampling $E_3$; $\gamma_1$ is the weight of $D_{\mathrm{Upsample}}$; $\gamma_2$ is the weight of $B_3$; $D_{\mathrm{Upsample}}$ is obtained by upsampling $D$.
[0017] Furthermore, the τ*τConv is 1*1Conv.
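The symbol definitions above pair each weight with one feature and name a single τ*τ convolution. One reading consistent with those definitions, offered only as a hedged sketch, is a weighted sum followed by the convolution. Both the placement of the convolution and the use of addition rather than channel concatenation are assumptions; the text itself fixes only which weight belongs to which feature.

```latex
\begin{aligned}
C   &= \mathrm{Conv}_{\tau\times\tau}\!\left(\mu_1 B_{1\mathrm{Upsample}} + \mu_2 B_2 + \mu_3 B_{3\mathrm{Downsample}}\right) \\
D   &= \mathrm{Conv}_{\tau\times\tau}\!\left(\varepsilon_1 C + \varepsilon_2 B_{2\mathrm{shortcut}}\right) \\
E_1 &= \mathrm{Conv}_{\tau\times\tau}\!\left(\alpha_1 B_1 + \alpha_2 D_{\mathrm{Downsample}} + \alpha_3 E_{2\mathrm{Downsample}}\right) \\
E_2 &= \mathrm{Conv}_{\tau\times\tau}\!\left(\beta_1 D + \beta_2 E_{3\mathrm{Downsample}}\right) \\
E_3 &= \mathrm{Conv}_{\tau\times\tau}\!\left(\gamma_1 D_{\mathrm{Upsample}} + \gamma_2 B_3\right)
\end{aligned}
```

Under this reading the evaluation order is C, D, E3, E2, E1, because E2 depends on the downsampled E3 and E1 depends on the downsampled E2.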
[0018] Furthermore, in step 3, the feature reasoning fusion network adds a weighted sample association loss on top of the regression loss. The specific formula for the sample association loss is as follows:
[0019] $\mathrm{Loss}_{RL}(P_j, P_{j+1}) = -\left[(P_j - P_{j-1})\log(P_j) + (P_{j+1} - P_j)\log(P_{j+1})\right]$
[0020] where $P_{j-1}$, $P_j$, and $P_{j+1}$ are the probabilities that the (j-1)th, jth, and (j+1)th samples belong to the correct category, respectively.
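The sample association loss formula above can be evaluated directly for three consecutive probabilities. The sketch below is illustrative only: the function name `relation_loss` and the clamping constant `eps` (used to avoid taking the logarithm of zero) are assumptions and do not appear in the patent.

```python
import math


def relation_loss(p_prev, p_cur, p_next, eps=1e-12):
    """Loss_RL = -[(P_j - P_{j-1}) log(P_j) + (P_{j+1} - P_j) log(P_{j+1})].

    p_prev, p_cur, p_next are the correct-class probabilities of samples
    j-1, j, and j+1. `eps` is an assumed safeguard against log(0).
    """
    term_cur = (p_cur - p_prev) * math.log(max(p_cur, eps))
    term_next = (p_next - p_cur) * math.log(max(p_next, eps))
    return -(term_cur + term_next)


if __name__ == "__main__":
    # Equal probabilities give zero loss, since both differences vanish.
    print(relation_loss(0.5, 0.5, 0.5))
    # Rising probabilities give a positive loss, because log(P) is negative.
    print(relation_loss(0.2, 0.5, 0.8))
```

Note that the differences can be negative when the probabilities decrease from one sample to the next, so this term on its own is not guaranteed to be non-negative.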
[0021] Furthermore, the loss function of the feature reasoning fusion network is:
[0022] $\mathrm{Loss} = \lambda_1 \mathrm{Loss}_{GIoU} + \lambda_2 \mathrm{Loss}_{RL} + \lambda_3 \mathrm{Loss}_{vl}$
[0023] Where λ1 is the weight coefficient of regression loss; λ2 is the weight coefficient of sample association loss; and λ3 is the weight coefficient of classification loss.
[0024] The beneficial effects of this invention are:
[0025] This invention constructs special preliminary effective feature layers from which the corresponding effective features are obtained, improving both the accuracy and the speed of target detection. In addition, this invention introduces a weighted sample association loss (RL) on top of the regression loss GIoU Loss, so that the losses of preceding and following samples are correlated, resulting in faster convergence and a smoother loss curve during network training. Compared with existing YOLOv3 training, this invention improves the average detection accuracy by nearly 5%, reduces the network model size by approximately 25%, and improves the detection speed by approximately 40%.
Attached Figure Description
[0026] The features and advantages of the invention will be more clearly understood by referring to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way. In the drawings:
[0027] Figure 1 is an overall block diagram of the method in a specific embodiment of the present invention;
[0028] Figure 2 is a network structure diagram of the method in a specific embodiment of the present invention;
[0029] Figure 3 is a structural diagram of the EfficientNet backbone in a specific embodiment of the present invention;
[0030] Figure 4 is a flowchart of the mobile inverted bottleneck convolution (MBConv) in a specific embodiment of the present invention;
[0031] Figure 5 is a diagram of the RFN network structure in this invention;
[0032] Figure 6 is a comparison chart of the experimental results of the GIoU loss and the loss model of the present invention on the same test set in a specific embodiment of the present invention;
[0033] Figure 7 is a comparison chart of the average detection accuracy of the YOLOv3 model and the model of this invention on the same test set in a specific embodiment of the present invention;
[0034] Figure 8 is a real-time detection effect diagram of a specific embodiment of the present invention.
Detailed Implementation
[0035] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0036] The present invention will be further illustrated below with reference to specific embodiments. Those skilled in the art should understand that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Modifications to the present invention in various equivalent forms all fall within the scope defined by the appended claims.
[0037] This invention provides a target detection method based on RFEF-YOLO, the steps of which are as follows:
[0038] Step 1: Acquire image data through the image acquisition device of the intelligent platform and perform grayscale preprocessing on it. Classify and label part of the image data to form the training set, which is fed into the RFEF-YOLO network for training, and use the remaining image data as the test set.
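Step 1 can be illustrated with a small, dependency-free sketch. The patent does not state how the grayscale value is computed or how the data is divided, so the ITU-R BT.601 luma weights (0.299, 0.587, 0.114), the 80/20 split, and the function names `to_grayscale` and `split_dataset` are all assumptions made for illustration.

```python
def to_grayscale(image):
    """Convert an image given as rows of (R, G, B) tuples (0-255) to rows of gray values.

    Assumption: BT.601 luma weights; the patent does not specify the conversion.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


def split_dataset(items, train_fraction=0.8):
    """Split items into (train, test) by position; the 0.8 fraction is an assumed default."""
    n_train = round(len(items) * train_fraction)
    return items[:n_train], items[n_train:]


if __name__ == "__main__":
    rgb = [[(255, 255, 255), (0, 0, 0)], [(255, 0, 0), (0, 0, 255)]]
    print(to_grayscale(rgb))
    train, test = split_dataset(list(range(10)))
    print(len(train), len(test))
```

Converting to a single channel reduces the input data volume by a factor of three, which is relevant to the limited computing power discussed in the background section.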
[0039] Step 2: Use the EfficientNet backbone shown in Figure 3 as the backbone feature extraction network to extract three preliminary effective feature layers.
[0040] The EfficientNet backbone network is used for image feature extraction, obtaining three preliminary effective feature layers, denoted as B1, B2, and B3. Taking a 224*224*3 image as an example, the backbone feature extraction is implemented as follows:
Stage 1: Batch normalization and Swish activation are performed sequentially on the input 224*224*3 image to obtain the result of the first stage;
Stage 2: One mobile inverted bottleneck convolution (MBConv) is performed on the 112*112*32 feature map output from the previous stage;
Stage 3: Two MBConv operations are performed on the 112*112*16 feature map output from the previous stage;
Stage 4: Two MBConv operations are performed on the 56*56*24 feature map output from the previous stage;
Stage 5: Three MBConv operations are performed on the 28*28*40 feature map output from the previous stage;
Stage 6: Three MBConv operations are performed on the 14*14*80 feature map output from the previous stage;
Stage 7: Four MBConv operations are performed on the 14*14*112 feature map output from the previous stage;
Stage 8: One MBConv operation is performed on the 7*7*192 feature map output from the previous stage.
The features extracted in stages 6, 7, and 8 are used as the preliminary effective feature layers. The MBConv process is shown in Figure 4.
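The stage description above can be tabulated to see which feature map shapes the last three stages produce. In this sketch the input shapes and block counts are taken from the text; the output shape of stage 8 (7*7*320) is not stated there and is assumed from the published EfficientNet-B0 configuration. The names `STAGES`, `STAGE8_OUTPUT`, `stage_outputs`, and `preliminary_layers` are hypothetical.

```python
# (stage number, input shape (H, W, C), number of MBConv blocks), as listed in the text.
STAGES = [
    (2, (112, 112, 32), 1),
    (3, (112, 112, 16), 2),
    (4, (56, 56, 24), 2),
    (5, (28, 28, 40), 3),
    (6, (14, 14, 80), 3),
    (7, (14, 14, 112), 4),
    (8, (7, 7, 192), 1),
]

# Assumption: not given in the text; taken from the EfficientNet-B0 configuration.
STAGE8_OUTPUT = (7, 7, 320)


def stage_outputs():
    """Map each stage to its output shape, which is the next stage's input shape."""
    outputs = {}
    for idx, (stage, _in_shape, _blocks) in enumerate(STAGES):
        if idx + 1 < len(STAGES):
            outputs[stage] = STAGES[idx + 1][1]
        else:
            outputs[stage] = STAGE8_OUTPUT
    return outputs


def preliminary_layers():
    """Return the output shapes of stages 6, 7, and 8 (the last three stages)."""
    outs = stage_outputs()
    return [outs[stage] for stage in (6, 7, 8)]


if __name__ == "__main__":
    print(preliminary_layers())
```

The sketch deliberately does not say which of these three outputs is B1, B2, or B3, since the text does not fix that assignment.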
[0041] Step 3: Use the three preliminary effective feature layers output from Step 2 as input to the RFN (Reasoning and FusionNet) feature reasoning fusion network to extract the final three effective features $E_n$, n = 1, 2, 3. The specific process is as follows:
[0042] Features $B_2$, $B_{1\mathrm{Upsample}}$, and $B_{3\mathrm{Downsample}}$ are stacked to obtain feature C, where $B_{1\mathrm{Upsample}}$ is obtained by upsampling B1 and $B_{3\mathrm{Downsample}}$ is obtained by downsampling B3; feature C and B2 are stacked through a cross-connection to obtain feature D; the downsampled feature D, the downsampled feature E2, and feature B1 are stacked to obtain E1; the downsampled feature E3 is stacked with feature D to obtain E2; the upsampled feature D is stacked with feature B3 to obtain E3, as shown in Figure 5. The RFN network obtains the three effective features using the following formulas:
[0043]
[0044]
[0045]
[0046]
[0047]
[0048] where $B_1$, $B_2$, and $B_3$ are the three preliminary effective feature layers; $B_{1\mathrm{Upsample}}$, $B_{2\mathrm{Upsample}}$, and $B_{3\mathrm{Upsample}}$ are obtained by upsampling the corresponding preliminary effective feature layer; $B_{1\mathrm{Downsample}}$, $B_{2\mathrm{Downsample}}$, and $B_{3\mathrm{Downsample}}$ are obtained by downsampling the corresponding preliminary effective feature layer; τ*τConv represents a convolution of size τ*τ; $\mu_1$ is the weight of $B_{1\mathrm{Upsample}}$; $\mu_2$ is the weight of $B_2$; $\mu_3$ is the weight of $B_{3\mathrm{Downsample}}$; $\varepsilon_1$ is the weight of $C$; $\varepsilon_2$ is the weight of $B_{2\mathrm{shortcut}}$; $B_{2\mathrm{shortcut}}$ is obtained from $B_2$ through a cross-connection shortcut operation; $\alpha_1$ is the weight of $B_1$; $\alpha_2$ is the weight of $D_{\mathrm{Downsample}}$; $\alpha_3$ is the weight of $E_{2\mathrm{Downsample}}$; $D_{\mathrm{Downsample}}$ is obtained by downsampling $D$; $E_{2\mathrm{Downsample}}$ is obtained by downsampling $E_2$; $\beta_1$ is the weight of $D$; $\beta_2$ is the weight of $E_{3\mathrm{Downsample}}$; $E_{3\mathrm{Downsample}}$ is obtained by downsampling $E_3$; $\gamma_1$ is the weight of $D_{\mathrm{Upsample}}$; $\gamma_2$ is the weight of $B_3$; $D_{\mathrm{Upsample}}$ is obtained by upsampling $D$. In a preferred embodiment of the present invention, τ*τConv is 1*1Conv.
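The data flow of the RFN described in Step 3 can be sketched on toy single-channel feature maps. This is an illustration under several stated assumptions, not the patented network: stacking is modelled as a weighted element-wise sum, the 1*1 convolution is omitted, the shortcut is treated as the identity, upsampling is nearest-neighbour by a factor of 2, downsampling keeps every second element, and the three inputs differ in resolution by exact factors of 2. All function names and default weights are hypothetical.

```python
def upsample(x):
    """Nearest-neighbour 2x upsampling of a 2-D list."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def downsample(x):
    """2x downsampling by keeping every second row and column."""
    return [row[::2] for row in x[::2]]


def weighted_sum(weights, maps):
    """Element-wise weighted sum of equally sized 2-D lists."""
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, maps)) for j in range(width)]
        for i in range(height)
    ]


def rfn_fuse(b1, b2, b3,
             mu=(1 / 3, 1 / 3, 1 / 3), eps=(0.5, 0.5),
             alpha=(1 / 3, 1 / 3, 1 / 3), beta=(0.5, 0.5), gamma=(0.5, 0.5)):
    """Compute E1, E2, E3 from B1 (coarsest), B2, and B3 (finest)."""
    c = weighted_sum(mu, [upsample(b1), b2, downsample(b3)])
    d = weighted_sum(eps, [c, b2])                       # B2 shortcut taken as identity
    e3 = weighted_sum(gamma, [upsample(d), b3])
    e2 = weighted_sum(beta, [d, downsample(e3)])
    e1 = weighted_sum(alpha, [b1, downsample(d), downsample(e2)])
    return e1, e2, e3


if __name__ == "__main__":
    b1 = [[1.0] * 2 for _ in range(2)]
    b2 = [[1.0] * 4 for _ in range(4)]
    b3 = [[1.0] * 8 for _ in range(8)]
    e1, e2, e3 = rfn_fuse(b1, b2, b3)
    print(len(e1), len(e2), len(e3))
```

E3 is computed before E2 and E1 because E2 uses the downsampled E3 and E1 uses the downsampled E2. Each output keeps the resolution of the input it is paired with.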
[0049] To combine the above models effectively when evaluating detection results, this invention uses a weighted multi-loss calculation. Leveraging the characteristics of the RFN structure, such as reusing convolutions across connections and fully utilizing stacked features, it introduces a weighted sample association loss (RL) on top of the regression loss GIoU Loss. This correlates the losses of preceding and following samples, which speeds up training, enables rapid model convergence, and yields a fast and smooth decrease in loss. The total loss of this invention can be expressed as: $\mathrm{Loss}_{total} = \lambda_1 \mathrm{Loss}_{GIoU} + \lambda_2 \mathrm{Loss}_{RL} + \lambda_3 \mathrm{Loss}_{vl}$, where $\lambda_1$ is the weight coefficient of the regression loss, $\lambda_2$ is the weight coefficient of the RL loss, and $\lambda_3$ is the weight coefficient of the classification loss. The custom sample association loss RL (Relation Loss) is defined as follows:
[0050] $\mathrm{Loss}_{RL}(P_j, P_{j+1}) = -\left[(P_j - P_{j-1})\log(P_j) + (P_{j+1} - P_j)\log(P_{j+1})\right]$; where $P_{j-1}$, $P_j$, and $P_{j+1}$ are the probabilities that the (j-1)th, jth, and (j+1)th samples belong to the correct category, respectively.
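The weighted combination of the three loss terms can be written as a one-line function. The sketch below is illustrative; the function name `total_loss` and the default coefficients of 1.0 are assumptions, since the text does not give values for λ1, λ2, and λ3.

```python
def total_loss(loss_giou, loss_rl, loss_vl, lambdas=(1.0, 1.0, 1.0)):
    """Loss_total = λ1·Loss_GIoU + λ2·Loss_RL + λ3·Loss_vl.

    `lambdas` holds (λ1, λ2, λ3); the defaults are placeholders, not patent values.
    """
    lam1, lam2, lam3 = lambdas
    return lam1 * loss_giou + lam2 * loss_rl + lam3 * loss_vl


if __name__ == "__main__":
    # Setting λ2 to zero recovers a loss without the sample association term.
    print(total_loss(0.5, 0.2, 0.3))
    print(total_loss(0.5, 0.2, 0.3, lambdas=(1.0, 0.0, 1.0)))
```

Setting λ2 to zero removes the sample association term, which is one way to reproduce a GIoU-only baseline for comparison.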
[0051] The loss is calculated as follows:
[0052] To calculate GIoU for two bounding boxes A and B, we can find their minimum convex set (the smallest bounding box enclosing A and B), C. Once we have the minimum convex set, we can calculate GIoU.
[0053]
[0054] $\mathrm{Loss}_{GIoU} = 1 - \mathrm{GIoU}$
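As an illustration of the GIoU computation described above, the sketch below uses the standard definition GIoU = IoU − (|C| − |A ∪ B|) / |C| for axis-aligned boxes, where C is the smallest box enclosing A and B. The box format (x1, y1, x2, y2), the function names, and the value returned for degenerate boxes are assumptions made for this example.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)

    # Intersection rectangle (zero area if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest enclosing box C.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0.0 or area_c <= 0.0:
        return 0.0  # assumed convention for degenerate (zero-area) input

    iou = inter / union
    return iou - (area_c - union) / area_c


def giou_loss(box_a, box_b):
    """Loss_GIoU = 1 - GIoU."""
    return 1.0 - giou(box_a, box_b)


if __name__ == "__main__":
    print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))   # identical boxes
    print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))   # disjoint boxes
```

Unlike plain IoU, the enclosing-box term keeps the loss informative for non-overlapping boxes: the further apart they are, the larger the loss.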
[0055] The calculation formula for $\mathrm{Loss}_{vl}$ is as follows:
[0056]
[0057] where N represents the total number of samples; $y_{n,l}$ indicates whether the nth sample has the lth label; and $p_{n,l}$ is the predicted probability that the nth sample has the lth label.
[0058] The experimental comparison between the GIoU loss and the loss model of this invention is shown in Figure 6. The experimental results show that using this loss model makes the network converge faster and the loss curve smoother during training.
[0059] Step 4: Input the three effective features into the YOLO Head to obtain the object detection results, as shown in Figure 8.
[0060] The specific embodiments of this invention demonstrate excellent performance in experiments on a physical intelligent hardware platform that uses the Jetson Nano as the computing unit, with the same dataset. A comparison of the average detection accuracy of the model of this invention with that of YOLOv3 training is shown in Figure 7. The real-time detection results after porting to the actual embedded platform are shown in Figure 8. As shown in Table 1 below, the experimental results demonstrate that, on the same Jetson Nano platform, the method of this invention outperforms YOLOv3 in both average detection accuracy and prediction speed.
[0061]
[0062] Table 1
[0063] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.
Claims
1. A target detection method based on RFEF-YOLO, characterized in that it includes the following steps:
Step 1: Acquire an image containing the target to be detected, and perform grayscale preprocessing on the acquired image;
Step 2: Extract the backbone features from the preprocessed image using a backbone feature extraction network to obtain three preliminary effective feature layers;
Step 3: Use the three preliminary effective feature layers as input to the feature inference fusion network to extract three effective features. The feature inference fusion network adds a weighted sample association loss on top of the regression loss. The specific formula for the sample association loss is as follows: $\mathrm{Loss}_{RL}(P_j, P_{j+1}) = -\left[(P_j - P_{j-1})\log(P_j) + (P_{j+1} - P_j)\log(P_{j+1})\right]$; where $P_{j-1}$, $P_j$, and $P_{j+1}$ are the probabilities that the (j-1)th, jth, and (j+1)th samples belong to the correct category, respectively. The loss function of the feature inference fusion network is: $\mathrm{Loss} = \lambda_1 \mathrm{Loss}_{GIoU} + \lambda_2 \mathrm{Loss}_{RL} + \lambda_3 \mathrm{Loss}_{vl}$; where $\lambda_1$ is the weight coefficient of the regression loss, $\lambda_2$ is the weight coefficient of the sample association loss, and $\lambda_3$ is the weight coefficient of the classification loss.
The feature inference fusion network extracts three effective features using the following formulas: ; ; ; ; ; where $B_1$, $B_2$, and $B_3$ are the three preliminary effective feature layers; $B_{1\mathrm{Upsample}}$, $B_{2\mathrm{Upsample}}$, and $B_{3\mathrm{Upsample}}$ are obtained by upsampling the corresponding preliminary effective feature layer; $B_{1\mathrm{Downsample}}$, $B_{2\mathrm{Downsample}}$, and $B_{3\mathrm{Downsample}}$ are obtained by downsampling the corresponding preliminary effective feature layer; τ*τConv denotes a convolution of size τ*τ; $\mu_1$ is the weight of $B_{1\mathrm{Upsample}}$; $\mu_2$ is the weight of $B_2$; $\mu_3$ is the weight of $B_{3\mathrm{Downsample}}$; $\varepsilon_1$ is the weight of $C$; $\varepsilon_2$ is the weight of $B_{2\mathrm{shortcut}}$; $B_{2\mathrm{shortcut}}$ is obtained from $B_2$ through a cross-connection shortcut operation; $\alpha_1$ is the weight of $B_1$; $\alpha_2$ is the weight of $D_{\mathrm{Downsample}}$; $\alpha_3$ is the weight of $E_{2\mathrm{Downsample}}$; $D_{\mathrm{Downsample}}$ is obtained by downsampling $D$; $E_{2\mathrm{Downsample}}$ is obtained by downsampling $E_2$; $\beta_1$ is the weight of $D$; $\beta_2$ is the weight of $E_{3\mathrm{Downsample}}$; $E_{3\mathrm{Downsample}}$ is obtained by downsampling $E_3$; $\gamma_1$ is the weight of $D_{\mathrm{Upsample}}$; $\gamma_2$ is the weight of $B_3$; $D_{\mathrm{Upsample}}$ is obtained by upsampling $D$;
Step 4: Use the three effective features as input to the YOLO Head to obtain the object detection results.
2. The target detection method based on RFEF-YOLO as described in claim 1, characterized in that, in step 2, the features extracted in the last three stages of the backbone feature extraction network are used as the three preliminary effective feature layers.
3. The target detection method based on RFEF-YOLO as described in claim 1, characterized in that the τ*τConv is 1*1Conv.