Construction method and detection method for a small target detection model based on Yolov5

By introducing the GAM attention mechanism and the Involution convolution operator into the YOLOv5 model, and adjusting the feature fusion method and anchor box size, the problems of high computational cost and noise interference in small object detection are solved, thereby improving detection accuracy and performance.

CN115331126B · Active · Publication Date: 2026-06-12 · JIANGNAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JIANGNAN UNIV
Filing Date
2022-08-30
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing multi-scale learning methods are computationally intensive in small object detection and struggle to avoid interference noise. Furthermore, they do not make full use of contextual information, which makes it difficult to improve detection performance.

Method used

A small object detection model based on Yolov5 is constructed, introducing the GAM attention mechanism and the Involution convolution operator, adjusting the feature fusion method of the head prediction module, and expanding the detection size of the anchor box.

Benefits of technology

It improves the accuracy of small target detection, reduces the amount of computation and parameters, and enhances the detection performance of small targets.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a method for constructing a small target detection model based on Yolov5 and a detection method that uses it. The method first improves the Yolov5 network model: it introduces an attention mechanism into the model, adopts a new convolution operator for some of the convolutions, adjusts the feature fusion mode, and expands the detection sizes of the anchor boxes. The model obtained through these improvements can be applied to target detection, effectively improving the accuracy of small target detection and the overall detection performance. Experiments show that the detection method based on this model clearly outperforms detection methods based on other models.

Description

Technical Field

[0001] This invention relates to the field of target detection technology, and in particular to a method for constructing and detecting small targets based on Yolov5.

Background Technology

[0002] Small object detection has always been a key focus and challenge in object detection. To overcome this problem, researchers have improved various network models adapted for small object detection, but shortcomings still exist. In real-world scenarios, due to the abundance of small objects, small object detection has broad application prospects, playing a crucial role in many fields such as autonomous driving, smart healthcare, defect detection, and aerial image analysis. In recent years, the rapid development of deep learning technology has injected new vitality into small object detection, making it a research hotspot.

[0003] The main challenges currently faced by small object detection are as follows: 1) few available features; 2) high localization accuracy requirements; 3) imbalanced samples. In response to these problems, some scholars have proposed improvements, such as expanding the training dataset and enriching its diversity through different data augmentation strategies, and improving small object detection performance through multi-scale learning, of which the image feature pyramid is a typical method.

[0004] However, due to the loss of spatial and detailed feature information, it is difficult to detect small targets in deep feature maps. In deep neural networks, shallow layers have smaller receptive fields, weaker semantic information, and less contextual information, but retain more spatial and detailed feature information. Based on this idea, Liu et al. proposed the multi-scale target detection algorithm SSD, which uses shallower feature maps to detect smaller targets and deeper feature maps to detect larger targets. Cai et al., addressing the problem that small targets carry limited information and are difficult to match with conventional networks, proposed a unified multi-scale deep convolutional neural network that uses deconvolutional layers to improve the resolution of feature maps, significantly improving the detection performance of small targets while reducing memory and computational costs.

[0005] In general, multi-scale feature fusion considers both shallow representational information and deep semantic information, which is beneficial for feature extraction of small targets and can effectively improve the detection performance of small targets. However, while improving detection performance, existing multi-scale learning methods also increase the computational cost, and it is difficult to avoid the influence of interference noise during feature fusion. These problems make it difficult to further improve the detection performance of small targets based on multi-scale learning.

[0006] In the real world, there is often a coexistence relationship between "object and scene" and "object and object," and utilizing this relationship can help improve the detection performance of small objects. Before deep learning, research had already demonstrated that proper modeling of context could improve object detection performance, especially for small objects with inconspicuous appearance features. With the widespread application of deep neural networks, some studies have attempted to integrate the context surrounding the object into deep neural networks, achieving some success, but without considering the potential lack of contextual information within the scene.

Summary of the Invention

[0007] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.

[0008] In view of the aforementioned existing problems, the present invention is proposed.

[0009] Therefore, one object of the present invention is to provide a method for constructing a small target detection model based on Yolov5.

[0010] To solve the above-mentioned technical problems, the present invention provides the following technical solution: including,

[0011] Construct a dataset that includes training samples and test samples;

[0012] Train a Yolov5-based network model using the dataset;

[0013] The Yolov5-based network model uses the Yolov5 network as its backbone model and includes a feature extraction module, a neck enhancement module, and a head prediction module.

[0014] The neck enhancement module introduces a GAM attention mechanism, and some convolutions employ new convolution operators.

[0015] Adjust the feature fusion method of the head prediction module and expand the detection size of the initial anchor box;

[0016] Output the target detection model after training.

[0017] As a preferred embodiment of the method for constructing a small target detection model based on Yolov5 according to the present invention, the Yolov5-based network model includes:

[0018] Feature extraction module: Responsible for extracting features from the target;

[0019] Neck enhancement module: Enhances the features extracted by the feature extraction module;

[0020] Head prediction module: performs target prediction and obtains detection results.

[0021] As a preferred embodiment of the method for constructing a small target detection model based on Yolov5 as described in this invention, the GAM attention mechanism includes a channel attention submodule and a spatial attention submodule.

[0022] As a preferred embodiment of the method for constructing a small target detection model based on Yolov5 as described in this invention, the channel attention submodule uses a three-dimensional arrangement to retain information in three dimensions and amplifies cross-dimensional channel-space dependencies through a two-layer multilayer perceptron (MLP).

[0023] The spatial attention submodule uses two convolutional layers to fuse spatial information and removes max pooling operations.

[0024] As a preferred embodiment of the method for constructing a small object detection model based on Yolov5 as described in this invention, the GAM attention mechanism can amplify global interaction features. Given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate state $F_2$ and the output $F_3$ are defined as follows:

[0025] $$F_2 = M_c(F_1) \otimes F_1$$

[0026] $$F_3 = M_s(F_2) \otimes F_2$$

[0027] Where $F_1$ represents the input state, $F_2$ represents the intermediate state, $F_3$ represents the output state, and $M_c$ and $M_s$ are the channel attention map and the spatial attention map, respectively. C, H, and W represent the number of channels, image height, and image width, respectively, and $\otimes$ denotes element-wise multiplication.

[0028] As a preferred embodiment of the method for constructing a small object detection model based on Yolov5 as described in this invention, the new convolution operator is a self-convolution operator (Involution), in which the corresponding kernel $\mathcal{H}_{i,j}$ is generated from a single pixel $x_{i,j}$ of the input image, expressed as follows:

[0029] $$\mathcal{H}_{i,j} = \phi\left(x_{\Psi_{i,j}}\right)$$

[0030] where $\Psi_{i,j}$ is used to index pixels, and $\mathcal{H}_{i,j}$ is the kernel produced by the kernel generation function $\phi$.

[0031] As a preferred embodiment of the method for constructing a small target detection model based on Yolov5 as described in this invention, the feature fusion method of adjusting the head prediction module includes using QFF as the feature fusion method, and achieving the purpose of feature fusion by setting weight coefficients α, β, and γ.

[0032] As a preferred embodiment of the method for constructing the small target detection model based on Yolov5 described in this invention, the weight coefficients are generated automatically: they are produced by a 1*1 convolution followed by a softmax function, and are learned through backpropagation, as shown below:

[0033]

[0034]

[0035] Where x is the input at each scale, and y is the feature map output after scale fusion in space. α, β, γ, and δ are the corresponding weight parameters, and the sum of the parameters is 1.

[0036] As a preferred embodiment of the method for constructing a small target detection model based on Yolov5 according to the present invention, expanding the detection size of the initial anchor box includes increasing the number of anchor box sizes used for small targets, medium targets, and large targets from three to seven each, as shown below:

[0037] - [7,9, 9,17, 17,15, 13,27, 19,27, 44,40, 38,94]  # P3/8

[0038] - [21,28, 36,18, 23,47, 35,33, 96,68, 86,152, 180,137]  # P4/16

[0039] - [58,29, 43,60, 82,46, 66,88, 140,301, 303,264, 238,542]  # P5/32

[0040] - [133,77, 111,135, 206,137, 197,290, 436,615, 739,380, 925,792]  # P6/64

[0041] Another objective of this invention is to provide a method for detecting small targets.

[0042] To solve the above-mentioned technical problems, the present invention provides the following technical solution: including, using a small target detection model obtained based on the construction method of the small target detection model described in claim 1 to detect small targets and obtain detection results.

[0043] The beneficial effects of this invention are:

[0044] 1) This invention provides a small target detection method based on Yolov5, which can effectively improve the accuracy of small target detection and enhance detection performance.

[0045] 2) This invention introduces an attention mechanism into the neck enhancement module of the target detection model. This mechanism removes the pooling operation and can further preserve the feature mapping. At the same time, some convolutions in this module adopt a new convolution operator, Involution. Involution has the characteristics of channel invariance and spatial specificity. In terms of parameter quantity, since the subsequent feature maps of the network are small, the use of this operator can greatly save parameters. In terms of computational quantity, since Involution does not need to integrate multi-channel input when outputting single-pixel results, the computational quantity is reduced by an order of magnitude.

[0046] 3) This invention adjusts the feature fusion method of the head prediction module in the target detection model and increases the detection size of the anchor box, making it better suited for small target detection problems.

Attached Figure Description

[0047] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the following description of the embodiments will be briefly introduced. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0048] Figure 1 is a flowchart of the method for constructing a small target detection model based on Yolov5 provided in Embodiment 1 of the present invention;

[0049] Figure 2 is a diagram of the small target detection model based on Yolov5 provided in Embodiment 1 of the present invention;

[0050] Figure 3 shows the detection results of different models on the Visdrone dataset in Embodiment 3 of the present invention;

[0051] Figure 4 shows the loss functions of different models on the Visdrone dataset in Embodiment 3 of the present invention;

[0052] Figure 5 shows the confusion matrices of different models on the Visdrone dataset in Embodiment 3 of the present invention;

[0053] Figure 6 shows the comparison between different models and the original model on the Visdrone dataset in Embodiment 3 of the present invention;

[0054] Figure 7 shows the precision and recall results of different models on the Visdrone dataset in Embodiment 3 of the present invention.

Detailed Implementation

[0055] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0056] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0057] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0058] This invention is described in detail with reference to the schematic diagrams. When detailing the embodiments of this invention, for ease of explanation, the cross-sectional views illustrating the device structure may be partially enlarged, not adhering to the usual scale. Furthermore, the schematic diagrams are merely examples and should not be construed as limiting the scope of protection of this invention. In actual fabrication, the three-dimensional spatial dimensions of length, width, and depth should be included.

[0059] Furthermore, in the description of this invention, it should be noted that the terms "upper," "lower," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. These terms are used solely for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. In addition, the terms "first," "second," or "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0060] Unless otherwise explicitly specified and limited, the terms "installation," "connection," and "joining" in this invention should be interpreted broadly. For example, they can refer to fixed connections, detachable connections, or integral connections; similarly, they can refer to mechanical connections, electrical connections, or direct connections, or indirect connections through an intermediate medium, or internal connections between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0061] Example 1

[0062] Referring to Figures 1-2, this embodiment provides a method for constructing a small target detection model based on Yolov5. Figure 1 shows the flow of the construction method, which includes:

[0063] S1: Construct a dataset that includes training samples and test samples;

[0064] Furthermore, images containing the targets to be tested are selected, and the positions of each target in the images are labeled. The labeled data is converted into YOLO format, and the coordinate information of each target is used to construct a target detection and localization dataset.
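The conversion of labeled data into YOLO format mentioned above can be sketched as follows. This is a minimal illustration, assuming the source annotation gives each box in pixels as (left, top, width, height); the function name `to_yolo_line` and the sample numbers are hypothetical and not taken from the patent. A YOLO label line holds the class index followed by the box center and size, each normalized by the image dimensions.

```python
def to_yolo_line(class_id, left, top, box_w, box_h, img_w, img_h):
    """Convert a pixel-space (left, top, width, height) box to a YOLO label line.

    YOLO labels store: class x_center y_center width height,
    with the four box values normalized to the range [0, 1].
    """
    x_center = (left + box_w / 2) / img_w
    y_center = (top + box_h / 2) / img_h
    norm_w = box_w / img_w
    norm_h = box_h / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}"


# Hypothetical example: a 40x20 pixel target in a 640x480 image.
print(to_yolo_line(0, 100, 50, 40, 20, 640, 480))
```

Small targets produce very small normalized widths and heights, which is one reason the anchor sizes are revisited later in this embodiment.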

[0065] S2: Train the Yolov5-based network model using the dataset;

[0066] Furthermore, Figure 2 shows the structure of the Yolov5-based network model of the present invention, which includes:

[0067] Feature extraction module: Responsible for extracting features from the target;

[0068] Neck enhancement module: Enhances the features extracted by the feature extraction module;

[0069] Head prediction module: performs target prediction and obtains detection results.

[0070] Furthermore, the Yolov5-based network model uses the Yolov5 network as its backbone model.

[0071] It should be noted that the YOLO series of network models are among the most classic one-stage algorithms in the field of object detection research;

[0072] A1: Introduce the GAM attention mechanism into the neck enhancement module;

[0073] Furthermore, the GAM attention mechanism includes a channel attention submodule and a spatial attention submodule:

[0074] The channel attention submodule uses a three-dimensional arrangement to retain information in three dimensions, and amplifies cross-dimensional channel-space dependencies through a two-layer multilayer perceptron (MLP).

[0075] It should be noted that the multilayer perceptron (MLP) is an encoder-decoder structure, similar to the BAM, with a compression ratio of r.

[0076] The spatial attention submodule uses two convolutional layers to fuse spatial information.

[0077] It should be noted that the spatial attention submodule uses the same reduction ratio r as the channel attention module, while removing the pooling operation to further preserve the feature map. Therefore, the spatial attention module sometimes significantly increases the number of parameters.

[0078] Furthermore, the GAM attention mechanism can amplify global interaction features. Given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate state $F_2$ and the output $F_3$ are defined as follows:

[0079] $$F_2 = M_c(F_1) \otimes F_1$$

[0080] $$F_3 = M_s(F_2) \otimes F_2$$

[0081] Where $F_1$ represents the input state, $F_2$ represents the intermediate state, $F_3$ represents the output state, and $M_c$ and $M_s$ are the channel attention map and the spatial attention map, respectively. C, H, and W represent the number of channels, image height, and image width, respectively, and $\otimes$ denotes element-wise multiplication.

[0082] It should be noted that the GAM attention mechanism can amplify global interactive features and, given a feature map, reduce information diffusion.
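The remark that the spatial attention submodule can significantly increase the number of parameters can be made concrete with a rough count. The sketch below rests on assumptions that the patent does not state: a reduction ratio r = 4, 7x7 kernels in the two spatial convolutions, bias terms in every layer, and no normalization layers. The function name `gam_param_counts` is hypothetical.

```python
def gam_param_counts(channels, r=4, kernel=7):
    """Rough parameter counts for the two GAM submodules.

    Assumptions (not from the patent): reduction ratio r, a kernel x kernel
    size for both spatial convolutions, biases included, no normalization.
    """
    hidden = channels // r
    # Channel attention: two-layer MLP, C -> C/r -> C.
    channel_mlp = (channels * hidden + hidden) + (hidden * channels + channels)
    # Spatial attention: two convolutions, C -> C/r -> C.
    k2 = kernel * kernel
    spatial_convs = (k2 * channels * hidden + hidden) + (k2 * hidden * channels + channels)
    return channel_mlp, spatial_convs


channel_params, spatial_params = gam_param_counts(256)
print(channel_params, spatial_params, round(spatial_params / channel_params, 1))
```

Under these assumptions the spatial submodule carries roughly kernel*kernel times as many weights as the channel submodule, which is consistent with the note about its parameter cost.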

[0083] A2: Some convolutions in the neck enhancement module use a new convolution operator;

[0084] Furthermore, the new convolution operator is the self-convolution operator Involution, in which the corresponding kernel $\mathcal{H}_{i,j}$ is generated from a single pixel $x_{i,j}$ of the input image, expressed as follows:

[0085] $$\mathcal{H}_{i,j} = \phi\left(x_{\Psi_{i,j}}\right)$$

[0086] where $\Psi_{i,j}$ is used to index pixels, and $\mathcal{H}_{i,j}$ is the kernel produced by the kernel generation function $\phi$.

[0087] It should be noted that the kernel of ordinary convolution enjoys two basic characteristics: spatial invariance and channel specificity. Involution, on the contrary, has both channel invariance and spatial specificity, transforming the channel specificity and space sharing of ordinary convolution into channel sharing and spatial specificity.

[0088] Furthermore, in terms of parameter count, ordinary convolution and involution are C*K*K*C and H*W*K*K*C respectively. Due to the small size of the subsequent feature maps in the network, involution can significantly reduce the number of parameters. In terms of computational cost, excluding the kernel generation part, ordinary convolution and involution are H*W*C*K*K*C and H*W*K*K*C respectively. Since involution, unlike convolution, does not need to integrate multi-channel inputs when outputting a single pixel result, the computational cost is reduced by an order of magnitude.
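The counts quoted above can be evaluated directly. The following sketch simply plugs numbers into the four expressions from the text; the layer shape used (C = 512, K = 3, a 20x20 feature map) is a hypothetical example chosen to represent a deep, low-resolution layer and is not taken from the patent.

```python
def conv_vs_involution(C, K, H, W):
    """Evaluate the parameter and cost expressions quoted in the text.

    C: channels, K: kernel size, H/W: feature map height and width.
    The cost figures exclude involution's kernel generation step,
    as the text states.
    """
    return {
        "conv_params": C * K * K * C,
        "inv_params": H * W * K * K * C,
        "conv_cost": H * W * C * K * K * C,
        "inv_cost": H * W * K * K * C,
    }


# Hypothetical deep layer: 512 channels, 3x3 kernel, 20x20 feature map.
stats = conv_vs_involution(C=512, K=3, H=20, W=20)
print(stats)
print("cost ratio:", stats["conv_cost"] // stats["inv_cost"])
```

With these expressions the cost ratio is always C, and the parameter count favors involution only when H*W is smaller than C, which matches the text's point about small feature maps late in the network.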

[0089] A3: Adjust the feature fusion method of the head prediction module;

[0090] Furthermore, QFF is adopted as the feature fusion method, and the feature fusion purpose is achieved by setting weight coefficients α, β, and γ.

[0091] It should be noted that QFF is an adaptive feature fusion method for object detection that can improve the scale invariance of features.

[0092] Furthermore, the weight coefficients α, β, and γ are generated automatically: they are produced by a 1*1 convolution followed by the softmax function, and are learned through backpropagation, as shown below:

[0093]

[0094]

[0095] Where x is the input at each scale, and y is the feature map output after scale fusion in space. α, β, and γ are the corresponding weight parameters, and the sum of the parameters is 1.
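The weighted fusion described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the patent's QFF implementation: the per-scale feature maps are assumed to be already resized to a common resolution, the per-position logits stand in for the output of the 1*1 convolution, and the names `softmax` and `fuse_scales` are hypothetical. The softmax guarantees that the weights at each position sum to 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax; the returned weights sum to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, logit_maps):
    """Fuse same-shaped 2D feature maps with per-position softmax weights.

    features:   list of 2D lists, one per scale (already resized).
    logit_maps: list of 2D lists, one per scale, standing in for the
                output of the 1*1 convolution (an assumption).
    """
    rows, cols = len(features[0]), len(features[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            weights = softmax([lm[i][j] for lm in logit_maps])
            fused[i][j] = sum(w * f[i][j] for w, f in zip(weights, features))
    return fused


# Three scales with equal logits reduce to a plain average.
feats = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
logits = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]
print(fuse_scales(feats, logits))
```

Because the logits are trainable, backpropagation can shift weight toward whichever scale is most informative at each position.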

[0096] A4: Expand the detection size of the initial anchor box in the head prediction module;

[0097] Furthermore, the number of anchor box sizes used for small, medium, and large targets is expanded from three to seven each, as shown below:

[0098] - [7,9, 9,17, 17,15, 13,27, 19,27, 44,40, 38,94]  # P3/8

[0099] - [21,28, 36,18, 23,47, 35,33, 96,68, 86,152, 180,137]  # P4/16

[0100] - [58,29, 43,60, 82,46, 66,88, 140,301, 303,264, 238,542]  # P5/32

[0101] - [133,77, 111,135, 206,137, 197,290, 436,615, 739,380, 925,792]  # P6/64

It should be noted that the anchor box is the pixel box used for prediction in the object detection model.
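The anchor lists above can be checked mechanically. The sketch below assumes that each flat list is a sequence of (width, height) pairs in pixels, as in the usual YOLOv5 anchor configuration; that reading, and the names `ANCHORS` and `to_pairs`, are assumptions for illustration. It confirms that every listed level carries seven anchors.

```python
# Anchor values copied from the lists above; keys are the level labels.
ANCHORS = {
    "P3/8": [7, 9, 9, 17, 17, 15, 13, 27, 19, 27, 44, 40, 38, 94],
    "P4/16": [21, 28, 36, 18, 23, 47, 35, 33, 96, 68, 86, 152, 180, 137],
    "P5/32": [58, 29, 43, 60, 82, 46, 66, 88, 140, 301, 303, 264, 238, 542],
    "P6/64": [133, 77, 111, 135, 206, 137, 197, 290, 436, 615, 739, 380, 925, 792],
}


def to_pairs(flat):
    """Group a flat anchor list into (width, height) pairs (assumed layout)."""
    if len(flat) % 2 != 0:
        raise ValueError("anchor list must contain an even number of values")
    return list(zip(flat[0::2], flat[1::2]))


for level, flat in ANCHORS.items():
    pairs = to_pairs(flat)
    print(level, len(pairs), "anchors, smallest:", min(pairs, key=lambda p: p[0] * p[1]))
```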

[0102] S3: Output the target detection model after training.

[0103] Example 2

[0104] To verify the beneficial effects of the present invention, this embodiment demonstrates them through a practical application.

[0105] The dataset used in this embodiment is the Visdrone dataset, a publicly available drone image dataset that includes 6471 training samples and 1610 test samples. The annotations in the Visdrone dataset are converted into YOLO format.

[0106] Table 1 compares, on the Visdrone dataset, the detection performance of models built with different methods using Yolov5 as the backbone model:

[0107] Table 1. Detection performance of models built using different methods

[0108]

[0109] To improve the performance of traditional YOLOv5 object detection, this invention introduces an attention mechanism into the neck enhancement module of the YOLOv5-based object detection model. This mechanism removes pooling operations, which can further preserve feature maps. At the same time, some convolutions in this module adopt a new convolution operator, Involution. Involution has the characteristics of channel invariance and spatial specificity. In terms of parameter quantity, since the subsequent feature maps of the network are small, the use of this operator can greatly save parameters. In terms of computational cost, since Involution does not need to integrate multi-channel input when outputting single-pixel results, the computational cost is reduced by an order of magnitude.

[0110] Meanwhile, the feature fusion method of the head prediction module in the object detection model was adjusted, and the detection size of the anchor box was increased, making it more suitable for small object detection problems. As can be seen from the results in the table above, the above improvements made in this invention can significantly improve the object detection performance of traditional Yolov5.

[0111] Example 3

[0112] Referring to Figures 3-7, this is another embodiment of the present invention. Unlike the previous two embodiments, and in order to verify and illustrate the technical effects of the method, this embodiment runs a comparative test between traditional technical solutions and the method of the present invention, and compares the test results to verify the actual effect of the method.

[0113] This embodiment also uses the visdrone dataset for comparative testing.

[0114] This embodiment compares the Yolo-h method of the present invention with several existing advanced methods, including Yolov3, Yolov5s, Yolov5m, and Yolov5l. The comparison results are shown in Table 2.

[0115] Table 2 Comparative Experiments

[0116]

[0117] As can be seen from Table 2, compared with existing small target detection models and the original model, the small target detection model based on the Yolov5 network structure provided by this invention has better performance in detection.

[0118] Figure 3 shows the comparative experimental results on the Visdrone dataset in this embodiment; analysis reveals that Yolo-h exhibits higher detection accuracy. The confusion matrix of this embodiment is shown in Figure 4, the loss function graph in Figure 5, the comparison between this embodiment and the original model in Figure 6, and the graph of precision versus recall in Figure 7.

[0119] The results above clearly show that the method of the present invention, by improving the Yolov5 network model and applying it to target detection, can effectively improve the accuracy of small target detection and enhance detection performance. Compared with other existing methods, it has higher detection accuracy.

[0120] It should be recognized that embodiments of the present invention can be implemented or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer-readable storage medium. The method can be implemented using standard programming techniques—including a non-transitory computer-readable storage medium configured with a computer program, wherein such a storage medium causes the computer to operate in a specific and predefined manner—according to the methods and drawings described in the specific embodiments. Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with the computer system. However, if desired, the program can be implemented in assembly or machine language. In any case, the language can be a compiled or interpreted language. Furthermore, for this purpose, the program can run on a programmed application-specific integrated circuit (ASIC).

[0121] Furthermore, the procedures described herein may be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by the context. The procedures described herein (or variations and / or combinations thereof) may be executed under the control of one or more computer systems configured with executable instructions, and may be implemented by hardware or a combination thereof as code (e.g., executable instructions, one or more computer programs, or one or more applications) that commonly executes on one or more processors. The computer program comprises a plurality of instructions executable by one or more processors.

[0122] Furthermore, the method can be implemented in any suitable type of computing platform, including but not limited to personal computers, minicomputers, mainframes, workstations, networked or distributed computing environments, standalone or integrated computer platforms, or in communication with charged particle tools or other imaging devices, etc. Aspects of the invention can be implemented as machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optical read and / or write storage medium, RAM, ROM, etc., such that it is readable by a programmable computer, and when the storage medium or device is read by the computer, it can be used to configure and operate the computer to perform the processes described herein. Furthermore, the machine-readable code, or portions thereof, can be transmitted via wired or wireless networks. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media comprises instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. When programmed according to the methods and techniques described herein, the invention also includes the computer itself. A computer program can be applied to input data to perform the functions described herein, thereby transforming the input data to generate output data stored in non-volatile memory. The output information can also be applied to one or more output devices such as a display. In a preferred embodiment of the invention, the converted data represents physical and tangible objects, including specific visual depictions of physical and tangible objects generated on a display.

[0123] As used herein, the terms “component,” “module,” “system,” etc., are intended to refer to a computer-related entity, which may be hardware, firmware, a combination of hardware and software, software, or running software. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a running thread, a program, and / or a computer. As an example, an application running on a computing device and the computing device itself can both be components. One or more components may reside in a running process and / or thread, and components may be located in a single computer and / or distributed among two or more computers. Furthermore, these components are capable of execution from various computer-readable media having various data structures thereon. These components may communicate locally and / or remotely via signals, such as based on one or more data packets (e.g., data from a component that interacts with a local system, another component in a distributed system, and / or signals that interact with other systems via a network such as the Internet).

[0124] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A method for constructing a small target detection model based on Yolov5, characterized by comprising:

constructing a dataset that includes training samples and test samples;

training a Yolov5-based network model using the dataset, wherein the Yolov5-based network model uses the Yolov5 network as its backbone model and includes a feature extraction module, a neck enhancement module, and a head prediction module; the neck enhancement module introduces a GAM attention mechanism, and some of its convolutions use a new convolution operator;

wherein the GAM attention mechanism includes a channel attention submodule and a spatial attention submodule; the channel attention submodule uses a three-dimensional arrangement to retain information across three dimensions and amplifies cross-dimensional channel-spatial dependencies through a two-layer multilayer perceptron (MLP); the spatial attention submodule uses two convolutional layers to fuse spatial information and omits the max pooling operation; the GAM attention mechanism amplifies global interaction features, and, given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate state $F_2$ and the output $F_3$ are defined as:

$$F_2 = M_c(F_1) \otimes F_1$$

$$F_3 = M_s(F_2) \otimes F_2$$

where $F_1$ represents the input state, $F_2$ the intermediate state, and $F_3$ the output state; $M_c$ and $M_s$ are the channel attention map and the spatial attention map, respectively; $C$, $H$, and $W$ are the number of channels, the image height, and the image width, respectively; and $\otimes$ denotes element-wise multiplication;

wherein the new convolution operator is the involution operator (Involution), which includes generating, from a single pixel... of the input image, the corresponding kernel $\mathcal{H}_{i,j}$, expressed as:

$$\mathcal{H}_{i,j} = \phi(X_{\Psi_{i,j}})$$

where $\Psi_{i,j}$ is used to index the pixels and $\phi$ is the kernel generation function;

adjusting the feature fusion method of the head prediction module, and expanding the detection sizes of the initial anchor boxes;

wherein the adjusted feature fusion method of the head prediction module uses QFF, implemented by setting weight coefficients; the weight coefficients are generated automatically by a 1x1 convolution followed by softmax processing and are learned through backpropagation:

$$Y = \sum_{n} w_n X_n$$

$$w_n = \frac{e^{\lambda_n}}{\sum_{m} e^{\lambda_m}}$$

where $X_n$ is the input from each scale, $Y$ is the feature map output after spatial scale fusion, $\lambda_n$ is the 1x1 convolution output for scale $n$, and $w_n$ are the corresponding weight parameters, which sum to 1;

wherein expanding the detection sizes of the initial anchor boxes includes expanding the detection sizes for small, medium, and large targets from three to seven each, as follows:

- [7,9, 9,17, 17,15, 13,27, 19,27, 44,40, 38,94] # P3/8
- [21,28, 36,18, 23,47, 35,33, 96,68, 86,152, 180,137] # P4/16
- [58,29, 43,60, 82,46, 66,88, 140,301, 303,264, 238,542] # P5/32
- [133,77, 111,135, 206,137, 197,290, 436,615, 739,380, 925,792] # P6/64

and outputting the trained target detection model.
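
As an illustration only, the anchor lists recited in claim 1 can be read as flat sequences of width,height values, which is the layout used by YOLOv5 model configuration files. The sketch below groups each list into (width, height) pairs and confirms that every detection layer carries seven anchors. The names `ANCHORS` and `to_pairs` are hypothetical and are not taken from the patent.

```python
# Anchor values copied from claim 1; the dictionary keys reuse the
# "P3/8" style layer/stride labels that follow each list in the claim.
ANCHORS = {
    "P3/8": [7, 9, 9, 17, 17, 15, 13, 27, 19, 27, 44, 40, 38, 94],
    "P4/16": [21, 28, 36, 18, 23, 47, 35, 33, 96, 68, 86, 152, 180, 137],
    "P5/32": [58, 29, 43, 60, 82, 46, 66, 88, 140, 301, 303, 264, 238, 542],
    "P6/64": [133, 77, 111, 135, 206, 137, 197, 290, 436, 615, 739, 380, 925, 792],
}


def to_pairs(flat):
    """Group a flat [w0, h0, w1, h1, ...] list into (w, h) tuples.

    Assumption: values alternate width, height, as in YOLOv5 configs.
    """
    if len(flat) % 2 != 0:
        raise ValueError("anchor list must contain an even number of values")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


# One list of (w, h) anchors per detection layer.
anchor_pairs = {level: to_pairs(values) for level, values in ANCHORS.items()}

for level, pairs in anchor_pairs.items():
    print(f"{level}: {len(pairs)} anchors -> {pairs}")
```

Each of the four lists holds fourteen values, so each layer yields seven anchors, which matches the "three to seven" expansion stated in the claim.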

2. The method for constructing a small target detection model based on Yolov5 as described in claim 1, characterized in that the Yolov5-based network model includes: a feature extraction module, which extracts features from the target; a neck enhancement module, which enhances the features extracted by the feature extraction module; and a head prediction module, which performs target prediction and obtains detection results.
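
Claim 1 places a weighted feature fusion in the head prediction module, with weights that pass through a softmax so that they sum to 1. The following is a minimal sketch of that idea under simplifying assumptions: each scale is a flat list of values already resized to a common length, and one scalar weight is used per scale, whereas the patent's QFF may compute weights per spatial position. The functions `softmax` and `fuse_scales` and the argument `logits` (a stand-in for the 1x1 convolution output) are hypothetical names, not the patent's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scales(features, logits):
    """Fuse same-length feature vectors with softmax-normalized weights.

    features: list of N feature vectors (lists of floats of equal length),
              assumed to be already resized to a common resolution.
    logits:   list of N raw scores, one per scale (hypothetical stand-in
              for the output of the 1x1 convolution).
    Returns the fused vector and the weights that were applied.
    """
    if len(features) != len(logits):
        raise ValueError("need exactly one logit per input scale")
    weights = softmax(logits)
    length = len(features[0])
    fused = [
        sum(w * f[k] for w, f in zip(weights, features))
        for k in range(length)
    ]
    return fused, weights


if __name__ == "__main__":
    scales = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    fused, weights = fuse_scales(scales, [0.0, 0.0, 0.0, 0.0])
    print(weights, fused)
```

With equal logits every scale receives the same weight, so the fused output is the plain average of the inputs; during training, backpropagation would adjust the logits so that more informative scales receive larger weights.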

3. A method for detecting small targets using a Yolov5-based detection model, characterized by comprising: detecting small targets with the small target detection model obtained by the construction method described in claim 1, and obtaining detection results.