Target detection model training method, target detection method, device, equipment and medium

By training a target detection model using multi-layer feature extraction and feature fusion, the problem of low detection accuracy for smaller targets is solved, and accurate detection of smaller targets is achieved.

CN122265623A · Pending · Publication Date: 2026-06-23 · SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD +2


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
Filing Date: 2024-12-20
Publication Date: 2026-06-23

Application Information


IPC: G06V10/25; G06V10/44; G06V10/80; G06V10/82; G06N3/045; G06N3/0464; G06N3/08

AI Tagging

Application Domain

Character and pattern recognition; Neural learning methods

Technology Topics

Pattern recognition; Network output


AI Technical Summary

Technical Problem

Existing technologies suffer from the loss of feature information when detecting smaller targets, resulting in low detection accuracy.

Method used

A multi-layer feature extraction and feature fusion method is adopted. Multi-layer feature maps are extracted through the first initial network, feature fusion is performed using the second initial network, and target detection is performed by combining the third initial network. The target detection model is trained using the convergence condition of the target loss function value.

Benefits of technology

It effectively avoids the loss of multi-layer information for smaller targets and improves the detection accuracy for smaller targets.

✦ Generated by Eureka AI based on patent content.

Figure CN122265623A_ABST (abstract drawing)

Abstract

The application discloses a target detection model training method, a target detection method, a device, equipment and a medium. The method comprises the following steps: acquiring a to-be-trained image, the to-be-trained image corresponding to at least one training target label frame; inputting the to-be-trained image into a first initial network to output a plurality of multi-layer feature maps; inputting the plurality of multi-layer feature maps into a second initial network to output a plurality of fusion feature maps; inputting the plurality of fusion feature maps into a third initial network to output at least one predicted target detection frame; determining a target loss function value of an initial detection model based on the at least one predicted target detection frame and the training target label frame corresponding to each predicted target detection frame; and when the target loss function value meets a preset convergence condition, determining a target detection model corresponding to the initial detection model based on the first initial network, the second initial network and the third initial network. The method can improve the detection precision of detecting smaller to-be-detected targets.


Description

Technical Field

[0001] This invention relates to the field of image detection technology, and in particular to a target detection model training method, target detection method, apparatus, equipment and medium.

Background Technology

[0002] Currently, target detection technology is widely used in scenarios such as autonomous driving, security monitoring, and road and bridge defect detection. For example, it's used for obstacle detection in autonomous driving scenarios, for detecting objects requiring real-time monitoring in security monitoring scenarios, and for detecting defects such as cracks and corrosion in road and bridge scenarios. In each application scenario, the smaller the target size, the more difficult the detection becomes. For instance, in road and bridge scenarios, if the target is a small area of corrosion, detecting such a small area of corrosion within that scenario is quite difficult.

[0003] To detect smaller targets in various application scenarios, existing technologies typically employ convolutional neural networks (CNNs) to perform target detection on smaller targets in images acquired within those scenarios. This method achieves detection by continuously reducing the size of the corresponding feature map. However, in the process of continuously reducing the feature map size, existing technologies sometimes lose feature information corresponding to smaller targets, leading to lower detection accuracy for these smaller targets. Therefore, improving the detection accuracy for smaller targets is a pressing technical problem that needs to be solved.

Summary of the Invention

[0004] This invention provides a target detection model training method, target detection method, apparatus, device, and medium to address the problem of low detection accuracy for smaller targets.

[0005] A target detection model training method, comprising: acquiring an image to be trained, wherein the image to be trained corresponds to at least one training target label box, and the size of the training target label box is smaller than a first size threshold; inputting the image to be trained into the first initial network in the initial detection model, and outputting multiple multi-layer feature maps corresponding to the image to be trained; inputting the multiple multi-layer feature maps into the second initial network in the initial detection model, and outputting multiple fused feature maps corresponding to the image to be trained; inputting the multiple fused feature maps into the third initial network in the initial detection model, and outputting at least one predicted target detection box; determining the target loss function value of the initial detection model based on the at least one predicted target detection box and the training target label box corresponding to each predicted target detection box; and, when the target loss function value satisfies the preset convergence condition, determining the target detection model corresponding to the initial detection model based on the first initial network, the second initial network, and the third initial network.

[0006] A target detection method, comprising: acquiring the image to be detected; and performing target detection on the image to be detected using the target detection model trained with the above-mentioned target detection model training method, thereby obtaining a predicted target detection box for at least one target to be detected in the image to be detected.

[0007] A target detection model training device, comprising: The training image acquisition module is used to acquire a training image, wherein the training image corresponds to at least one training target label box; the size of the training target label box is smaller than a first size threshold. The multi-layer feature map acquisition module is used to input the image to be trained into the first initial network in the initial detection model and output multiple multi-layer feature maps corresponding to the image to be trained. The fusion feature map acquisition module is used to input multiple multi-layer feature maps into the second initial network in the initial detection model and output multiple fusion feature maps corresponding to the image to be trained. The predicted target detection box acquisition module is used to input multiple fused feature maps into the third initial network in the initial detection model and output at least one predicted target detection box. The target loss function value determination module determines the target loss function value of the initial detection model based on at least one predicted target detection box and the training target label box corresponding to each predicted target detection box. The target detection model determination module is used to determine the target detection model corresponding to the initial detection model based on the first initial network, the second initial network, and the third initial network when the target loss function value satisfies the preset convergence condition.

[0008] A target detection device, characterized in that it comprises: The image acquisition module is used to acquire the image to be detected. The detection module is used to perform target detection on the image to be detected using a trained target detection model, and to obtain a predicted target detection box for at least one target to be detected corresponding to the image to be detected.

[0009] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the target detection model training method described above, or the processor implements the target detection method described above when it executes the computer program.

[0010] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described target detection model training method, or, when executed by a processor, implements the above-described target detection method.

[0011] The aforementioned target detection model training method, target detection method, apparatus, device, and medium employ a first initial network to extract features corresponding to multi-layer information in the image to be trained, obtaining multiple multi-layer feature maps. This effectively avoids the loss of multi-layer information corresponding to smaller targets in the image to be trained. A second initial network is then used to fuse features from the multiple multi-layer feature maps, obtaining multiple fused feature maps. This facilitates accurate detection of smaller targets based on the fused feature maps containing the multi-layer information corresponding to them. The trained target detection model is obtained when the target loss function converges. The target detection model trained using the above method can accurately extract, fuse, and detect multi-layer information corresponding to smaller targets in an image, thereby improving the detection accuracy for smaller targets.

Attached Figure Description

[0012] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a flowchart of a target detection model training method according to an embodiment of the present invention;
Figure 2 is another flowchart of the target detection model training method in one embodiment of the present invention;
Figure 3 is another flowchart of the target detection model training method in one embodiment of the present invention;
Figure 4 is another flowchart of the target detection model training method in one embodiment of the present invention;
Figure 5 is another flowchart of the target detection model training method in one embodiment of the present invention;
Figure 6 is a flowchart of a target detection method according to an embodiment of the present invention;
Figure 7 is a schematic diagram of a target detection model training device in one embodiment of the present invention;
Figure 8 is a schematic diagram of a computer device according to an embodiment of the present invention;
Figure 9 is a structural diagram of the initial detection model in one embodiment of the present invention;
Figure 10 is a structural diagram of a multi-scale convolutional unit in one embodiment of the present invention.

Detailed Implementation

[0014] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0015] The target detection model training method provided in this embodiment of the invention is used to improve the detection accuracy of small targets.

[0016] In one embodiment, as shown in Figure 1, a target detection model training method is provided. Taking its application to the computer device shown in Figure 8 as an example, the method includes the following steps:
S101: Obtain the image to be trained, which corresponds to at least one training target label box; the size of the training target label box is smaller than a first size threshold.
S102: Input the image to be trained into the first initial network in the initial detection model, and output multiple multi-layer feature maps corresponding to the image to be trained.
S103: Input the multiple multi-layer feature maps into the second initial network in the initial detection model, and output multiple fused feature maps corresponding to the image to be trained.
S104: Input the multiple fused feature maps into the third initial network in the initial detection model, and output at least one predicted target detection box.
S105: Determine the target loss function value of the initial detection model based on the at least one predicted target detection box and the training target label box corresponding to each predicted target detection box.
S106: When the target loss function value satisfies the preset convergence condition, determine the target detection model corresponding to the initial detection model based on the first initial network, the second initial network, and the third initial network.

[0017] Here, the image to be trained refers to the image used for model training. The training target label box refers to the detection box corresponding to the smaller region where the target to be detected is located, serving as the label for model training. The first size threshold is used to limit the maximum size of the training target label box. Understandably, since the training target label box is smaller than the first size threshold, the size of the smaller target to be detected is also smaller than the first size threshold.

[0018] As an example, in step S101, the computer device acquires a training image for model training. This training image corresponds to at least one training target label box for a smaller target to be detected. In this example, the computer device acquires multiple images from the application scenario as training images, and labels at least one smaller target to be detected in each training image with a bounding box according to a standard of being smaller than a first size threshold, thus obtaining at least one training target label box. For example, the computer device acquires a road image in a road defect detection scenario that includes at least one road defect smaller than the first size threshold, uses this road image as a training image, and labels each road defect smaller than the first size threshold in the training image, thus obtaining multiple training target label boxes.
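The labeling rule in step S101 can be illustrated with a short Python sketch. The text does not say how the "size" of a label box is measured, so the sketch assumes the longer side in pixels; the function names, the box format `(x_min, y_min, x_max, y_max)`, and the threshold value 32 are all hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_size(box: Box) -> float:
    """Return the longer side of the box (one possible definition of 'size')."""
    x_min, y_min, x_max, y_max = box
    return max(x_max - x_min, y_max - y_min)


def select_small_label_boxes(boxes: List[Box], first_size_threshold: float) -> List[Box]:
    """Keep only the boxes whose size is smaller than the first size threshold."""
    return [box for box in boxes if box_size(box) < first_size_threshold]


# Hypothetical annotations for one road image; only small defects are kept.
labels = [(10, 10, 22, 18), (0, 0, 200, 150), (40, 40, 71, 60)]
small_labels = select_small_label_boxes(labels, first_size_threshold=32)
print(small_labels)
```

Other size measures, such as box area, would fit the same pattern by swapping out `box_size`.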

[0019] The initial detection model is used for object detection, specifically for detecting smaller objects in the training image. The first initial network is the network in the initial detection model used for image feature extraction. Multi-layer feature maps refer to feature maps obtained by extracting features from multiple levels of information in the training image.

[0020] As an example, in step S102, the computer device inputs the training image collected in the application scenario into the first initial network of the initial detection model. The first initial network extracts features from the low-level, mid-level, and high-level information corresponding to the training image, outputting multiple multi-layer feature maps corresponding to the training image. For example, the computer device uses the first initial network to extract features from the multi-layer information corresponding to the training image acquired in a road defect detection scenario, obtaining multiple multi-layer feature maps. The high-level information contains rich image semantic information, the mid-level information contains some image semantic information and partial location information corresponding to each target to be detected in the image, and the low-level information contains rich location information corresponding to each target to be detected in the image. Understandably, extracting features from the low-level, mid-level, and high-level information of the training image containing at least one smaller target to be detected achieves feature extraction of the low-level, mid-level, and high-level information of the smaller target to be detected, avoiding the loss of information from each layer of the smaller target to be detected in the training image. This makes it feasible to subsequently perform more accurate detection of the smaller target to be detected in the training image based on the multi-layer feature maps containing multi-layer information.

[0021] The second initial network is the network used for image feature fusion in the initial detection model.

[0022] As an example, in step S103, the computer device inputs multiple multi-layer feature maps corresponding to the image to be trained into the second initial network of the initial detection model. The second initial network performs feature fusion processing on the multiple multi-layer feature maps to obtain multiple fused feature maps corresponding to the image to be trained. For example, the computer device uses the second initial network to perform feature fusion on multiple multi-layer feature maps corresponding to the image to be trained in a road defect detection scenario to obtain multiple fused feature maps. Understandably, the multiple multi-layer feature maps include high-level, mid-level, and low-level information of the image to be trained. During the detection of smaller targets in the image to be trained, feature fusion processing is performed on the multiple multi-layer feature maps containing high-level, mid-level, and low-level information to obtain multiple fused feature maps. This facilitates the detection of high-level, mid-level, and low-level information corresponding to smaller targets when performing feature detection on the fused feature maps, thereby improving the target detection accuracy for smaller targets.

[0023] The third initial network refers to the network used for object detection in the initial detection model. The predicted object detection box refers to the detection box containing a small target object obtained after object detection.

[0024] As an example, in step S104, the computer device inputs multiple fused feature maps corresponding to the image to be trained into the third initial network of the initial detection model. The third initial network is used to perform target detection on the multiple fused feature maps to obtain at least one predicted target detection box. Understandably, the fused feature maps integrate high-level information containing rich image semantic information, mid-level information containing image semantic information and the location information of the target to be detected, and low-level information containing rich location information of the target to be detected. That is, the fused feature maps contain relatively comprehensive feature information corresponding to smaller targets to be detected in the image. Using the third initial network to perform target detection on the fused feature maps can effectively improve the detection accuracy for smaller targets to be detected.

[0025] The target loss function value refers to the loss function value generated while training the initial detection model.

[0026] As an example, in step S105, the computer device uses a preset matching algorithm to determine the training target label box corresponding to each predicted target detection box, and calculates the loss function value based on the difference between each predicted target detection box and its matched training target label box, thereby determining the target loss function value of the initial detection model during training. For example, the computer device obtains the center point coordinates of each predicted target detection box and of each training target label box, and computes the distance between the center point of each predicted target detection box and the center point of each training target label box. The training target label box whose center point is closest to the center point of a predicted target detection box is determined as the training target label box that matches that predicted target detection box; this procedure is applied to every predicted target detection box. The computer device then calculates the difference between each predicted target detection box and its matched training target label box to obtain the difference loss function value corresponding to each predicted target detection box. A preset loss algorithm is used to process the difference loss function values corresponding to the predicted target detection boxes to obtain the target loss function value of the initial detection model during training. The preset loss algorithm includes, but is not limited to, weighted summation of the difference loss function values corresponding to all predicted target detection boxes. In this example, the training target label boxes serve as training labels, and the target loss function value is determined based on each predicted target detection box and its corresponding training target label box. This directly reflects how well the initial detection model has been trained, which makes it possible to obtain a target detection model capable of accurately detecting smaller targets.
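The center-distance matching and weighted summation described for step S105 can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not define the "difference" between two boxes, so the mean absolute coordinate difference is used as a hypothetical stand-in, and uniform weights are assumed when none are given. All function names are illustrative.

```python
import math
from typing import List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Center point coordinates of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def match_label_boxes(predicted: List[Box], labels: List[Box]) -> List[int]:
    """For each predicted box, return the index of the label box with the nearest center."""
    matches = []
    for pred in predicted:
        distances = [math.dist(center(pred), center(label)) for label in labels]
        matches.append(distances.index(min(distances)))
    return matches


def difference_loss(pred: Box, label: Box) -> float:
    """Hypothetical difference measure: mean absolute coordinate difference."""
    return sum(abs(p - l) for p, l in zip(pred, label)) / 4.0


def target_loss(predicted: List[Box], labels: List[Box],
                weights: Optional[List[float]] = None) -> float:
    """Weighted sum of the per-box difference losses (uniform weights assumed by default)."""
    if not predicted:
        return 0.0
    if weights is None:
        weights = [1.0 / len(predicted)] * len(predicted)
    matches = match_label_boxes(predicted, labels)
    return sum(w * difference_loss(p, labels[m])
               for w, p, m in zip(weights, predicted, matches))


labels = [(10, 10, 20, 20), (100, 100, 110, 110)]
predicted = [(11, 11, 21, 21), (98, 100, 108, 112)]
print(match_label_boxes(predicted, labels), target_loss(predicted, labels))
```

Note that nearest-center matching lets several predicted boxes share one label box; the text does not say whether the preset matching algorithm forbids this.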

[0027] The preset convergence condition refers to the preset condition for judging whether the initial detection model has converged during the training process.

[0028] As an example, in step S106, the computer device determines whether the target loss function value meets the preset convergence condition. If the target loss function value does not meet the preset convergence condition, the computer device updates the first initial network, the second initial network, and the third initial network in the initial detection model, and repeats steps S101 to S106 to continue training the initial detection model until the target loss function value meets the preset convergence condition, at which point the first initial network, the second initial network, and the third initial network are considered successfully trained. The trained first, second, and third initial networks together form the trained initial detection model, which is taken as the target detection model used for accurate detection of small targets. In this way, the method trains and updates a first initial network for extracting features corresponding to multi-layer information, a second initial network for fusing those features, and a third initial network for detecting smaller targets in the fused feature maps. The result is a target detection model capable of accurately extracting, fusing, and detecting the multi-layer information corresponding to smaller targets in an image. This target detection model avoids losing the multi-layer information corresponding to smaller targets during image processing and fuses the multi-layer feature maps that carry this information, so that detection draws on all layers and detection accuracy improves.
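The update-and-repeat loop of step S106 can be sketched generically. The text leaves the form of the preset convergence condition open, so the sketch assumes "loss below a threshold" plus an iteration cap. The two callables are hypothetical placeholders for steps S101 to S105 and for the network update, and a one-parameter toy problem stands in for the three networks.

```python
from typing import Callable, List


def train_until_converged(compute_loss: Callable[[], float],
                          update_networks: Callable[[], None],
                          loss_threshold: float,
                          max_iterations: int = 1000) -> List[float]:
    """Repeat loss evaluation and network updates until the loss converges.

    compute_loss stands in for steps S101-S105; update_networks stands in for
    updating the first, second, and third initial networks.
    """
    history = []
    for _ in range(max_iterations):
        loss = compute_loss()
        history.append(loss)
        if loss < loss_threshold:  # assumed form of the preset convergence condition
            break
        update_networks()
    return history


# Toy stand-in: one parameter w with loss (w - 3)^2, updated by gradient descent.
state = {"w": 0.0}


def toy_loss() -> float:
    return (state["w"] - 3.0) ** 2


def toy_update() -> None:
    gradient = 2.0 * (state["w"] - 3.0)
    state["w"] -= 0.25 * gradient


history = train_until_converged(toy_loss, toy_update, loss_threshold=1e-6)
print(len(history), history[-1])
```

Other convergence tests, such as a small change in loss between iterations, would only change the `if` condition.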

[0029] In this embodiment, a first initial network is used to extract features corresponding to the multi-layer information of the image to be trained, resulting in multiple multi-layer feature maps. This effectively avoids the loss of multi-layer information corresponding to smaller targets in the image to be trained. A second initial network is then used to fuse the features of the multiple multi-layer feature maps, resulting in multiple fused feature maps. This allows for accurate detection of smaller targets based on the fused feature maps containing the multi-layer information corresponding to them. When the target loss function converges, the trained target detection model is obtained. The target detection model trained using the above method can accurately extract, fuse, and detect multi-layer information corresponding to smaller targets in the image, thereby improving the detection accuracy for smaller targets.

[0030] In one embodiment, the first initial network includes a first vector convolution module, multiple hybrid convolution modules, a pooling module, and an attention mechanism module connected in sequence; each hybrid convolution module includes vector convolution units and multi-scale convolution units connected in sequence. The multiple hybrid convolution modules are a first hybrid convolution module, a second hybrid convolution module, a third hybrid convolution module, and a fourth hybrid convolution module.

[0031] As an example, Figure 9 shows a structural diagram of the initial detection model in one embodiment. As can be seen from Figure 9, the first initial network includes a first vector convolution module (Conv), a first hybrid convolution module, a second hybrid convolution module, a third hybrid convolution module, a fourth hybrid convolution module, a fast spatial pyramid pooling module (SPPF), and an attention mechanism module (C2PSA) connected in sequence. The first vector convolution module includes one vector convolution unit, and each hybrid convolution module includes one vector convolution unit (Conv) and one multi-scale convolution unit (MSC) connected in sequence. Figure 10 shows a structural diagram of the multi-scale convolutional unit. As can be seen from Figure 10, the multi-scale convolutional unit includes four vector convolutional units (Conv) with different kernel sizes, a concatenation unit (Concat), and a SimAM attention mechanism unit (Simple, Parameter-Free Attention Module). Specifically, in Figure 10 of this embodiment, the kernel sizes of the four vector convolutional units (Conv) are 1*1, 3*3, 5*5, and 7*7, respectively. Because the multi-scale convolutional unit (MSC) employs vector convolutional units (Conv) with different kernel sizes, it can effectively extract features corresponding to the multi-scale information of smaller targets in the image. This avoids the loss of information corresponding to smaller targets and improves the detection accuracy for them. The multi-scale convolutional unit (MSC) also uses the SimAM attention mechanism so that, during feature extraction and fusion, the extracted and fused features concentrate on the image region corresponding to the smaller target. When applied to the target detection model, the multi-scale convolutional unit (MSC) therefore improves the detection accuracy for smaller targets.
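To make the role of the SimAM unit more concrete, the following Python sketch computes SimAM attention weights for a single feature channel. The patent text does not give the formula, so the sketch follows the commonly published SimAM formulation (an energy term based on each position's squared deviation from the channel mean, passed through a sigmoid). The regularization constant `e_lambda = 1e-4` and the function names are assumptions, and a plain nested list stands in for a real tensor.

```python
import math
from typing import List

Channel = List[List[float]]  # one feature channel as rows of activations


def simam_weights(channel: Channel, e_lambda: float = 1e-4) -> Channel:
    """Per-position attention weights for one channel (SimAM-style, parameter-free)."""
    values = [v for row in channel for v in row]
    if len(values) < 2:
        raise ValueError("SimAM needs at least two positions in the channel")
    n = len(values) - 1
    mean = sum(values) / len(values)
    squared_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    variance = sum(sum(row) for row in squared_dev) / n
    weights = []
    for row in squared_dev:
        # Positions that deviate more from the channel mean receive a larger weight.
        energies = [d / (4.0 * (variance + e_lambda)) + 0.5 for d in row]
        weights.append([1.0 / (1.0 + math.exp(-e)) for e in energies])
    return weights


def apply_simam(channel: Channel, e_lambda: float = 1e-4) -> Channel:
    """Scale each activation by its attention weight."""
    weights = simam_weights(channel, e_lambda)
    return [[v * w for v, w in zip(row, w_row)] for row, w_row in zip(channel, weights)]


# A small bright spot on a flat background, loosely mimicking a small target.
channel = [[0.1, 0.1, 0.1],
           [0.1, 0.9, 0.1],
           [0.1, 0.1, 0.1]]
weights = simam_weights(channel)
print(weights[1][1], weights[0][0])
```

In this toy channel the distinctive center position receives a larger weight than the background, which matches the stated goal of concentrating features on the region of the smaller target.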

[0032] In one embodiment, as shown in Figure 2, step S102, namely inputting the image to be trained into the first initial network of the initial detection model and outputting multiple multi-layer feature maps corresponding to the image to be trained, includes:
S201: Using the first vector convolution module and the first hybrid convolution module, perform first vector convolution processing, second vector convolution processing, and multi-scale convolution processing on the image to be trained to obtain the low-level feature map corresponding to the image to be trained.
S202: Using the second hybrid convolution module, perform vector convolution and multi-scale convolution on the low-level feature map to obtain the first mid-level feature map.
S203: Using the third hybrid convolution module, perform vector convolution and multi-scale convolution on the first mid-level feature map to obtain the second mid-level feature map.
S204: Using the fourth hybrid convolution module, the pooling module, and the attention mechanism module, perform vector convolution, multi-scale convolution, downsampling, and feature enhancement on the second mid-level feature map to obtain the high-level feature map.

[0033] The first vector convolution process refers to feature extraction performed on the image to be trained by the first vector convolution module. The second vector convolution process refers to feature extraction performed on the feature map output by the first vector convolution module by the vector convolution unit in the first hybrid convolution module. The multi-scale convolution process refers to multi-scale feature extraction performed on a feature map.

[0034] Here, the low-level feature map refers to the feature map corresponding to the low-level information. Low-level features refer to features containing the low-level information of an image. This low-level information includes rich positional information corresponding to each object in the image and coarse semantic information corresponding to each object. Semantic information includes, but is not limited to, visually apparent contours, edges, colors, textures, and shapes. Understandably, because the low-level feature map is large and has sufficient resolution, the positional information corresponding to each object in the low-level feature map is relatively rich. However, at this large size, the semantic information such as contours, edges, colors, textures, and shapes corresponding to each object in the image is still relatively coarse.

[0035] As an example, in step S201, the computer device inputs the image to be trained into the first vector convolution module of the first initial network and uses the first vector convolution module and the first hybrid convolution module, connected in sequence, to extract features from the image and obtain the corresponding low-level feature map. Specifically, the first vector convolution module performs a first vector convolution process on the image to be trained to obtain a first feature map. The vector convolution unit in the first hybrid convolution module performs a second vector convolution process on the first feature map to obtain a second feature map. The multi-scale convolution unit in the first hybrid convolution module then performs multi-scale convolution processing on the second feature map to obtain the low-level feature map. As can be seen from Figure 9, each hybrid convolution module includes a vector convolution unit and a multi-scale convolution unit connected in sequence. The vector convolution unit and the multi-scale convolution unit extract the features corresponding to the information contained in the image to be trained. After each vector convolution unit, the size of the feature map is halved and the number of feature channels doubles. In this example, because only two vector convolution units have been applied at this point, the extracted feature map is still relatively large and has sufficient resolution, so the low-level feature map retains the location information of the small targets to be detected. Furthermore, as can be seen from Figure 10, the multi-scale convolution unit (MSC) includes vector convolution units (Conv) with different kernel sizes and SimAM attention mechanism units. The vector convolution units with different kernel sizes can effectively extract the features corresponding to the multi-scale information of the smaller target to be detected, and the SimAM attention mechanism unit can concentrate the extracted features on the image region corresponding to that target. Applying the MSC to the first hybrid convolution module therefore effectively avoids the loss of low-level information corresponding to the smaller target to be detected, so that the low-level feature map contains more comprehensive low-level information about it.
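The reweighting performed by a SimAM attention unit can be illustrated with a minimal sketch. It follows the commonly published parameter-free SimAM formulation rather than anything stated in this text: the function name `simam`, the constant `e_lambda`, and the single-channel nested-list input are all assumptions made for illustration, and the wiring of the MSC in Figure 10 is not reproduced.

```python
import math

def simam(channel, e_lambda=1e-4):
    """Reweight one channel (a 2-D list of floats, at least 2 values).

    Positions that deviate strongly from the channel mean receive a larger
    weight. e_lambda is an assumed regularisation constant.
    """
    values = [v for row in channel for v in row]
    n = len(values) - 1                       # number of "other" positions
    mean = sum(values) / len(values)
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    variance = sum(d for row in sq_dev for d in row) / n
    out = []
    for row, dev_row in zip(channel, sq_dev):
        out_row = []
        for v, d in zip(row, dev_row):
            energy = d / (4.0 * (variance + e_lambda)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-energy))   # sigmoid
            out_row.append(v * weight)
        out.append(out_row)
    return out
```

In this sketch a single standout activation on a flat background keeps a larger share of its value than the background does, which is the sense in which the attention "concentrates" features on a small region.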

[0036] The first mid-level feature map refers to the feature map containing mid-level features obtained by extracting features from the low-level feature map. Mid-level features are features containing mid-level information of the image. This mid-level information includes relatively coarse positional information for each object in the image and relatively coarse semantic information such as contours, edges, colors, textures, and shapes for each object. The first mid-level feature map is half the size of the low-level feature map. Although it carries more semantic information (contours, edges, colors, textures, and shapes) than the low-level feature map, that information is still relatively coarse and requires further feature extraction. Its positional information for each object is also coarser than that in the low-level feature map.

[0037] As an example, in step S202, the computer device inputs the low-level feature map into the second hybrid convolution module of the first initial network and uses that module to perform mid-level feature extraction on the low-level feature map, obtaining the first mid-level feature map. Specifically, the vector convolution unit of the second hybrid convolution module extracts features from the low-level feature map, and the multi-scale convolution unit of the second hybrid convolution module performs multi-scale feature extraction on the resulting feature map to obtain the first mid-level feature map. As can be seen from Figure 9, the second hybrid convolution module includes a vector convolution unit and a multi-scale convolution unit connected in sequence. After the low-level feature map passes through the vector convolution unit, its size is halved; the multi-scale convolution unit then extracts the features corresponding to the mid-level information contained in the halved map to obtain the first mid-level feature map. Understandably, after the size of the low-level feature map is halved, the semantic information such as contours, edges, colors, textures, and shapes corresponding to each object in the resulting feature map increases, and the multi-scale convolution unit can extract the features corresponding to both this semantic information and the positional information. Furthermore, as can be seen from Figure 10, the SimAM attention mechanism unit in the multi-scale convolution unit (MSC) can concentrate the extracted multi-scale features on the image region corresponding to the smaller target. Applying the MSC to the second hybrid convolution module can therefore effectively avoid the loss of mid-level information corresponding to the smaller target when obtaining the first mid-level feature map.

[0038] The second mid-level feature map refers to the feature map obtained by extracting features from the first mid-level feature map. Compared with the first mid-level feature map, the second mid-level feature map is further reduced in size and contains more features corresponding to the semantic information of each object, such as contour, edge, color, texture, and shape.

[0039] As an example, in step S203, the computer device inputs the first mid-level feature map into the third hybrid convolution module of the first initial network and uses that module to perform mid-level feature extraction on the first mid-level feature map, obtaining the second mid-level feature map. Specifically, the vector convolution unit of the third hybrid convolution module extracts features from the first mid-level feature map, and the multi-scale convolution unit of the third hybrid convolution module performs multi-scale feature extraction on the resulting feature map to obtain the second mid-level feature map. As can be seen from Figure 9, the third hybrid convolution module includes a vector convolution unit and a multi-scale convolution unit. After the first mid-level feature map passes through the vector convolution unit, its size is halved; the multi-scale convolution unit then extracts the features corresponding to the mid-level information contained in the halved map to obtain the second mid-level feature map. Understandably, halving the size of the first mid-level feature map allows more accurate extraction of the semantic information such as contours, edges, colors, textures, and shapes corresponding to the objects it contains. As can be seen from Figure 10, the SimAM attention mechanism unit in the multi-scale convolution unit (MSC) can concentrate the extracted multi-scale features on the image region corresponding to the smaller target. Applying the MSC to the third hybrid convolution module can therefore effectively avoid the loss of mid-level information corresponding to the smaller target when obtaining the second mid-level feature map.

[0040] Downsampling refers to reducing the size of the feature map. Feature enhancement refers to enhancing the features corresponding to smaller targets. High-level feature maps are feature maps that include high-level features. High-level features refer to features corresponding to high-level semantic information in an image. High-level semantic information only contains visually apparent details corresponding to each object in the image, such as contours, edges, colors, textures, and shapes.

[0041] As an example, in step S204, the computer device inputs the second mid-level feature map into the fourth hybrid convolution module of the first initial network and uses the fourth hybrid convolution module, the pooling module, and the attention mechanism module, connected in sequence, to perform high-level feature extraction on the second mid-level feature map and obtain the high-level feature map. First, the vector convolution unit in the fourth hybrid convolution module performs vector convolution processing on the second mid-level feature map. Then the multi-scale convolution unit in the fourth hybrid convolution module performs multi-scale feature extraction on the feature map output by the vector convolution unit. Next, the pooling module downsamples the feature map output by the multi-scale convolution unit, further reducing its size and increasing the high-level semantic information corresponding to the smaller target to be detected, which yields a feature map rich in that information. Finally, the attention mechanism module performs feature enhancement processing on the feature map output by the pooling module and outputs the high-level feature map. Because the attention mechanism module enhances the high-level semantic information corresponding to the smaller target, the high-level feature map contains enhanced semantic information for that target, which facilitates its accurate detection. As can be seen from Figure 9, the pooling module in this example includes the Fast Spatial Pyramid Pooling (SPPF) module, and the attention mechanism module includes C2PSA.

[0042] In this embodiment, the first vector convolution module, multiple hybrid convolution modules, pooling module and attention mechanism module connected in sequence in the first initial network are used to extract multi-layer features for smaller targets to be detected, so as to obtain feature maps corresponding to feature information at different levels. This facilitates the subsequent fusion processing of feature maps corresponding to feature information at different levels, thereby improving the detection accuracy of smaller targets in the image to be detected.
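The size relationships among the four multi-layer feature maps can be tracked with a small shape calculation. This is only a sketch: the input size, the channel count `stem_channels`, and the function names are hypothetical, and the multi-scale, pooling, and attention units are assumed to leave the shape unchanged because the text gives no factor for the additional reduction it attributes to the pooling module.

```python
def vector_conv(shape):
    """Shape effect of one vector convolution unit: (C, H, W) -> (2C, H/2, W/2)."""
    c, h, w = shape
    return (2 * c, h // 2, w // 2)

def backbone_shapes(image_hw=(640, 640), stem_channels=16):
    """Return the (C, H, W) shapes of the four multi-layer feature maps.

    image_hw and stem_channels are hypothetical values. Multi-scale,
    pooling and attention units are assumed to be shape preserving.
    """
    h, w = image_hw
    # First vector convolution module: halves the image size.
    x = (stem_channels, h // 2, w // 2)
    shapes = {}
    for name in ("low_level", "first_mid_level",
                 "second_mid_level", "high_level"):
        x = vector_conv(x)      # vector conv unit of the next hybrid module
        shapes[name] = x
    return shapes
```

With the default values, the low-level map has passed through two vector convolutions and is one quarter of the image size per side, and each later map is half the size of the previous one.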

[0043] In one embodiment, the second initial network includes a plurality of upsampling fusion modules and a plurality of vector convolution fusion modules connected in sequence; each upsampling fusion module includes an upsampling unit, a concatenation unit, and a multi-scale convolution unit connected in sequence; the plurality of upsampling fusion modules are respectively a first upsampling fusion module, a second upsampling fusion module, and a third upsampling fusion module; each vector convolution fusion module includes a vector convolution unit, a concatenation unit, and a multi-scale convolution unit connected in sequence; the plurality of vector convolution fusion modules are respectively a first vector convolution fusion module and a second vector convolution fusion module.

[0044] As an example, as shown in Figure 9, the second initial network includes three upsampling fusion modules and two vector convolution fusion modules connected in sequence. The three upsampling fusion modules, starting from the input of the second initial network, are designated as the first, second, and third upsampling fusion modules. The first vector convolution fusion module is connected to the third upsampling fusion module, and the second vector convolution fusion module is connected to the first vector convolution fusion module. Each upsampling fusion module includes an upsampling unit, a concatenation unit, and a multi-scale convolution unit connected in sequence; each vector convolution fusion module includes a vector convolution unit, a concatenation unit, and a multi-scale convolution unit connected in sequence. To fuse the feature maps corresponding to multi-level information, the output of the concatenation unit of the first upsampling fusion module is connected to the input of the concatenation unit of the second vector convolution fusion module, and the output of the multi-scale convolution unit of the second upsampling fusion module is connected to the input of the concatenation unit of the first vector convolution fusion module.

[0045] In one embodiment, as shown in Figure 3, step S103, namely inputting the multiple multi-layer feature maps into the second initial network of the initial detection model and outputting multiple fused feature maps corresponding to the image to be trained, includes:
S301: Using the first upsampling fusion module, perform upsampling, concatenation, and multi-scale convolution processing on the second mid-level feature map and the high-level feature map to obtain the first initial fusion map output by the concatenation unit of the first upsampling fusion module and the second initial fusion map output by the multi-scale convolution unit of the first upsampling fusion module.
S302: Using the second upsampling fusion module, perform upsampling, concatenation, and multi-scale convolution processing on the first mid-level feature map and the second initial fusion map to obtain the third initial fusion map output by the multi-scale convolution unit of the second upsampling fusion module.
S303: Using the third upsampling fusion module, perform upsampling, concatenation, and multi-scale convolution processing on the low-level feature map and the third initial fusion map to obtain the first target fusion map output by the multi-scale convolution unit of the third upsampling fusion module.
S304: Using the first vector convolution fusion module, perform vector convolution, concatenation, and multi-scale convolution processing on the first target fusion map and the third initial fusion map to obtain the second target fusion map output by the multi-scale convolution unit of the first vector convolution fusion module.
S305: Using the second vector convolution fusion module, perform vector convolution, concatenation, and multi-scale convolution processing on the second target fusion map and the first initial fusion map to obtain the third target fusion map output by the multi-scale convolution unit of the second vector convolution fusion module.
The fused feature maps include the first target fusion map, the second target fusion map, and the third target fusion map.
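The data flow of steps S301 to S305 can be checked with a shape-only sketch. All function names here are hypothetical. The sketch assumes that each upsampling unit enlarges its input to the size of the map it is concatenated with, that the multi-scale convolution unit preserves shape, and that the vector convolution unit in the fusion modules halves the size while keeping the channel count; none of these details are fixed by the text.

```python
def upsample_to(shape, target):
    """Enlarge a (C, H, W) map to the height and width of `target`."""
    return (shape[0], target[1], target[2])

def concat(a, b):
    """Channel-wise concatenation; spatial sizes must already agree."""
    assert a[1:] == b[1:], "spatial sizes must match before concatenation"
    return (a[0] + b[0], a[1], a[2])

def msc(shape):
    """Multi-scale convolution unit, assumed shape preserving."""
    return shape

def halving_conv(shape):
    """Vector convolution unit, assumed to halve H and W and keep C."""
    c, h, w = shape
    return (c, h // 2, w // 2)

def fuse(low, mid1, mid2, high):
    first_initial = concat(mid2, upsample_to(high, mid2))                 # S301, concat output
    second_initial = msc(first_initial)                                   # S301, MSC output
    third_initial = msc(concat(mid1, upsample_to(second_initial, mid1)))  # S302
    first_target = msc(concat(low, upsample_to(third_initial, low)))      # S303
    second_target = msc(concat(third_initial, halving_conv(first_target)))   # S304
    third_target = msc(concat(first_initial, halving_conv(second_target)))   # S305
    return first_target, second_target, third_target
```

Under these assumptions the three target fusion maps come out at the sizes of the low-level, first mid-level, and second mid-level feature maps respectively, which is why each concatenation in S304 and S305 finds a partner of matching size.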

[0046] Here, the first initial fusion map refers to the feature fusion map output by the concatenation unit of the first upsampling fusion module. The second initial fusion map refers to the feature fusion map output by the multi-scale convolutional unit of the first upsampling fusion module. Upsampling processing refers to the process of enlarging the feature map. Concatenation processing refers to the process of concatenating the feature maps.

[0047] As an example, in step S301, the computer device inputs the high-level feature map output by the attention mechanism module in the first initial network into the upsampling unit of the first upsampling fusion module, which enlarges it. It also inputs the second mid-level feature map output by the multi-scale convolution unit of the third hybrid convolution module in the first initial network into the concatenation unit of the first upsampling fusion module. The concatenation unit concatenates the second mid-level feature map with the enlarged high-level feature map and outputs the first initial fusion map. The multi-scale convolution unit in the first upsampling fusion module then performs multi-scale feature extraction on the first initial fusion map and outputs the second initial fusion map. The second initial fusion map contains features from both the second mid-level feature map and the high-level feature map, so fusing these two maps fuses features corresponding to multiple layers of information.
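The upsampling and concatenation (stitching) operations of this step can be illustrated on small nested lists. This is a minimal sketch: nearest-neighbor interpolation with a factor of 2 is an assumption, since the text does not state the interpolation mode, and the function names are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Enlarge a feature map given as a list of channels (2-D lists).

    Each value is repeated `factor` times along both axes.
    """
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend([list(wide) for _ in range(factor)])
        out.append(rows)
    return out

def concat_channels(a, b):
    """Concatenate two feature maps along the channel axis."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("feature maps must share height and width")
    return a + b

# A 1-channel 2x2 "high-level" map fused with a 1-channel 4x4 "mid-level" map.
high = [[[1, 2], [3, 4]]]
mid = [[[0] * 4 for _ in range(4)]]
fused = concat_channels(mid, upsample_nearest(high))
```

The fused result has two channels of size 4x4: the first carries the mid-level values unchanged, and the second carries the enlarged high-level values.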

[0048] The third initial fusion map refers to the feature map output by the second upsampling fusion module of the second initial network.

[0049] As an example, in step S302, the computer device inputs the second initial fusion map output by the multi-scale convolution unit of the first upsampling fusion module into the upsampling unit of the second upsampling fusion module and inputs the first mid-level feature map into the concatenation unit of the second upsampling fusion module. The upsampling unit enlarges the second initial fusion map, and the enlarged second initial fusion map is input into the concatenation unit, which concatenates it with the first mid-level feature map to obtain a concatenated feature map. The multi-scale convolution unit of the second upsampling fusion module performs multi-scale feature extraction on the concatenated feature map and outputs the third initial fusion map. The third initial fusion map contains the features of the first mid-level feature map and the second initial fusion map, thus fusing these two maps and, with them, the features corresponding to multiple layers of information.

[0050] The first target fusion map refers to the fusion feature map output by the third upsampling fusion module, which is used as input to the third initial network to detect smaller targets.

[0051] As an example, in step S303, the computer device inputs the third initial fusion map into the upsampling unit of the third upsampling fusion module and inputs the low-level feature map into the concatenation unit of the third upsampling fusion module. The upsampling unit enlarges the third initial fusion map, the concatenation unit concatenates the enlarged third initial fusion map with the low-level feature map, and the multi-scale convolution unit performs multi-scale feature extraction on the concatenated feature map and outputs the first target fusion map, thereby fusing the third initial fusion map with the low-level feature map. In this example, the third initial fusion map is obtained by fusing the high-level feature map with the mid-level feature maps, so it carries the high-level semantic information and mid-level feature information of the image to be trained. Fusing it with the low-level feature map therefore combines the high-level semantic information of the smaller target to be detected (contour, edge, color, texture, and shape) with the low-level feature information that includes its position. As a result, the first target fusion map contains both kinds of information, which allows the third initial network to detect the smaller target more accurately when it processes the first target fusion map.

[0052] The second target fusion map refers to the fusion feature map output by the first vector convolution fusion module, which is used as input to the third initial network to detect smaller targets.

[0053] As an example, in step S304, the computer device inputs the first target fusion map into the vector convolution unit of the first vector convolution fusion module and inputs the third initial fusion map into the concatenation unit of the first vector convolution fusion module. The vector convolution unit performs vector convolution processing on the first target fusion map to obtain a feature map of half the size. The concatenation unit concatenates this half-size feature map with the third initial fusion map, and the multi-scale convolution unit performs multi-scale feature extraction on the concatenated feature map and outputs the second target fusion map, thereby fusing the first target fusion map with the third initial fusion map. In this example, both inputs contain high-level semantic information such as the contour, edge, color, texture, and shape of the smaller target to be detected, as well as its position information. The second target fusion map therefore also contains this position information and high-level semantic information, so that the smaller target can be accurately detected when the second target fusion map is subsequently processed.
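How a convolution can halve a feature map, as the vector convolution unit does here, can be sketched for a single channel. The sketch assumes that the vector convolution is an ordinary strided convolution with a 3x3 kernel, stride 2, and zero padding of 1; the text states only that the output is half the size, so these parameters and the function name are hypothetical.

```python
def conv2d_stride2(channel, kernel):
    """3x3 convolution with stride 2 and zero padding 1 on one 2-D channel.

    For even height and width the output is exactly half the input size.
    """
    h, w = len(channel), len(channel[0])
    out = []
    for i in range(0, h, 2):            # window centred on every second row
        row = []
        for j in range(0, w, 2):        # and every second column
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:   # zero padding outside
                        acc += channel[y][x] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out

identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
halved = conv2d_stride2(grid, identity)
```

With the identity kernel the 4x4 input is reduced to a 2x2 output that simply samples every second position, which makes the halving easy to verify by hand.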

[0054] The third target fusion map refers to the fusion feature map output by the second vector convolution fusion module, which is used as input to the third initial network to detect smaller targets.

[0055] As an example, in step S305, the computer device inputs the second target fusion map into the vector convolution unit of the second vector convolution fusion module and inputs the first initial fusion map into the concatenation unit of the second vector convolution fusion module. The vector convolution unit performs vector convolution processing on the second target fusion map and outputs a feature map of half the size. The concatenation unit concatenates the first initial fusion map with this half-size feature map, and the multi-scale convolution unit performs multi-scale feature extraction on the concatenated feature map and outputs the third target fusion map, thereby fusing the second target fusion map with the first initial fusion map. In this example, both inputs contain high-level semantic information such as the contour, edge, color, texture, and shape of the smaller target to be detected, as well as its position information. The third target fusion map therefore also contains this position information and high-level semantic information, so that the smaller target can be accurately detected when the third target fusion map is subsequently processed.

[0056] As can be seen from Figure 10, the SimAM attention mechanism unit included in the multi-scale convolution unit (MSC) can concentrate the extracted multi-scale features on the image region corresponding to the smaller target when performing multi-scale feature extraction on a feature map. Applying the MSC to the first, second, and third upsampling fusion modules and to the first and second vector convolution fusion modules can therefore effectively fuse the feature information corresponding to the smaller target, thereby improving the accuracy of its subsequent detection.

[0057] In this embodiment, multiple upsampling fusion modules and multiple vector convolution fusion modules in the second initial network are used to fuse the low-level feature map, the first mid-level feature map, the second mid-level feature map, and the high-level feature map. The output is a fused feature map that fuses the multi-layer feature information of the smaller target to be detected. This can effectively avoid the loss of the position information of the smaller target to be detected, as well as the high-level semantic information such as contour, edge, color, texture, and shape. This makes it possible to fully detect the position information of the smaller target to be detected, as well as the high-level semantic information such as contour, edge, color, texture, and shape, when detecting the smaller target to be detected, thereby improving the detection accuracy of the smaller target to be detected.

[0058] In one embodiment, the third initial network includes a first detection unit, a second detection unit, and a third detection unit; the predicted target detection box includes a first predicted detection box, a second predicted detection box, and a third predicted detection box.

[0059] As an example, as shown in Figure 9, the third initial network includes three detection units. The input of the first detection unit is connected to the output of the multi-scale convolution unit of the third upsampling fusion module of the second initial network and is used to detect the first target fusion map, thereby detecting the location region corresponding to the smaller target in the image to be detected. The input of the second detection unit is connected to the output of the multi-scale convolution unit of the first vector convolution fusion module of the second initial network and is used to detect the second target fusion map in the same way. The input of the third detection unit is connected to the output of the multi-scale convolution unit of the second vector convolution fusion module of the second initial network and is used to detect the third target fusion map in the same way.

[0060] In this context, the first predicted detection box refers to the location region corresponding to the smaller target in the image to be detected output by the first detection unit. The second predicted detection box refers to the location region corresponding to the smaller target in the image to be detected output by the second detection unit. The third predicted detection box refers to the location region corresponding to the smaller target in the image to be detected output by the third detection unit.

[0061] In one embodiment, as shown in Figure 4, step S104, namely inputting the multiple fused feature maps into the third initial network of the object detection model and outputting at least one predicted object detection box, includes:
S401: Using the first detection unit, perform detection processing on the first target fusion map and output at least one first predicted detection box.
S402: Using the second detection unit, perform detection processing on the second target fusion map and output at least one second predicted detection box.
S403: Using the third detection unit, perform detection processing on the third target fusion map and output at least one third predicted detection box.

[0062] As an example, in step S401, the computer device inputs the first target fusion map output by the third upsampling fusion module in the second initial network to the first detection unit of the third initial network, uses the first detection unit of the third initial network to perform detection processing on the first target fusion map, outputs the detection boxes corresponding to the regions of all smaller targets in the image to be detected, and determines the detection boxes corresponding to the regions of all smaller targets as the first predicted detection box corresponding to each smaller target.

[0063] As an example, in step S402, the computer device inputs the second target fusion map output by the first vector convolution fusion module in the second initial network to the second detection unit of the third initial network, and uses the second detection unit of the third initial network to perform detection processing on the second target fusion map, outputting the detection boxes corresponding to the regions of all smaller targets in the image to be detected, and determining the detection boxes corresponding to the regions of all smaller targets as the second predicted detection box corresponding to each smaller target.

[0064] As an example, in step S403, the computer device inputs the third target fusion map output by the second vector convolution fusion module in the second initial network to the third detection unit of the third initial network, and uses the third detection unit of the third initial network to perform detection processing on the third target fusion map, outputting the detection boxes corresponding to the regions of all smaller targets in the image to be detected, and determining the detection boxes corresponding to the regions of all smaller targets as the third predicted detection box corresponding to each smaller target.

[0065] In this embodiment, the three detection units of the third initial network are used to detect the first target fusion map, the second target fusion map, and the third target fusion map respectively, and to determine the first predicted detection box, the second predicted detection box, and the third predicted detection box. This allows the initial detection model to be updated based on the first, second, and third predicted detection boxes, thereby determining a target detection model that can accurately detect smaller targets.

[0066] In one embodiment, as shown in Figure 5, step S105, namely determining the target loss function value of the target detection model based on at least one predicted target detection box and the training target label box corresponding to each predicted target detection box, includes:
S501: Based on at least one first predicted detection box and the training target label box corresponding to each first predicted detection box, determine the first loss function value corresponding to the first detection unit.
S502: Based on at least one second predicted detection box and the training target label box corresponding to each second predicted detection box, determine the second loss function value corresponding to the second detection unit.
S503: Based on at least one third predicted detection box and the training target label box corresponding to each third predicted detection box, determine the third loss function value corresponding to the third detection unit.
S504: Determine the target loss function value of the target detection model based on the first loss function value, the second loss function value, and the third loss function value.
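Step S504 can be sketched as follows. The text says only that the target loss function value is determined "based on" the three per-unit values, so the weighted sum used here, the equal default weights, and the function name are assumptions for illustration rather than the method of this embodiment.

```python
def target_loss(first, second, third, weights=(1.0, 1.0, 1.0)):
    """Combine the three per-detection-unit loss values into one value.

    A weighted sum with equal default weights is an assumed combination
    rule; the source text does not specify how the values are combined.
    """
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third
```

Exposing the weights as a parameter keeps the sketch usable if one detection unit, for example the one fed by the largest fusion map, should contribute more to training.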

[0067] Here, the first loss function value refers to the loss function value corresponding to the first detection unit.

[0068] As an example, in step S501, the computer device acquires all the first predicted detection boxes output by the first detection unit and the training target label box corresponding to each first predicted detection box. For the i-th first predicted detection box, it determines the first height, the first width, and the x- and y-coordinates of the first center point, as well as the label height, the label width, and the x- and y-coordinates of the label center point of the corresponding training target label box. Based on each first predicted detection box and its corresponding training target label box, it then determines the Smooth L1 loss function value of each first predicted detection box and the NWD (Normalized Wasserstein Distance) loss function value between each first predicted detection box and its corresponding training target label box. The first loss function value corresponding to the first detection unit is computed from these two per-box values over all k first predicted detection boxes, where i indexes the first predicted detection boxes and k is the total number of first predicted detection boxes detected by the first detection unit.

[0069] The Smooth L1 loss function value corresponding to the i-th first predicted detection box is computed from the difference between that box and its corresponding training target label box.

[0070] The NWD loss function value corresponding to the i-th first predicted detection box takes the exponential form exp(·), whose argument is built from the Wasserstein distance between the i-th first predicted detection box and its corresponding training target label box and from a normalization parameter, typically set to k, the number of first predicted detection boxes. The Wasserstein distance is computed from the heights, widths, and center-point coordinates determined above.
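The two per-box terms described for the first detection unit can be illustrated with a minimal sketch. The patent's own formulas are not preserved in this text, so the code assumes the standard Smooth L1 definition (with a hypothetical threshold `beta`) for the smoothness-type loss, and the commonly published NWD form in which each box (cx, cy, w, h) is modeled as a 2D Gaussian. The function names and the parameter `c` are illustrative, and the way the two terms are combined into the unit's loss value is deliberately left out because the source does not show it.

```python
import math


def smooth_l1(x, beta=1.0):
    """Standard Smooth L1: quadratic for |x| < beta, linear otherwise.

    beta is a hypothetical threshold; the source does not state one.
    """
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def wasserstein_sq(pred, label):
    """Squared 2-Wasserstein distance between the Gaussians fitted to two boxes.

    Each box is (cx, cy, w, h). This is the commonly published form and is
    assumed here; it may differ from the patent's omitted formula.
    """
    cx1, cy1, w1, h1 = pred
    cx2, cy2, w2, h2 = label
    return (
        (cx1 - cx2) ** 2
        + (cy1 - cy2) ** 2
        + ((w1 - w2) / 2) ** 2
        + ((h1 - h2) / 2) ** 2
    )


def nwd(pred, label, c):
    """Normalized Wasserstein Distance in (0, 1]; 1 means identical boxes.

    c is the normalization parameter (the text says it is typically the
    number of predicted detection boxes of the unit).
    """
    return math.exp(-math.sqrt(wasserstein_sq(pred, label)) / c)
```

Because the distance depends on center offsets and size differences rather than on overlap area, the NWD value changes smoothly even when a small predicted box and its label box do not overlap at all, which is the property usually cited for small targets.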

[0071] The second loss function value refers to the loss function value corresponding to the second detection unit.

[0072] As an example, in step S502, the computer device acquires all the second predicted detection boxes output by the second detection unit and the training target label box corresponding to each second predicted detection box. For the i-th second predicted detection box, it determines the second height, the second width, and the x- and y-coordinates of the second center point, as well as the label height, the label width, and the x- and y-coordinates of the label center point of the corresponding training target label box. Based on each second predicted detection box and its corresponding training target label box, it then determines the Smooth L1 loss function value of each second predicted detection box and the NWD (Normalized Wasserstein Distance) loss function value between each second predicted detection box and its corresponding training target label box. The second loss function value corresponding to the second detection unit is computed from these two per-box values over all m second predicted detection boxes, where i indexes the second predicted detection boxes and m is the total number of second predicted detection boxes detected by the second detection unit.

[0073] The Smooth L1 loss function value corresponding to the i-th second predicted detection box is computed from the difference between that box and its corresponding training target label box.

[0074] The NWD loss function value corresponding to the i-th second predicted detection box takes the exponential form exp(·), whose argument is built from the Wasserstein distance between the i-th second predicted detection box and its corresponding training target label box and from a normalization parameter, typically set to m, the number of second predicted detection boxes. The Wasserstein distance is computed from the heights, widths, and center-point coordinates determined above.

[0075] The third loss function value refers to the loss function value corresponding to the third detection unit.

[0076] As an example, in step S503, the computer device acquires all the third predicted detection boxes output by the third detection unit and the training target label box corresponding to each third predicted detection box. For the i-th third predicted detection box, it determines the third height, the third width, and the x- and y-coordinates of the third center point, as well as the label height, the label width, and the x- and y-coordinates of the label center point of the corresponding training target label box. Based on each third predicted detection box and its corresponding training target label box, it then determines the Smooth L1 loss function value of each third predicted detection box and the NWD (Normalized Wasserstein Distance) loss function value between each third predicted detection box and its corresponding training target label box. The third loss function value corresponding to the third detection unit is computed from these two per-box values over all n third predicted detection boxes, where i indexes the third predicted detection boxes and n is the total number of third predicted detection boxes detected by the third detection unit.

[0077] The Smooth L1 loss function value corresponding to the i-th third predicted detection box is computed from the difference between that box and its corresponding training target label box.

[0078] The NWD loss function value corresponding to the i-th third predicted detection box takes the exponential form exp(·), whose argument is built from the Wasserstein distance between the i-th third predicted detection box and its corresponding training target label box and from a normalization parameter, typically set to n, the number of third predicted detection boxes. The Wasserstein distance is computed from the heights, widths, and center-point coordinates determined above.

[0079] As an example, in step S504, the computer device corrects the first loss function value using a first correction coefficient to obtain a first correction result, corrects the second loss function value using a second correction coefficient to obtain a second correction result, and corrects the third loss function value using a third correction coefficient to obtain a third correction result. The sum of the first, second, and third correction results is determined as the target loss function value of the initial detection model. That is, the target loss function value is $L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3$, where $L_1$, $L_2$, and $L_3$ are the first, second, and third loss function values, and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the first, second, and third correction coefficients applied to them, respectively.

[0080] In this example, $\lambda_1 = \lambda_2 = \lambda_3 = 1/3$. That is, the target loss function value is obtained by averaging the first loss function value corresponding to the first detection unit, the second loss function value corresponding to the second detection unit, and the third loss function value corresponding to the third detection unit.
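The combination in step S504 can be sketched as a weighted sum. This is a minimal illustration that assumes "correcting" a loss value means multiplying it by its correction coefficient, which is consistent with the averaging described in this example; the function name `target_loss` and its arguments are hypothetical.

```python
def target_loss(unit_losses, coefficients=None):
    """Weighted sum of per-detection-unit loss values.

    unit_losses: loss values of the first, second, and third detection units.
    coefficients: correction coefficients, one per unit. If omitted, equal
    weights 1/len(unit_losses) are used, which reduces to a plain average.
    """
    if coefficients is None:
        coefficients = [1.0 / len(unit_losses)] * len(unit_losses)
    if len(coefficients) != len(unit_losses):
        raise ValueError("need exactly one coefficient per unit loss")
    return sum(c * loss for c, loss in zip(coefficients, unit_losses))
```

Passing explicit coefficients makes it possible to weight one detection unit more heavily, for instance the unit fed by the highest-resolution fusion map, although the source only describes the equal-weight case.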

[0081] In this embodiment, the NWD loss function value can effectively reflect the similarity between the predicted detection box corresponding to a smaller target and the training target label box, and the Smooth L1 loss function value can accurately reflect the difference between the center point of the predicted detection box corresponding to a smaller target and the center point of the training target label box. The loss function value of each detection unit is determined based on the NWD loss function value between each predicted detection box and its corresponding training target label box and on the Smooth L1 loss function value, so that the target loss function value of the initial detection model is determined accurately. This makes it possible to judge precisely whether each detection unit can accurately identify smaller targets in the training image.

[0082] In another embodiment, as shown in Figure 6, a target detection method is provided, which uses a target detection model trained by the target detection model training method in the above embodiments to detect small targets in an image to be detected. Taking the application of this method to the computer device in Figure 8 as an example, the method includes the following steps: S601: Acquire the image to be detected; S602: Using the trained target detection model, perform target detection on the image to be detected, and obtain the predicted target detection box of at least one target to be detected corresponding to the image to be detected.

[0083] As an example, in step S601, the computer device acquires an image to be detected in which the targets to be detected are relatively small. In this example, the image can be acquired in different application scenarios, such as autonomous driving, security monitoring, and road and bridge defect detection.

[0084] As an example, in step S602, the computer device inputs the image to be detected into the target detection model trained by the target detection model training method in the above embodiments, and uses the target detection model to perform target detection on the smaller targets in the image to be detected, thereby obtaining a predicted target detection box of at least one smaller target corresponding to the image to be detected. In this example, the computer device inputs the image to be detected into the trained first initial network of the target detection model, which outputs the low-level feature map, the first mid-level feature map, the second mid-level feature map, and the high-level feature map corresponding to the image to be detected. These four feature maps are then input into the trained second initial network for feature fusion, yielding the first target fusion map, the second target fusion map, and the third target fusion map. The first target fusion map is input into the first detection unit of the trained third initial network, which outputs the first predicted detection box corresponding to each smaller target detected in the first target fusion map. The second target fusion map is input into the second detection unit of the trained third initial network, which outputs the second predicted detection box corresponding to each smaller target detected in the second target fusion map. The third target fusion map is input into the third detection unit of the trained third initial network, which outputs the third predicted detection box corresponding to each smaller target detected in the third target fusion map.
Understandably, the first target fusion map is output by the multi-scale convolution (MSC) unit of the third upsampling fusion module of the second initial network. Because it does not pass through any vector convolution unit in the second initial network, it is a larger feature map with higher resolution and contains more features corresponding to smaller targets, which improves detection accuracy. The second and third target fusion maps are obtained by further fusing feature maps carrying multiple layers of information through the vector convolution fusion modules. They contain rich multi-layer information about smaller targets, which likewise improves the detection accuracy for smaller targets.
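The inference flow of step S602 can be sketched as a composition of three stages. This is only a structural illustration: `detect`, `backbone`, `neck`, and `heads` are hypothetical names for the trained first, second, and third initial networks, and the toy stand-ins below return labels instead of real feature maps or boxes.

```python
def detect(image, backbone, neck, heads):
    """Run the three-stage pipeline and collect boxes from every detection unit.

    backbone(image)            -> (low, mid1, mid2, high) feature maps
    neck(low, mid1, mid2, high) -> (target1, target2, target3) fusion maps
    heads[i](fusion_map)        -> list of predicted boxes from detection unit i
    """
    low, mid1, mid2, high = backbone(image)
    fusion_maps = neck(low, mid1, mid2, high)
    boxes = []
    for head, fusion_map in zip(heads, fusion_maps):
        boxes.extend(head(fusion_map))
    return boxes


# Toy stand-ins that only demonstrate the wiring (not real networks).
def toy_backbone(image):
    return ("low", "mid1", "mid2", "high")


def toy_neck(low, mid1, mid2, high):
    return ("target1", "target2", "target3")


toy_heads = [
    lambda fusion_map, unit=unit: [(unit, fusion_map)]
    for unit in ("unit1", "unit2", "unit3")
]

result = detect("image", toy_backbone, toy_neck, toy_heads)
```

Each detection unit consumes exactly one fusion map, so the pairing by position in `zip` mirrors the first/second/third correspondence described in the paragraph.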

[0085] In this embodiment, a target detection model that can detect smaller targets with high accuracy is used to detect smaller targets in the image to be detected, and at least one predicted target detection box of a smaller target is obtained, which has high detection accuracy.

[0086] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0087] In one embodiment, a target detection model training device is provided, which corresponds one-to-one with the target detection model training method in the above embodiments. As shown in Figure 7, the target detection model training device includes a training image acquisition module 701, a multi-layer feature map acquisition module 702, a fusion feature map acquisition module 703, a predicted target detection box acquisition module 704, a target loss function value determination module 705, and a target detection model determination module 706. Detailed descriptions of each functional module are as follows: The training image acquisition module 701 is used to acquire the training image, which corresponds to at least one training target label box; the size of the training target label box is smaller than a first size threshold. The multi-layer feature map acquisition module 702 is used to input the image to be trained into the first initial network in the initial detection model and output multiple multi-layer feature maps corresponding to the image to be trained. The fusion feature map acquisition module 703 is used to input the multiple multi-layer feature maps into the second initial network in the initial detection model and output multiple fusion feature maps corresponding to the image to be trained. The predicted target detection box acquisition module 704 is used to input the multiple fusion feature maps into the third initial network in the initial detection model and output at least one predicted target detection box. The target loss function value determination module 705 is used to determine the target loss function value of the initial detection model based on at least one predicted target detection box and the training target label box corresponding to each predicted target detection box. The target detection model determination module 706 is used to determine the target detection model corresponding to the initial detection model based on the first initial network, the second initial network, and the third initial network when the target loss function value meets the preset convergence condition.
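How modules 705 and 706 interact can be sketched as a loop that stops once the loss satisfies a convergence condition. The text only speaks of a "preset convergence condition", so the condition used here (loss below a threshold) is an assumption, as are the names `train_until_converged`, `train_step`, `loss_threshold`, and `max_steps`.

```python
def train_until_converged(train_step, loss_threshold, max_steps):
    """Repeat training steps until the target loss meets the assumed condition.

    train_step: callable that runs one forward/backward pass and returns the
        target loss function value as a float (hypothetical interface).
    Returns (steps_run, last_loss, converged).
    """
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss = train_step()
        if loss < loss_threshold:  # assumed form of the convergence condition
            return step, loss, True
    return max_steps, loss, False


# Toy usage: a scripted sequence of loss values stands in for real training.
scripted_losses = iter([0.9, 0.5, 0.2, 0.05])
outcome = train_until_converged(lambda: next(scripted_losses), 0.1, 10)
```

When the loop reports convergence, the current weights of the three networks would be kept as the target detection model; the `max_steps` cap is an added safeguard that the source does not mention.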

[0088] In one embodiment, the multi-layer feature map acquisition module 702 includes: The low-level feature map acquisition submodule is used to perform first vector convolution processing, second vector convolution processing, and multi-scale convolution processing on the image to be trained using the first vector convolution module and the first hybrid convolution module to obtain the low-level feature map corresponding to the image to be trained. The first middle-layer feature map acquisition submodule is used to perform vector convolution and multi-scale convolution on the low-level feature map using the second hybrid convolution module to obtain the first middle-layer feature map. The second middle-layer feature map acquisition submodule is used to perform vector convolution and multi-scale convolution on the first middle-layer feature map using the third hybrid convolution module to obtain the second middle-layer feature map. The high-level feature map acquisition submodule is used to perform vector convolution processing, multi-scale convolution processing, downsampling processing, and feature enhancement processing on the second middle-layer feature map using the fourth hybrid convolution module, the pooling module, and the attention mechanism module to obtain the high-level feature map.

[0089] In one embodiment, the fusion feature map acquisition module 703 includes: The first fusion submodule is used to perform upsampling, concatenation, and multi-scale convolution processing on the second middle-layer feature map and the high-level feature map using the first upsampling fusion module, so as to obtain the first initial fusion map output by the concatenation unit of the first upsampling fusion module and the second initial fusion map output by the multi-scale convolution unit of the first upsampling fusion module. The second fusion submodule is used to perform upsampling, concatenation, and multi-scale convolution processing on the first middle-layer feature map and the second initial fusion map using the second upsampling fusion module, so as to obtain the third initial fusion map output by the multi-scale convolution unit of the second upsampling fusion module. The third fusion submodule is used to perform upsampling, concatenation, and multi-scale convolution processing on the low-level feature map and the third initial fusion map using the third upsampling fusion module, so as to obtain the first target fusion map output by the multi-scale convolution unit of the third upsampling fusion module. The fourth fusion submodule is used to perform vector convolution processing, concatenation processing, and multi-scale convolution processing on the first target fusion map and the third initial fusion map using the first vector convolution fusion module, so as to obtain the second target fusion map output by the multi-scale convolution unit of the first vector convolution fusion module. The fifth fusion submodule is used to perform vector convolution processing, concatenation processing, and multi-scale convolution processing on the second target fusion map and the first initial fusion map using the second vector convolution fusion module, so as to obtain the third target fusion map output by the multi-scale convolution unit of the second vector convolution fusion module.
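The wiring of the five fusion submodules can be checked with a shape-only sketch that tracks (channels, height, width) tuples instead of real tensors. Several things here are assumptions not stated in the source: upsampling doubles the spatial size, a vector convolution unit halves it, the smaller of the two inputs is the one that gets resampled, and the multi-scale convolution unit maps to a fixed channel count `ch`. All function names are hypothetical.

```python
def upsample(shape):
    """Assumed 2x spatial upsampling; channels unchanged."""
    c, h, w = shape
    return (c, h * 2, w * 2)


def vector_conv(shape):
    """Assumed stride-2 convolution; channels unchanged."""
    c, h, w = shape
    return (c, h // 2, w // 2)


def concat(a, b):
    """Channel-wise concatenation; spatial sizes must already match."""
    assert a[1:] == b[1:], f"spatial mismatch: {a} vs {b}"
    return (a[0] + b[0], a[1], a[2])


def msc(shape, out_channels):
    """Multi-scale convolution unit, assumed to keep the spatial size."""
    return (out_channels, shape[1], shape[2])


def fuse(low, mid1, mid2, high, ch=64):
    """Trace shapes through the three upsampling and two vector-conv fusions."""
    init1 = concat(upsample(high), mid2)            # first initial fusion map
    init2 = msc(init1, ch)                          # second initial fusion map
    init3 = msc(concat(upsample(init2), mid1), ch)  # third initial fusion map
    target1 = msc(concat(upsample(init3), low), ch)
    target2 = msc(concat(vector_conv(target1), init3), ch)
    target3 = msc(concat(vector_conv(target2), init1), ch)
    return target1, target2, target3


shapes = fuse((64, 80, 80), (64, 40, 40), (64, 20, 20), (64, 10, 10))
```

Under these assumptions the first target fusion map keeps the resolution of the low-level feature map, while the second and third are each half the size of the previous one, which matches the statement that the first target fusion map is the largest.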

[0090] In one embodiment, the predicted target detection box acquisition module 704 includes: The first predicted detection box acquisition unit is used to perform detection processing on the first target fusion map using the first detection unit and output at least one first predicted detection box. The second prediction detection box acquisition unit is used to perform detection processing on the second target fusion map using the second detection unit, and output at least one second prediction detection box. The third prediction detection box acquisition unit is used to perform detection processing on the third target fusion map using the third detection unit, and output at least one third prediction detection box.

[0091] In one embodiment, the target loss function value determination module 705 includes: The first loss function value determination submodule determines the first loss function value corresponding to the first detection unit based on at least one first predicted detection box and the training target label box corresponding to each first predicted detection box. The second loss function value determination submodule determines the second loss function value corresponding to the second detection unit based on at least one second predicted detection box and the training target label box corresponding to each second predicted detection box. The third loss function value determination submodule determines the third loss function value corresponding to the third detection unit based on at least one third predicted detection box and the training target label box corresponding to each third predicted detection box. The target loss function value determination submodule determines the target loss function value of the target detection model based on the first loss function value, the second loss function value, and the third loss function value.

[0092] In another embodiment, a target detection device is provided, which corresponds to the target detection method in the above embodiments. The target detection device includes a target image acquisition module and a detection module. Detailed descriptions of each functional module are as follows: The target image acquisition module is used to acquire the image to be detected. The detection module is used to perform target detection on the image to be detected using the trained target detection model, and obtain the predicted target detection box of at least one target to be detected corresponding to the image to be detected.

[0093] Specific limitations regarding the target detection model training device and the target detection device can be found in the limitations regarding the target detection model training method and the target detection method above, and will not be repeated here. Each module in the aforementioned target detection model training device and target detection device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0094] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 8. The computer device includes a processor, a memory, a network interface, and a database connected via a system bus. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores the operating system, computer programs, and the database. The internal memory provides an environment for running the operating system and the computer programs stored in the non-volatile storage medium. The database stores data used or generated during the execution of the target detection model training method and the target detection method. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a target detection model training method or a target detection method.

[0095] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the target detection model training method described in the above embodiments, for example, steps S101-S106 shown in Figure 1, or the steps shown in Figures 2 to 5, which are not described again here to avoid repetition. Alternatively, when the processor executes the computer program, it implements the target detection method in the above embodiments, for example, steps S601-S602 shown in Figure 6, which are not described again here to avoid repetition. Alternatively, when the processor executes the computer program, it implements the functions of each module/unit in the above embodiment of the target detection model training device, for example, the functions of the training image acquisition module 701, multi-layer feature map acquisition module 702, fusion feature map acquisition module 703, predicted target detection box acquisition module 704, target loss function value determination module 705, and target detection model determination module 706 shown in Figure 7, which are not described again here to avoid repetition. Alternatively, when the processor executes the computer program, it implements the functions of each module/unit in the above embodiment of the target detection device, such as the functions of the target image acquisition module and the detection module, which are likewise not described again here.

[0096] In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program implements the target detection model training method described in the above embodiments, for example, steps S101-S106 shown in Figure 1, or the steps shown in Figures 2 to 5, which are not described again here to avoid repetition. Alternatively, when the computer program is executed by the processor, it implements the target detection method in the above embodiments, for example, steps S601-S602 shown in Figure 6, which are not described again here to avoid repetition. Alternatively, when the computer program is executed by the processor, it implements the functions of each module/unit in the above embodiment of the target detection model training device, for example, the functions of the training image acquisition module 701, multi-layer feature map acquisition module 702, fusion feature map acquisition module 703, predicted target detection box acquisition module 704, target loss function value determination module 705, and target detection model determination module 706 shown in Figure 7, which are not described again here to avoid repetition. Alternatively, when the computer program is executed by the processor, it implements the functions of each module/unit in the above embodiment of the target detection device, such as the functions of the target image acquisition module and the detection module, which are likewise not described again here.

[0097] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0098] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.

[0099] The above-described embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.

Claims

1. A target detection model training method, characterized in that it includes: obtaining a training image, wherein the training image corresponds to at least one training target label box, and the size of the training target label box is smaller than a first size threshold; inputting the image to be trained into the first initial network in the initial detection model, and outputting multiple multi-layer feature maps corresponding to the image to be trained; inputting the multiple multi-layer feature maps into the second initial network in the initial detection model, and outputting multiple fused feature maps corresponding to the image to be trained; inputting the multiple fused feature maps into the third initial network in the initial detection model, and outputting at least one predicted target detection box; determining the target loss function value of the initial detection model based on the at least one predicted target detection box and the training target label box corresponding to each predicted target detection box; and, when the target loss function value satisfies the preset convergence condition, determining the target detection model corresponding to the initial detection model based on the first initial network, the second initial network, and the third initial network.

2. The target detection model training method according to claim 1, characterized in that, The first initial network includes a first vector convolution module, multiple hybrid convolution modules, a pooling module, and an attention mechanism module connected in sequence; each of the hybrid convolution modules includes a vector convolution unit and a multi-scale convolution unit connected in sequence, and the multiple hybrid convolution modules are a first hybrid convolution module, a second hybrid convolution module, a third hybrid convolution module, and a fourth hybrid convolution module. The step of inputting the image to be trained into the first initial network of the initial detection model and outputting multiple multi-layer feature maps corresponding to the image to be trained includes: The first vector convolution module and the first hybrid convolution module are used to perform first vector convolution processing, second vector convolution processing and multi-scale convolution processing on the image to be trained to obtain the low-level feature map corresponding to the image to be trained. The second hybrid convolution module is used to perform vector convolution and multi-scale convolution on the low-level feature map to obtain the first middle-level feature map; The third hybrid convolution module is used to perform vector convolution and multi-scale convolution on the first middle-layer feature map to obtain the second middle-layer feature map. The fourth hybrid convolution module, pooling module, and attention mechanism module are used to perform vector convolution, multi-scale convolution, downsampling, and feature enhancement on the second middle-layer feature map to obtain the high-level feature map.

3. The target detection model training method according to claim 2, characterized in that the second initial network includes multiple upsampling fusion modules and multiple vector convolution fusion modules connected in sequence; each upsampling fusion module includes an upsampling unit, a concatenation unit, and a multi-scale convolution unit connected in sequence, and the multiple upsampling fusion modules are respectively a first upsampling fusion module, a second upsampling fusion module, and a third upsampling fusion module; each vector convolution fusion module includes a vector convolution unit, a concatenation unit, and a multi-scale convolution unit connected in sequence, and the multiple vector convolution fusion modules are respectively a first vector convolution fusion module and a second vector convolution fusion module; The process of inputting multiple multi-layer feature maps into the second initial network of the initial detection model and outputting multiple fused feature maps corresponding to the image to be trained includes: The first upsampling fusion module is used to perform upsampling, concatenation, and multi-scale convolution on the second middle-layer feature map and the high-level feature map to obtain the first initial fusion map output by the concatenation unit of the first upsampling fusion module and the second initial fusion map output by the multi-scale convolution unit of the first upsampling fusion module; The second upsampling fusion module is used to perform upsampling, concatenation, and multi-scale convolution on the first middle-layer feature map and the second initial fusion map to obtain the third initial fusion map output by the multi-scale convolution unit of the second upsampling fusion module; The third upsampling fusion module is used to perform upsampling, concatenation, and multi-scale convolution on the low-level feature map and the third initial fusion map to obtain the first target fusion map output by the multi-scale convolution unit of the third upsampling fusion module; The first vector convolution fusion module is used to perform vector convolution processing, concatenation processing, and multi-scale convolution processing on the first target fusion map and the third initial fusion map to obtain the second target fusion map output by the multi-scale convolution unit of the first vector convolution fusion module; The second vector convolution fusion module is used to perform vector convolution processing, concatenation processing, and multi-scale convolution processing on the second target fusion map and the first initial fusion map to obtain the third target fusion map output by the multi-scale convolution unit of the second vector convolution fusion module; The fused feature maps include the first target fusion map, the second target fusion map, and the third target fusion map.

4. The target detection model training method according to claim 3, characterized in that the third initial network includes a first detection unit, a second detection unit, and a third detection unit; the predicted target detection box includes a first predicted detection box, a second predicted detection box, and a third predicted detection box. The process of inputting multiple fused feature maps into the third initial network in the initial detection model and outputting at least one predicted target detection box includes: The first detection unit is used to perform detection processing on the first target fusion map, and at least one first predicted detection box is output; The second detection unit is used to perform detection processing on the second target fusion map, and at least one second predicted detection box is output; The third detection unit is used to perform detection processing on the third target fusion map, and at least one third predicted detection box is output.

5. The target detection model training method according to claim 4, characterized in that the step of determining the target loss function value of the initial detection model based on at least one predicted target detection box and the training target label box corresponding to each predicted target detection box includes: determining the first loss function value corresponding to the first detection unit based on at least one first predicted detection box and the training target label box corresponding to each first predicted detection box; determining the second loss function value corresponding to the second detection unit based on at least one second predicted detection box and the training target label box corresponding to each second predicted detection box; determining the third loss function value corresponding to the third detection unit based on at least one third predicted detection box and the training target label box corresponding to each third predicted detection box; and determining the target loss function value of the initial detection model based on the first loss function value, the second loss function value, and the third loss function value.
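
One way to read the per-unit loss combination is sketched below. The claim only says the target loss is determined "based on" the three per-unit values; the weighted sum, the unit weights, and the use of a mean (1 - IoU) box loss are assumptions made for illustration, as are all function names.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def unit_loss(pairs: Sequence[Tuple[Box, Box]]) -> float:
    """Loss for one detection unit: mean (1 - IoU) over (predicted, label) pairs.

    Assumption: the source does not name the per-unit loss function.
    """
    if not pairs:
        return 0.0
    return sum(1.0 - iou(pred, label) for pred, label in pairs) / len(pairs)


def target_loss(unit_pairs: Sequence[Sequence[Tuple[Box, Box]]],
                weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the first, second, and third loss values as a weighted sum.

    Assumption: a weighted sum is one plausible combination, not the claimed one.
    """
    if len(unit_pairs) != len(weights):
        raise ValueError("need one weight per detection unit")
    return sum(w * unit_loss(p) for w, p in zip(weights, unit_pairs))
```

Keeping the per-unit losses separate before combining them makes it possible to weight the unit that sees the highest-resolution fusion map more heavily, if small targets are the priority.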

6. A target detection method, characterized in that it comprises: acquiring an image to be detected; and performing target detection on the image to be detected using the target detection model trained by the target detection model training method according to any one of claims 1 to 5, thereby obtaining a predicted target detection box for at least one target to be detected corresponding to the image to be detected.

7. A target detection model training device, characterized in that it comprises: a training image acquisition module, used to acquire a training image, wherein the training image corresponds to at least one training target label box, and the size of the training target label box is smaller than a first size threshold; a multi-layer feature map acquisition module, used to input the training image into the first initial network in the initial detection model and output multiple multi-layer feature maps corresponding to the training image; a fusion feature map acquisition module, used to input the multiple multi-layer feature maps into the second initial network in the initial detection model and output multiple fused feature maps corresponding to the training image; a predicted target detection box acquisition module, used to input the multiple fused feature maps into the third initial network in the initial detection model and output at least one predicted target detection box; a target loss function value determination module, used to determine the target loss function value of the initial detection model based on the at least one predicted target detection box and the training target label box corresponding to each predicted target detection box; and a target detection model determination module, used to determine the target detection model corresponding to the initial detection model based on the first initial network, the second initial network, and the third initial network when the target loss function value satisfies the preset convergence condition.
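
Two conditions in this claim lend themselves to a short sketch: the size limit on training target label boxes and the preset convergence condition. The source defines neither precisely, so the code below assumes that "size" means the longer side of the box and that convergence means the loss has stopped changing by more than a tolerance over several consecutive steps. The names `is_small_box` and `has_converged`, the tolerance, and the patience value are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def is_small_box(box: Box, size_threshold: float) -> bool:
    """True when the longer side of the label box is below the threshold.

    Assumption: the source does not say whether "size" is a side length or an area.
    """
    width = box[2] - box[0]
    height = box[3] - box[1]
    return max(width, height) < size_threshold


def has_converged(loss_history: Sequence[float],
                  tol: float = 1e-3, patience: int = 3) -> bool:
    """True when the last `patience` consecutive loss changes are all below `tol`.

    Assumption: this is one common convergence test, not the claimed condition.
    """
    if len(loss_history) < patience + 1:
        return False
    recent = loss_history[-(patience + 1):]
    return all(abs(recent[i + 1] - recent[i]) < tol for i in range(patience))
```

In a training loop, the target loss function value from each iteration would be appended to `loss_history`, and training would stop, with the three networks taken as the target detection model, once `has_converged` returns true.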

8. A target detection device, characterized in that it comprises: an image acquisition module, used to acquire an image to be detected; and a detection module, used to perform target detection on the image to be detected using a trained target detection model and to obtain a predicted target detection box for at least one target to be detected corresponding to the image to be detected.

9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the target detection model training method according to any one of claims 1 to 5, or the target detection method according to claim 6.

10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, it implements the target detection model training method according to any one of claims 1 to 5, or the target detection method according to claim 6.