A target detection method and device, electronic equipment and storage medium

By extracting candidate image patches from infrared images and performing super-resolution processing, selecting target sampling points, and iteratively training the target detection model, the problem of low target detection accuracy in infrared images is solved, and efficient target detection is achieved.

CN121811022B · Active · Publication Date: 2026-06-26 · ZHEJIANG LAB

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG LAB
Filing Date
2026-03-09
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Infrared images have low target detection accuracy. Existing super-resolution processing methods are time-consuming and wasteful of computational resources, and background enhancement reduces the distinction between the target and the background.

Method used

Candidate image patches are extracted from infrared images, target sampling points are selected based on feature maps, super-resolution processing is performed, and the target detection model is iteratively trained. The model is then optimized using super-resolution image patches.

Benefits of technology

It saves computing resources, shortens training time, and improves the efficiency and accuracy of target detection, especially the ability to extract features from the target's neighborhood.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a target detection method and device, electronic equipment and storage medium, and relates to the technical field of computers. A pre-trained specified model is used to detect a target from an original image and extract a candidate image block containing the target from the original image; a feature map of the candidate image block is determined based on pixel values of each pixel in the candidate image block; a target sampling point is selected in the candidate image block according to the feature map, and a target image block containing the target sampling point is extracted from the original image; a target detection model is used to perform super-resolution processing on the target image block, that is, an image adjacent to the target in the original image, to obtain a super-resolution image block, thereby saving computing resources and improving detection efficiency. The target detection model is iteratively trained using the super-resolution image block, the feature extraction capability of the model for images in a region adjacent to the target is improved, and the accuracy of target detection is improved.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a target detection method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Infrared sensors generate infrared images based on the thermal radiation of objects, without relying on external light sources, and are therefore widely used in scenarios such as disaster detection, field search and rescue, and night patrols. However, due to limitations in the hardware and shooting distance of infrared sensors, infrared images often suffer from low resolution, low signal-to-noise ratio, and complex backgrounds. Especially in scenarios like disaster detection, field search and rescue, and night patrols, targets often occupy only a few pixels in infrared images, making them more easily obscured by the background (areas other than the target) and noise. This makes it difficult for target detection models (i.e., deep learning-based neural network models, such as the YOLO model) to accurately detect targets from infrared images, resulting in low target detection accuracy.

[0003] Currently, super-resolution techniques are typically introduced before or during target detection in target detection models. By performing super-resolution processing on infrared images, the number of pixels occupied by the target in the infrared image is increased, thereby improving the accuracy of target detection.

[0004] However, super-resolution processing of the entire infrared image has several drawbacks. First, it requires a significant amount of processing time, resulting in low target detection efficiency. Second, it requires substantial computational resources, with the majority of these resources wasted on super-resolution processing of the background. Third, while enhancing the target, the background is also enhanced, which reduces the distinction between the target and the background, making it difficult for the target detection model to detect targets from the infrared image and affecting the accuracy of target detection.

[0005] Therefore, there is an urgent need to provide a target detection method that can save computational resources and improve the accuracy and efficiency of target detection.

Summary of the Invention

[0006] In view of this, this application provides a target detection method, the method comprising:

[0007] Obtain a training dataset for training the object detection model; the training dataset includes original images;

[0008] The target is detected from the original image using a pre-trained specified model, and candidate image patches containing the target are extracted from the original image.

[0009] Based on the pixel values of each pixel in the candidate image block, a feature map of the candidate image block is determined; the feature map includes at least one of a gradient saliency map for characterizing gradient differences between the pixels, a texture heterogeneity map for characterizing texture differences between the pixels, and a local contrast map for characterizing grayscale differences between the pixels.

[0010] Based on the feature map, target sampling points are selected within the candidate image blocks; and target image blocks containing the target sampling points are extracted from the original image.

[0011] The target image patch is super-resolution processed using the target detection model to obtain a super-resolution image patch;

[0012] The target detection model is iteratively trained using the super-resolution image patch until the training termination condition is met, resulting in the final target detection model.

[0013] Optionally, selecting target sampling points within the candidate image block based on the feature map includes:

[0014] A sampling grid is constructed on the candidate image block, and the grid points of the sampling grid are used as initial sampling points; the spacing between adjacent grid points in the sampling grid is a preset spacing.

[0015] The feature map is subjected to specified processing to obtain a probability map corresponding to the candidate image block; the specified processing includes at least normalization processing; the probability map is composed of probability values that correspond one-to-one with the pixels.

[0016] The probability value of the probability map at the initial sampling point is taken as the target probability value;

[0017] Generate standard Gaussian random numbers as offset coefficients for the initial sampling points;

[0018] The target offset distance of the initial sampling point is the product of the target probability value, the offset coefficient, and the preset unit offset distance; the unit offset distance is less than the preset spacing.

[0019] The initial sampling point is moved by the target offset distance along the specified direction to obtain the target sampling point.

[0020] Optionally, after moving the initial sampling point by the target offset distance along the specified direction to obtain the target sampling point, the method further includes:

[0021] Determine the total number of sampling points for the target sampling point;

[0022] The product of the preset coefficient and the total number of sampling points is used as the supplementary number of sampling points;

[0023] From the candidate image block, select several first supplementary sampling points;

[0024] The probability value of the probability map at the first supplementary sampling point is taken as the first supplementary probability value;

[0025] Among the first supplementary sampling points, the supplementary sampling points whose first supplementary probability value is greater than the preset probability value are taken as the target sampling points to be added.
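The supplementary sampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the function name `supplement_by_threshold`, the uniform random draw of candidate points, and the default values of `coeff` and `min_prob` are all hypothetical, since the text only says that several first supplementary sampling points are selected and then filtered by a preset probability value.

```python
import random


def supplement_by_threshold(prob_map, num_target_points, coeff=0.5,
                            min_prob=0.6, seed=0):
    """Return extra sampling points whose probability exceeds min_prob.

    prob_map is a 2D list of probabilities in [0, 1]; points are (row, col).
    """
    rng = random.Random(seed)
    height, width = len(prob_map), len(prob_map[0])
    # Supplementary count = preset coefficient * total number of target points.
    num_extra = int(coeff * num_target_points)
    # Assumption: candidates are drawn uniformly at random inside the block.
    candidates = [(rng.randrange(height), rng.randrange(width))
                  for _ in range(num_extra)]
    # Keep only candidates whose probability is above the preset value.
    return [(r, c) for r, c in candidates if prob_map[r][c] > min_prob]
```

Because the filter can reject candidates, the number of points actually added may be smaller than the supplementary count.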

[0026] Optionally, after moving the initial sampling point by the target offset distance along the specified direction to obtain the target sampling point, the method further includes:

[0027] Obtain a pre-constructed mapping relationship; the mapping relationship is the correspondence between the probability value corresponding to the sampling point and the sampling density; the probability value is positively correlated with the sampling density; the sampling density is the density of sampling points in an area centered on the sampling point with a preset length and a preset width;

[0028] Based on the mapping relationship, the target sampling density corresponding to the target probability value is determined;

[0029] Determine the actual sampling density of the target area where the target sampling point is located;

[0030] If the actual sampling density is less than the target sampling density, then the sampling interval is determined based on the target sampling density; according to the sampling interval, a second supplementary sampling point is selected in the target area as an additional target sampling point.
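The density-based supplement above might look like the sketch below. The linear probability-to-density mapping, its bounds `d_min` and `d_max`, the square-lattice rule for turning a density into a sampling interval, and all function names are assumptions made for illustration; the application only requires that the mapping be positively correlated and pre-constructed.

```python
import math


def target_density(prob, d_min=0.001, d_max=0.04):
    """Hypothetical linear mapping from probability to points per square pixel."""
    return d_min + prob * (d_max - d_min)


def actual_density(points, center, length, width):
    """Density of existing points in the length x width area around center."""
    cy, cx = center
    inside = [p for p in points
              if abs(p[0] - cy) <= length / 2 and abs(p[1] - cx) <= width / 2]
    return len(inside) / (length * width)


def supplement_by_density(points, center, prob, length=10, width=10):
    """Return second supplementary points if the area is under-sampled."""
    wanted = target_density(prob)
    if actual_density(points, center, length, width) >= wanted:
        return []  # already dense enough, nothing to add
    # Assumption: a square lattice with spacing s has density 1 / s**2.
    interval = 1.0 / math.sqrt(wanted)
    cy, cx = center
    extra = []
    y = cy - length / 2
    while y <= cy + length / 2:
        x = cx - width / 2
        while x <= cx + width / 2:
            extra.append((y, x))
            x += interval
        y += interval
    return extra
```

The sketch does not remove lattice points that coincide with existing sampling points; a real implementation would probably deduplicate them.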

[0031] Optionally, after moving the initial sampling point by the target offset distance along the specified direction to obtain the target sampling point, the method further includes:

[0032] Based on the target probability value, a perturbation coefficient is determined; the perturbation coefficient is negatively correlated with the target probability value.

[0033] The product of the disturbance coefficient and the preset unit disturbance distance is taken as the target disturbance distance; the unit disturbance distance is less than the preset spacing.

[0034] The target sampling point is moved by the target perturbation distance along the specified direction to obtain the final target sampling point.
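A minimal sketch of the perturbation step above, assuming the perturbation coefficient is simply `1 - probability` and that the specified direction is given as an angle. Both choices, as well as the name `perturb_point`, are hypothetical; the application only states that the coefficient is negatively correlated with the target probability value.

```python
import math


def perturb_point(point, prob, unit_dist=2.0, angle=0.0):
    """Move a (row, col) point; low-probability points move farther."""
    # Assumption: coefficient = 1 - probability (negatively correlated).
    coeff = 1.0 - prob
    # Target perturbation distance = coefficient * unit perturbation distance.
    distance = coeff * unit_dist
    row, col = point
    return (row + distance * math.sin(angle), col + distance * math.cos(angle))
```

With this choice a point with probability 1 stays in place, while a point with probability 0 moves by the full unit perturbation distance, which must itself be smaller than the grid spacing.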

[0035] Optionally, the target detection model includes a backbone network, a neck network, and a detection head; the backbone network is used to extract features from the original image to generate original feature maps at different scales; the neck network is used to perform feature fusion on the original feature maps; the detection head is used to generate target detection results based on the feature-fused original feature maps; the target detection results include a predicted bounding box and a confidence score indicating the presence of a target within the predicted bounding box.

[0036] The step of iteratively training the target detection model using the super-resolution image patch until the training termination condition is met, and obtaining the final target detection model, includes:

[0037] Obtain a pre-constructed joint loss function; the joint loss function includes an object detection loss term, a confidence loss term, and a super-resolution reconstruction loss term; wherein, the object detection loss term is determined based on the degree of overlap between the predicted bounding box and the pre-labeled ground truth bounding box; the confidence loss term is the difference between the confidence score of the current training and the confidence score of the previous training; the super-resolution reconstruction loss term is the Euclidean distance between the super-resolution image patch and the target image patch;

[0038] Based on the joint loss function, the target detection model is trained in segments until the joint loss function reaches its minimum value, thus obtaining the final target detection model.
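The three loss terms described above could be combined as in the following sketch. Several details are assumptions rather than statements from the application: the detection term is taken as `1 - IoU`, the terms are combined as a weighted sum with hypothetical `weights`, and the two patches passed to the reconstruction term are assumed to have already been brought to the same size.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_loss(pred_box, gt_box, conf_now, conf_prev, sr_patch, ref_patch,
               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of detection, confidence and reconstruction terms."""
    # Detection term: shrinks as the predicted box overlaps the ground truth.
    detection = 1.0 - iou(pred_box, gt_box)
    # Confidence term: change in confidence between consecutive training rounds.
    confidence = abs(conf_now - conf_prev)
    # Reconstruction term: Euclidean distance between equally sized 2D patches.
    reconstruction = math.sqrt(sum(
        (a - b) ** 2
        for row_a, row_b in zip(sr_patch, ref_patch)
        for a, b in zip(row_a, row_b)))
    w_det, w_conf, w_rec = weights
    return w_det * detection + w_conf * confidence + w_rec * reconstruction
```

In practice these terms would be computed on tensors inside a deep learning framework so that gradients can flow back to the model parameters; plain lists are used here only to keep the example self-contained.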

[0039] Optionally, the super-resolution enhancement module includes a neighbor integration module, a multi-scale global context module, and a super-resolution module; the step of using the target detection model to perform super-resolution processing on the target image patch to obtain a super-resolution image patch includes:

[0040] Obtain the original feature map; and crop out the candidate feature map corresponding to the candidate image patch and the target feature map corresponding to the target image patch from the original feature map;

[0041] The candidate feature map and the original feature map are input into the neighbor integration module to enhance the original feature map and obtain an enhanced feature map; a local feature map corresponding to the candidate image block is cropped from the enhanced feature map; the local feature map and the candidate feature map are fused to obtain a first feature map;

[0042] The candidate feature map and the target feature map are input into the multi-scale global context module so that the candidate feature map and the target feature map are fused to obtain a second feature map;

[0043] The first feature map and the second feature map are input into the super-resolution module so that the first feature map and the second feature map are fused to obtain a third feature map, and the third feature map is subjected to feature extraction and upsampling to obtain the super-resolution image block.

[0044] This application also provides a target detection device, the device comprising:

[0045] The dataset acquisition module is used to acquire a training dataset for training the object detection model; the training dataset includes original images;

[0046] The candidate image patch determination module is used to detect targets from the original image using a pre-trained specified model, and to extract candidate image patches containing the targets from the original image;

[0047] The feature map determination module is used to determine the feature map of the candidate image block based on the pixel values of each pixel in the candidate image block; the feature map includes at least one of a gradient saliency map for characterizing the gradient difference between each pixel, a texture heterogeneity map for characterizing the texture difference between each pixel, and a local contrast map for characterizing the grayscale difference between each pixel.

[0048] The target image block determination module is used to select target sampling points within the candidate image blocks based on the feature map; and to extract target image blocks containing the target sampling points from the original image.

[0049] The super-resolution module is used to perform super-resolution processing on the target image patch using the target detection model to obtain a super-resolution image patch.

[0050] The training module is used to iteratively train the target detection model using the super-resolution image patch until the training termination condition is met, and the final target detection model is obtained.

[0051] This application also provides an electronic device, which includes:

[0052] Memory, used to store computer programs;

[0053] A processor for implementing the steps of any of the above target detection methods when executing the computer program.

[0054] This application also provides a storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described target detection methods.

[0055] In summary, this application provides a method, apparatus, electronic device, and storage medium for object detection. First, candidate image patches containing the target are extracted from the original image. Based on the pixel values of each pixel in the candidate image patch, a feature map of the candidate image patch is determined. Then, according to the feature map, target sampling points are selected within the candidate image patch, and a target image patch containing the target sampling points is extracted from the original image. The target image patch is then subjected to super-resolution processing using an object detection model to obtain a super-resolution image patch. The target image patch is a local image patch of the target's neighborhood, and its size is smaller than the entire original image. Super-resolution processing of the target image patch requires less processing time, thus saving computational resources; it also shortens the training time of the object detection model, accelerates training speed, and improves object detection efficiency. Finally, the object detection model is iteratively trained using the super-resolution image patch to improve its feature extraction capability, especially its ability to extract features from the target's neighborhood in the image, making the object detection model more sensitive to target features and thus improving the accuracy of object detection.

Attached Figure Description

[0056] Figure 1 is a flowchart of a target detection method provided in this application;

[0057] Figure 2 is a schematic diagram of the structure of a target detection model and a super-resolution enhancement module provided in this application;

[0058] Figure 3 is a schematic diagram of the principle of generating a second feature map using the multi-scale global context module provided in this application;

[0059] Figure 4 is a schematic diagram of the principle of generating a first feature map using the neighbor integration module provided in this application;

[0060] Figure 5 is a schematic diagram of the structure of a target detection device provided in this application.

Detailed Implementation

[0061] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.

[0062] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0063] Please refer to Figure 1, which is a flowchart of a target detection method provided in this application. The method includes:

[0064] S101. Obtain the training dataset for training the object detection model; the training dataset includes the original images.

[0065] The aforementioned object detection model is a deep learning-based neural network model used to detect objects from input images and generate object detection results. The object detection results include at least a predicted bounding box used to label the location of the object in the image, and a confidence score indicating the presence of the object within the predicted bounding box. This application does not impose any specific limitations on the specific structure of the object detection model.

[0066] This application does not impose any particular limitation on the original image. For example, infrared images acquired by infrared sensors mounted on remote sensing platforms (such as drones, aircraft, and satellites) can be used as the original images.

[0067] S102. Detect the target from the original image using a pre-trained specified model, and extract candidate image patches containing the target from the original image.

[0068] This application does not specifically limit the type of the specified model mentioned above; for example, a pre-trained YOLO model can be used as the specified model. Please refer to Figure 2, which is a schematic diagram of the structure of an object detection model and a super-resolution enhancement module provided in this application. The specified model is used to detect objects in the original image and determine the location of the objects in the original image, so as to extract candidate image patches containing the objects from the original image. A candidate image patch is thus a local region of the original image in the neighborhood of an object.

[0069] This application does not specifically limit the method for extracting candidate image patches. As an optional embodiment, a specified predicted bounding box output by a specified model is obtained; images located within the specified predicted bounding box are extracted from the original image to obtain candidate image patches. Alternatively, the image patch size (including length and width) of the candidate image patch is preset; images centered on the target and with a size equal to the image patch size are extracted from the original image to obtain candidate image patches. Of course, ground truth bounding boxes pre-annotated to the original image can also be directly used as candidate image patches.
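The two extraction options above (cropping the predicted bounding box, or cropping a preset-size patch centered on the target) can be sketched as follows. The function names, the `(x1, y1, x2, y2)` box convention, and the choice to shift a window back inside the image near the borders are assumptions for illustration; images are represented as 2D lists to keep the example self-contained.

```python
def crop_box(image, box):
    """Crop the region inside an (x1, y1, x2, y2) box from a 2D list image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def crop_centered(image, center, size):
    """Crop a (height, width) patch centered on (row, col), kept inside the image."""
    img_h, img_w = len(image), len(image[0])
    patch_h, patch_w = size
    row, col = center
    # Assumption: shift the window back inside the image near the borders.
    top = min(max(row - patch_h // 2, 0), max(img_h - patch_h, 0))
    left = min(max(col - patch_w // 2, 0), max(img_w - patch_w, 0))
    return [r[left:left + patch_w] for r in image[top:top + patch_h]]
```

The same centered crop could also serve for extracting target image patches around target sampling points in a later step.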

[0070] S103. Based on the pixel values of each pixel in the candidate image block, determine the feature map of the candidate image block; the feature map includes at least one of the following: a gradient saliency map for characterizing the gradient difference between each pixel, a texture heterogeneity map for characterizing the texture difference between each pixel, and a local contrast map for characterizing the grayscale difference between each pixel.

[0071] In a candidate image patch, the target's neighborhood typically exhibits significant changes in grayscale or color, particularly at the target's edges and contours, while the background region usually shows smooth transitions in grayscale and color. Based on this, this application determines the gradient saliency map of the candidate image patch based on the pixel values of each pixel within the patch. In the gradient saliency map, pixels in the target's neighborhood have high pixel values, meaning the target's neighborhood is highlighted, while pixels in the background region have low pixel values, meaning the background region is relatively dark. Therefore, the gradient saliency map can distinguish between the target and the background in the candidate image patch, providing a basis for subsequent sampling of the target's neighborhood.

[0072] This application does not specifically limit the method for determining the gradient saliency map. As an optional embodiment, the gradient value of each pixel is determined using the Sobel operator or the Roberts operator; for each pixel, the mean and the variance of the gradient values within the local window containing that pixel are determined; the product of this local gradient mean and local gradient variance is taken as the gradient saliency value of that pixel; the gradient saliency values of all pixels are combined to obtain the gradient saliency map, the size of which is consistent with that of the candidate image patch.
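The optional embodiment above can be sketched in plain Python as follows. The Sobel variant is used, with the gradient magnitude as the per-pixel gradient value; the replicate border handling, the 3x3 default window, and the function names are assumptions, and a real implementation would normally use an array or image library rather than nested lists.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def _pixel(img, r, c):
    """Read a pixel with replicate padding at the borders (an assumption)."""
    h, w = len(img), len(img[0])
    return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]


def sobel_magnitude(img):
    """Gradient magnitude of a 2D list image using 3x3 Sobel kernels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = gy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    v = _pixel(img, r + dr, c + dc)
                    gx += SOBEL_X[dr + 1][dc + 1] * v
                    gy += SOBEL_Y[dr + 1][dc + 1] * v
            out[r][c] = math.hypot(gx, gy)
    return out


def gradient_saliency(img, win=3):
    """Per-pixel saliency = local gradient mean * local gradient variance."""
    grad = sobel_magnitude(img)
    h, w = len(img), len(img[0])
    k = win // 2
    sal = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [_pixel(grad, r + dr, c + dc)
                    for dr in range(-k, k + 1) for dc in range(-k, k + 1)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            sal[r][c] = mean * var
    return sal
```

The output has the same height and width as the input patch, matching the requirement that the saliency map be the same size as the candidate image patch.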

[0073] In a candidate image patch, the target's neighborhood typically exhibits complex textures, while the background region usually has homogeneous textures. Based on this, this application determines the texture heterogeneity map of a candidate image patch based on the pixel values of each pixel. The pixel value of each pixel in the texture heterogeneity map represents the degree of texture heterogeneity, or texture complexity, of the local region where that pixel resides. Pixels in the target's neighborhood have larger pixel values, while pixels in the background region have smaller pixel values, thus distinguishing the target from the background in the candidate image patch and providing a basis for subsequent sampling of the target's neighborhood. Specifically, the texture heterogeneity map can be calculated using the LBP (Local Binary Pattern) operator, which will not be elaborated upon here.
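Since the text leaves the LBP computation unspecified, the following is only one possible sketch. It computes a basic 8-neighbor LBP code per pixel and then, as an assumed heterogeneity measure, counts how many distinct codes occur in a local window. The heterogeneity measure, the replicate border handling, and the function names are hypothetical choices, not the application's method.

```python
NEIGHBORS = ((-1, -1), (-1, 0), (-1, 1), (0, 1),
             (1, 1), (1, 0), (1, -1), (0, -1))


def _pixel(img, r, c):
    """Read a pixel with replicate padding at the borders (an assumption)."""
    h, w = len(img), len(img[0])
    return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]


def lbp_codes(img):
    """Basic 8-neighbor LBP: one bit per neighbor that is >= the center."""
    h, w = len(img), len(img[0])
    codes = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            code = 0
            for bit, (dr, dc) in enumerate(NEIGHBORS):
                if _pixel(img, r + dr, c + dc) >= img[r][c]:
                    code |= 1 << bit
            codes[r][c] = code
    return codes


def texture_heterogeneity(img, win=3):
    """Assumed measure: share of distinct LBP codes in a win x win window."""
    if win < 3 or win % 2 == 0:
        raise ValueError("win must be an odd integer >= 3")
    codes = lbp_codes(img)
    h, w = len(img), len(img[0])
    k = win // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            distinct = {_pixel(codes, r + dr, c + dc)
                        for dr in range(-k, k + 1) for dc in range(-k, k + 1)}
            # 0.0 for a uniform window, 1.0 when every code differs.
            out[r][c] = (len(distinct) - 1) / (win * win - 1)
    return out
```

Other summaries of the local LBP distribution, such as the entropy of a code histogram, would fit the same description equally well.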

[0074] In the candidate image patch, the gray values of pixels within the target region differ significantly from those of surrounding pixels, while the gray values of pixels in the background region are relatively similar. Based on this, this application determines the local contrast map of the candidate image patch based on the pixel values of each pixel in the candidate image patch. The local contrast map characterizes the degree of difference between the gray value of a pixel and the gray values of its neighboring pixels. In the local contrast map, the pixel values of pixels in the target's neighboring region are large, while the pixel values of pixels in the background region are small, thus distinguishing the target from the background in the candidate image patch and providing a basis for subsequent sampling of the target's neighboring region. Specifically, the local contrast map can be obtained by calculating the local standard deviation of the GLCM (Gray Level Co-occurrence Matrix) of each pixel in the candidate image patch, which will not be elaborated upon here.
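As a simplified stand-in for the GLCM-based computation mentioned above, the sketch below takes the standard deviation of the gray values in a window around each pixel. This is a swapped-in technique chosen for brevity, not the GLCM procedure itself; the window size, replicate border handling, and the name `local_contrast` are assumptions.

```python
import math


def local_contrast(img, win=3):
    """Local standard deviation of gray values in a win x win window."""
    h, w = len(img), len(img[0])
    k = win // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Replicate padding at the borders (an assumption).
            vals = [img[min(max(r + dr, 0), h - 1)][min(max(c + dc, 0), w - 1)]
                    for dr in range(-k, k + 1) for dc in range(-k, k + 1)]
            mean = sum(vals) / len(vals)
            out[r][c] = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return out
```

A uniform background yields zero contrast everywhere, while a small bright target produces high values in its immediate neighborhood, which is the behavior the paragraph relies on.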

[0075] In summary, the feature maps of candidate image patches can reflect the differences between the target's neighboring region and the background region, thereby distinguishing the target and the background in the candidate image patch and providing a basis for subsequent targeted sampling of the target's neighboring region.

[0076] S104. Based on the feature map, select target sampling points within the candidate image blocks; and extract the target image blocks containing the target sampling points from the original image.

[0077] This application selects target sampling points within candidate image blocks based on feature maps. On one hand, this is equivalent to selecting target sampling points only within a local area of the original image containing the target. On the other hand, it can also distinguish between the target's neighborhood and the background region of the candidate image block based on the feature map, thereby selecting a large number of target sampling points in the target's neighborhood and a small number in the background region. This achieves dense sampling in the target's neighborhood and sparse sampling in the background region, i.e., adaptive sampling. The specific method for selecting target sampling points within candidate image blocks based on feature maps will be described in subsequent embodiments and will not be elaborated here.

[0078] After the target sampling points are determined, the target image patch containing the target sampling points is extracted from the original image (see the yellow box in Figure 2). As an optional embodiment, the target image patch size (including the target image patch length and the target image patch width) is preset, and an image centered on the target sampling point with a size equal to the target image patch size is extracted from the original image as the target image patch. The target image patch obtained in this application is a local image patch of the target's neighboring region, rather than the entire original image; compared to the original image, the size of the target image patch is smaller.

[0079] This application does not impose a specific limitation on the size of the target image patch. For example, considering that targets with a width of less than 32 pixels and a length of less than 32 pixels are usually defined as small targets, when using a target detection model to detect small targets, in order to ensure that the target image patch can cover the small target, the size of the target image patch can be set to two to three times the size of the small target. The specific setting can be determined according to actual needs.

[0080] S105. Use the target detection model to perform super-resolution processing on the target image patch to obtain a super-resolution image patch.

[0081] The purpose of super-resolution processing on target image patches is to increase the number of pixels and detail information of the target image patches, obtaining high-resolution super-resolution image patches. This allows the target detection model to be more sensitive to image details during training, thereby improving the accuracy of target detection. The specific implementation of super-resolution processing on target image patches using the target detection model will be described in detail in subsequent embodiments, and will not be elaborated here.

[0082] Since the target image patch is a local image patch of the target's neighborhood, rather than the entire original image, its size is smaller compared to the entire original image. Therefore, the super-resolution processing of the target image patch requires less processing time, thus saving computational resources, shortening the training time of the object detection model, accelerating training speed, and improving object detection efficiency.

[0083] S106. Using super-resolution image patches, iteratively train the target detection model until the training termination condition is met, and obtain the final target detection model.

[0084] After obtaining the super-resolution image patches, the object detection model is iteratively trained using these patches. Specifically, a super-resolution reconstruction loss function can be constructed using the super-resolution image patches and the target image patches; a detection loss function can be constructed using the target detection results output by the object detection model and the pre-labeled real detection results; a joint loss function is constructed based on the super-resolution reconstruction loss function and the detection loss function; the parameters of the object detection model are adjusted until the joint loss function reaches its minimum value, thus obtaining the final object detection model. The specific process of iteratively training the object detection model using super-resolution image patches will be explained in subsequent embodiments and will not be elaborated here.

[0085] Iterative training of the target detection model using super-resolution image patches can improve the model's feature extraction capabilities, especially its ability to extract features from the target's neighborhood in the image. This makes the target detection model more sensitive to the target's features, thereby improving the accuracy of target detection.

[0086] In summary, this application provides an object detection method. First, candidate image patches containing the target are extracted from the original image. Based on the pixel values of each pixel in the candidate image patch, a feature map of the candidate image patch is determined. Then, according to the feature map, target sampling points are selected within the candidate image patch, and target image patches containing the target sampling points are extracted from the original image. The target image patch is then subjected to super-resolution processing using an object detection model to obtain a super-resolution image patch. The target image patch is a local image patch of the target's neighborhood, and its size is smaller than the entire original image. Super-resolution processing of the target image patch requires less processing time, thus saving computational resources; it also shortens the training time of the object detection model, accelerates training speed, and improves object detection efficiency. Finally, the object detection model is iteratively trained using the super-resolution image patch to improve its feature extraction capability, especially its ability to extract features from the target's neighborhood in the image, making the object detection model more sensitive to target features and thus improving the accuracy of object detection.

[0087] Based on the above embodiments:

[0088] The process of selecting target sampling points within candidate image blocks based on feature maps is explained below.

[0089] As an optional embodiment, selecting target sampling points within candidate image blocks based on feature maps includes:

[0090] A sampling grid is constructed on the candidate image patch, and the grid points of the sampling grid are used as the initial sampling points; the spacing between adjacent grid points in the sampling grid is a preset spacing.

[0091] The feature map is subjected to specified processing to obtain the probability map corresponding to the candidate image patch; the specified processing includes at least normalization processing; the probability map is composed of probability values that correspond one-to-one with each pixel.

[0092] The probability value of the probability map at the initial sampling point is used as the target probability value;

[0093] Generate standard Gaussian random numbers as offset coefficients for the initial sampling points;

[0094] The target offset distance of the initial sampling point is the product of the target probability value, the offset coefficient, and the preset unit offset distance; the unit offset distance is less than the preset spacing.

[0095] Move the initial sampling point by the target offset distance along the specified direction to obtain the target sampling point.

[0096] As mentioned earlier, in gradient saliency maps, texture heterogeneity maps, and local contrast maps, pixels in the target's neighborhood have large pixel values, while pixels in the background region have small pixel values. By performing specified processing (at least normalization) on the feature maps of candidate image patches, a probability map is obtained. The probability value of a pixel in the probability map is positively correlated with the likelihood that the pixel is the target. The process of generating the probability map is explained in detail below.

[0097] If the feature map is a gradient saliency map, a texture heterogeneity map, or a local contrast map, then the feature map is normalized, that is, the pixel value of each pixel in the feature map is normalized to the interval [0, 1] to obtain the probability map.

[0098] If the feature map includes any two or all of the gradient saliency map, texture heterogeneity map, and local contrast map, then each feature map is first normalized; then, each feature map is weighted and summed using its corresponding preset weights to obtain an initial probability map; finally, the initial probability map is normalized to obtain the aforementioned probability map. This embodiment does not impose any particular limitation on the values of the preset weights; the preset weights can be set according to the contribution of the gradient saliency map, texture heterogeneity map, and local contrast map to the probability map.

[0099] For example, the feature map includes a gradient saliency map Sgrad1, a texture heterogeneity map Stexture1, and a local contrast map Scontrast1. Sgrad1, Stexture1, and Scontrast1 are normalized to obtain Sgrad2, Stexture2, and Scontrast2, respectively. Using the preset weights w1 (for the gradient saliency map), w2 (for the texture heterogeneity map), and w3 (for the local contrast map), Sgrad2, Stexture2, and Scontrast2 are weighted and summed to obtain the initial probability map Pinitial = w1 * Sgrad2 + w2 * Stexture2 + w3 * Scontrast2, where w1 = 0.4, w2 = 0.3, and w3 = 0.3. Optionally, Gaussian filtering can be applied to the initial probability map Pinitial to suppress noise and highlight potential targets. Finally, the initial probability map Pinitial is normalized to obtain the probability map Pmap.
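
The fusion described in this example can be sketched as follows. This is a minimal illustration rather than the implementation of this application: it assumes min-max scaling as the normalization, omits the optional Gaussian filtering, and the function names `min_max_normalize` and `fuse_probability_map` are hypothetical.

```python
import numpy as np

def min_max_normalize(feature_map: np.ndarray) -> np.ndarray:
    """Scale values to [0, 1]; a constant map becomes all zeros (assumed convention)."""
    fmap = feature_map.astype(np.float64)
    span = fmap.max() - fmap.min()
    if span == 0:
        return np.zeros_like(fmap)
    return (fmap - fmap.min()) / span

def fuse_probability_map(s_grad, s_texture, s_contrast, weights=(0.4, 0.3, 0.3)):
    """Normalize each feature map, take the weighted sum, then normalize again."""
    w1, w2, w3 = weights
    p_initial = (w1 * min_max_normalize(s_grad)
                 + w2 * min_max_normalize(s_texture)
                 + w3 * min_max_normalize(s_contrast))
    return min_max_normalize(p_initial)

# Synthetic 8 x 8 feature maps stand in for Sgrad1, Stexture1 and Scontrast1.
rng = np.random.default_rng(0)
s_grad = rng.random((8, 8)) * 255.0
s_texture = rng.random((8, 8)) * 10.0
s_contrast = rng.random((8, 8))
p_map = fuse_probability_map(s_grad, s_texture, s_contrast)
```

Because each input is normalized before the weighted sum, the weights express relative contribution regardless of the original value range of each feature map.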

[0100] Based on this, this embodiment first constructs a sampling grid on the candidate image block, and uses the grid points of the sampling grid as initial sampling points. The spacing between adjacent grid points in the sampling grid is a preset spacing, that is, each initial sampling point is evenly distributed on the candidate image block.

[0101] The aforementioned preset spacing can be set according to actual needs. For example, targets with a width and length of less than 32 pixels are typically defined as small targets. The width of the small target can be used as the target width, and three times the target width can be defined as the preset spacing. This ensures comprehensive sampling of candidate image patches without significantly increasing the number of initial sampling points. Simultaneously, this preset spacing provides offset space for subsequent positional shifting of the initial sampling points based on probability values.

[0102] In this embodiment, the probability value of the probability map at the initial sampling point is used as the target probability value; a standard Gaussian random number is generated for each initial sampling point as the offset coefficient; the product of the target probability value, the offset coefficient, and the unit offset distance is used as the target offset distance of the initial sampling point. The unit offset distance is less than the preset spacing, and its specific value can be set according to actual needs, for example, 0.75 times the preset spacing. It can be seen that the larger the target probability value of the initial sampling point, the larger the magnitude of the target offset distance; the smaller the target probability value of the initial sampling point, the smaller the magnitude of the target offset distance.

[0103] Based on this, the initial sampling point is moved by the target offset distance along a specified direction to obtain the target sampling point. This embodiment does not specifically limit the specified direction; for example, a standard coordinate system can be established with the lower left corner of the candidate image patch as the origin. In this standard coordinate system, the positive direction of the horizontal axis is taken as the first specified direction, and the positive direction of the vertical axis is taken as the second specified direction. Based on this, the initial sampling point is moved by the target offset distance along the first specified direction and then by the target offset distance along the second specified direction to obtain the target sampling point.

[0104] Let the coordinates of the initial sampling point in the standard coordinate system be $(x_0, y_0)$, the target probability value be $p$, the unit offset distance be $d$, and the standard Gaussian random number be $\varepsilon \sim \mathcal{N}(0, 1)$. The target offset distance of the initial sampling point is then $\Delta = p \cdot \varepsilon \cdot d$. Offsetting the initial sampling point along the first and second specified directions gives the coordinates of the target sampling point, $(x_0 + \Delta,\ y_0 + \Delta)$.

[0105] As can be seen, in the candidate image patch, the target offset distance of the initial sampling points in the target's neighborhood, i.e., the high-probability region, is large. Therefore, after moving the initial sampling points by the target offset distance, the resulting target sampling points deviate from the grid points of the initial grid and are densely distributed in the high-probability region, thus achieving dense sampling in the high-probability region. In the low-probability region, the target offset distance of the initial sampling points is small, and the resulting target sampling points are basically kept near the grid points of the initial grid and sparsely distributed in the low-probability region, thus achieving sparse sampling in the low-probability region.

[0106] Based on this, when extracting target image patches containing target sampling points from the original image, the target image patches will be densely distributed in high-probability regions and sparsely distributed in low-probability regions. That is, more target image patches are obtained in the target's vicinity and fewer target image patches are obtained in the background region, saving the computational resources and time required for subsequent super-resolution processing and improving the accuracy of target detection.
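
The grid construction and probability-weighted offset described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the array row index is used as the vertical coordinate (instead of a lower-left origin), the same offset is applied along both axes, and clipping the result to the patch bounds is an addition not stated in the text. The function name `select_target_points` is hypothetical.

```python
import numpy as np

def select_target_points(p_map, spacing, unit_offset, rng):
    """Return (grid points, target sampling points) as (x, y) rows."""
    if not unit_offset < spacing:
        raise ValueError("unit offset distance must be less than the preset spacing")
    h, w = p_map.shape
    # Initial sampling points: a regular grid with the preset spacing.
    ys, xs = np.meshgrid(np.arange(0, h, spacing), np.arange(0, w, spacing), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    # Target probability value at each initial sampling point.
    p = p_map[ys.ravel(), xs.ravel()]
    # One standard Gaussian random number per point as the offset coefficient.
    eps = rng.standard_normal(len(grid))
    delta = p * eps * unit_offset
    # Move each point by delta along x and then along y.
    targets = grid + delta[:, None]
    # Assumption: keep shifted points inside the candidate image patch.
    targets[:, 0] = np.clip(targets[:, 0], 0, w - 1)
    targets[:, 1] = np.clip(targets[:, 1], 0, h - 1)
    return grid, targets

# A 64 x 64 probability map with one high-probability square in the middle.
p_map = np.zeros((64, 64))
p_map[24:40, 24:40] = 1.0
grid, targets = select_target_points(p_map, spacing=8, unit_offset=6.0,
                                     rng=np.random.default_rng(0))
```

In this toy example the unit offset is 0.75 times the spacing, matching the example value given above; points with probability 0 stay exactly on the grid, while points inside the square are displaced.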

[0107] As an optional embodiment, after moving the initial sampling point by the target offset distance along a specified direction to obtain the target sampling point, the method further includes:

[0108] Determine the total number of target sampling points;

[0109] The product of the preset coefficient and the total number of sampling points is used as the supplementary sampling point number;

[0110] In the candidate image patch, select several first supplementary sampling points;

[0111] The probability value of the probability map at the first supplementary sampling point is taken as the first supplementary probability value;

[0112] Among the first supplementary sampling points, the supplementary sampling points whose first supplementary probability value is greater than the preset probability value are taken as the added target sampling points.

[0113] As can be seen from the above embodiments, the total number of target sampling points is equal to the total number of grid points in the sampling grid. If the total number of sampling points is insufficient (e.g., less than the preset number of sampling points), additional sampling points can be added. Simultaneously, it is necessary to ensure that the additional sampling points are densely distributed in the target's vicinity and sparsely distributed in the background area.

[0114] First, determine the number of additional sampling points required. Specifically, the product of the total number of target sampling points and a preset coefficient is used as the number of additional sampling points. The preset coefficient can be set according to actual needs, for example, within the range of 0.3 to 0.5.

[0115] Subsequently, several first supplementary sampling points are selected from the candidate image blocks. This embodiment does not impose any particular restrictions on the rules for selecting the first supplementary sampling points. For example, several first supplementary sampling points may be randomly selected from the candidate image blocks; or the spacing between the supplementary sampling points may be determined based on the size of the candidate image block and the number of supplementary sampling points; and the first supplementary sampling points may be uniformly selected from the candidate image blocks according to the spacing between the supplementary sampling points.

[0116] In this embodiment, after obtaining the first supplementary sampling point, the probability value of the probability map at the first supplementary sampling point is used as the first supplementary probability value. The first supplementary probability value is positively correlated with the probability that the first supplementary sampling point belongs to the target's neighborhood, so as to determine whether to accept the first supplementary sampling point based on the first supplementary probability value.

[0117] Specifically, if the first supplementary probability value is greater than the preset probability value, it indicates that the corresponding first supplementary sampling point is likely located in the target's vicinity. Therefore, the first supplementary sampling point is accepted, meaning it is added as an additional target sampling point to supplement the sampling points in the target's vicinity and maintain a dense distribution of sampling points in that region. If the first supplementary probability value is not greater than the preset probability value, it indicates that the corresponding first supplementary sampling point is likely located in the background region. Therefore, the first supplementary sampling point is rejected to avoid wasting computational resources and processing time, and to avoid negatively impacting the feature extraction capabilities of the target detection model.
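
The accept-or-reject rule for first supplementary sampling points can be sketched as follows. This is a minimal sketch assuming the random selection option mentioned above; the function name `supplement_by_threshold` and the rounding of the supplementary point count are assumptions.

```python
import numpy as np

def supplement_by_threshold(p_map, num_target_points, coeff, prob_threshold, rng):
    """Draw candidate points at random and keep those above the preset probability."""
    h, w = p_map.shape
    # Number of supplementary points = preset coefficient x total target points.
    n_extra = int(round(coeff * num_target_points))
    xs = rng.integers(0, w, size=n_extra)
    ys = rng.integers(0, h, size=n_extra)
    # First supplementary probability value at each candidate point.
    keep = p_map[ys, xs] > prob_threshold
    return np.stack([xs[keep], ys[keep]], axis=1)

p_map = np.zeros((64, 64))
p_map[24:40, 24:40] = 1.0
extra_points = supplement_by_threshold(p_map, num_target_points=64, coeff=0.5,
                                       prob_threshold=0.5,
                                       rng=np.random.default_rng(0))
```

Every accepted point lies in the high-probability square, so the supplement only densifies the target's neighborhood; the number of accepted points is at most the supplementary sampling point number.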

[0118] As an optional embodiment, after moving the initial sampling point by the target offset distance along a specified direction to obtain the target sampling point, the method further includes:

[0119] Obtain the pre-constructed mapping relationship; the mapping relationship is the correspondence between the probability value corresponding to the sampling point and the sampling density; the probability value and the sampling density are positively correlated; the sampling density is the density of sampling points in an area centered on the sampling point with a preset length and a preset width;

[0120] Based on the mapping relationship, determine the target sampling density corresponding to the target probability value;

[0121] Determine the actual sampling density of the target area where the target sampling point is located;

[0122] If the actual sampling density is less than the target sampling density, the sampling interval is determined based on the target sampling density; according to the sampling interval, a second supplementary sampling point is selected in the target area as an additional target sampling point.

[0123] In this embodiment, if the sampling density of the target area where the target sampling point is located does not meet the requirements, additional sampling points can be added to the target area. Furthermore, it is still necessary to ensure that the additional sampling points are densely distributed in the vicinity of the target area and sparsely distributed in the background area.

[0124] The target area mentioned above is the region centered on the target sampling point, with a preset length and a preset width. As mentioned earlier, targets with a width less than 32 pixels and a length less than 32 pixels are usually defined as small targets. The preset length and preset width can be set to two to three times the length of the small target.

[0125] When the target probability values of target sampling points are different, the expected sampling density, i.e., the target sampling density, of the target region where the target sampling point is located will also be different. When the target probability value is larger, the target sampling density is larger, thus ensuring that after supplementing sampling points, all sampling points still follow the characteristic of being densely distributed in the target's vicinity and sparsely distributed in the background region.

[0126] Specifically, based on a pre-constructed mapping relationship, the target sampling density corresponding to the target probability value is determined. The mapping relationship is the correspondence between the probability value of a sampling point and the sampling density, and the probability value and sampling density are positively correlated. As an optional embodiment, the mapping relationship is expressed in terms of the preset minimum sampling density $\rho_{\min}$, the preset maximum sampling density $\rho_{\max}$, and the probability value $p$ corresponding to the sampling point. It can be seen that the larger the probability value corresponding to the sampling point, i.e., the target probability value, the closer the target sampling density is to the maximum sampling density; the smaller the target probability value, the closer the target sampling density is to the minimum sampling density.

[0127] Next, based on the number of target sampling points within the target area and the size of the target area, the actual sampling density of the target area is determined. If the actual sampling density is less than the target sampling density, it indicates that there are insufficient sampling points within the target area, and therefore additional sampling points need to be added within the target area. Specifically, the sampling interval is determined based on the target sampling density and the size of the target area. It can be understood that the higher the target sampling density, the smaller the sampling interval; and the lower the target sampling density, the larger the sampling interval. According to the sampling interval, second supplementary sampling points are evenly selected within the target area and used as the added target sampling points.
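
The density check can be sketched as follows. The exact mapping formula is not reproduced here, so this sketch assumes a linear interpolation between the minimum and maximum sampling densities, takes the sampling interval as the reciprocal square root of the target density, and treats the preset length as the horizontal extent of the target area. All function names are hypothetical.

```python
import numpy as np

def target_density(p, rho_min, rho_max):
    """Assumed linear mapping: p = 0 gives rho_min, p = 1 gives rho_max."""
    return rho_min + (rho_max - rho_min) * p

def actual_density(points, center, length, width):
    """Sampling points per unit area inside the target area around `center`."""
    cx, cy = center
    inside = ((np.abs(points[:, 0] - cx) <= length / 2)
              & (np.abs(points[:, 1] - cy) <= width / 2))
    return inside.sum() / (length * width)

def supplement_region(points, center, p, length, width, rho_min, rho_max):
    """Return evenly spaced second supplementary points, or none if dense enough."""
    rho_target = target_density(p, rho_min, rho_max)
    if actual_density(points, center, length, width) >= rho_target:
        return np.empty((0, 2))
    # Assumption: one point per interval x interval cell (rho_target must be > 0).
    interval = 1.0 / np.sqrt(rho_target)
    cx, cy = center
    nx = int(np.floor(length / interval + 1e-9)) + 1
    ny = int(np.floor(width / interval + 1e-9)) + 1
    xs = cx - length / 2 + interval * np.arange(nx)
    ys = cy - width / 2 + interval * np.arange(ny)
    gx, gy = np.meshgrid(xs, ys)
    return np.stack([gx.ravel(), gy.ravel()], axis=1)

# One target sampling point with probability 1.0 in a 16 x 16 target area.
sparse = np.array([[32.0, 32.0]])
extra = supplement_region(sparse, (32.0, 32.0), p=1.0, length=16, width=16,
                          rho_min=0.01, rho_max=0.0625)
```

With these illustrative values the target density is about 0.0625 points per pixel, so the interval is about 4 pixels and a 5 x 5 grid is added to the target area.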

[0128] As an optional embodiment, after moving the initial sampling point by the target offset distance along a specified direction to obtain the target sampling point, the method further includes:

[0129] The perturbation coefficient is determined based on the target probability value; the perturbation coefficient is negatively correlated with the target probability value.

[0130] The product of the perturbation coefficient and the preset unit perturbation distance is used as the target perturbation distance; the unit perturbation distance is less than the preset spacing.

[0131] Move the target sampling point by the target perturbation distance along the specified direction to obtain the final target sampling point.

[0132] In this embodiment, a micro-perturbation is applied to the target sampling point based on the target probability value, thereby ensuring full coverage of the target by the sampling points while maintaining a dense distribution of the target sampling points in the target's vicinity and a sparse distribution in the background area.

[0133] As an optional embodiment, the difference obtained by subtracting the target probability value from 1 is used as the perturbation coefficient; one quarter of the preset length or width of the target image block is used as the unit perturbation distance; and the product of the perturbation coefficient and the unit perturbation distance is used as the target perturbation distance.

[0134] Let the target probability value of the target sampling point be $p$, the unit perturbation distance be $d_p$, and the coordinates of the target sampling point be $(x, y)$. Then the perturbation coefficient is $k = 1 - p$, the target perturbation distance is $\delta = k \cdot d_p$, and the coordinates of the target sampling point after perturbation are $(x + \delta,\ y + \delta)$.

[0135] It can be seen that the target perturbation distance is negatively correlated with the target probability value. That is, the target sampling points in the high-probability area, i.e., the target neighborhood area, have a small target perturbation distance; the target sampling points in the low-probability area, i.e., the background area, have a large target perturbation distance. This ensures that after applying micro-perturbation, the target sampling points are still densely distributed in the target neighborhood area and sparsely distributed in the background area.

[0136] Based on this, the target sampling points are moved along a specified direction by a target perturbation distance to obtain the final target sampling points. For target sampling points in the target's vicinity, the target perturbation distance is relatively small, so the perturbed target sampling points are still densely distributed in the target's vicinity. For target sampling points in the background region, the target perturbation distance is relatively large, so the perturbed target sampling points are scattered in the candidate image patches, thereby improving the coverage of the target by the target sampling points.
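
The micro-perturbation step can be sketched as follows. This is an illustrative sketch that assumes the perturbation coefficient is one minus the target probability value and that the same perturbation distance is applied along both coordinate axes; `perturb_points` is a hypothetical name.

```python
import numpy as np

def perturb_points(points, probs, unit_distance):
    """Shift each (x, y) point by (1 - p) * unit_distance along both axes."""
    coeff = 1.0 - probs                  # perturbation coefficient, smaller for larger p
    distance = coeff * unit_distance     # target perturbation distance
    return points + distance[:, None]

# Two target sampling points: one on the target (p = 1), one in the background (p = 0).
points = np.array([[10.0, 10.0], [40.0, 40.0]])
probs = np.array([1.0, 0.0])
# Unit perturbation distance: one quarter of a hypothetical 16-pixel patch side.
final_points = perturb_points(points, probs, unit_distance=16 / 4)
```

The point with probability 1 does not move, and the point with probability 0 moves by the full unit perturbation distance, which matches the negative correlation described above.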

[0137] In summary, this application can autonomously and dynamically select target sampling points based on the feature maps of candidate image blocks, that is, automatically and dynamically select target image blocks that need to be super-resolution processed, thereby achieving the purpose of super-resolution processing of image blocks in the neighboring regions of the target in the original image, saving processing time and computing resources.

[0138] The specific structure of the object detection model and its training process are explained below.

[0139] As an optional embodiment, the target detection model includes a backbone network, a neck network, and a detection head; the backbone network is used to extract features from the original image to generate original feature maps at different scales; the neck network is used to perform feature fusion on the original feature maps; the detection head is used to generate target detection results based on the original feature maps after feature fusion; the target detection results include predicted bounding boxes and confidence scores indicating the presence of targets within the predicted bounding boxes.

[0140] Using super-resolution image patches, the object detection model is iteratively trained until the training termination condition is met, resulting in the final object detection model, including:

[0141] Obtain the pre-constructed joint loss function; the joint loss function includes an object detection loss term, a confidence loss term, and a super-resolution reconstruction loss term; wherein, the object detection loss term is determined based on the degree of overlap between the predicted bounding box and the pre-labeled ground truth bounding box; the confidence loss term is the difference between the confidence of this training and the confidence of the previous training; the super-resolution reconstruction loss term is the Euclidean distance between the super-resolution image patch and the target image patch;

[0142] Based on the joint loss function, the object detection model is trained in segments until the joint loss function reaches its minimum value, thus obtaining the final object detection model.

[0143] In this embodiment, the backbone network in the object detection model is used to extract features from the original image, generating multi-scale original feature maps to capture both global and local detail information of the original image. For example, as shown in Figure 2, HRNet-w18 (High-Resolution Network, width 18) can be used as the backbone network. In HRNet-w18, the network is divided into different stages, each generating an original feature map with a different resolution. In this embodiment, the output of Stage 2 in HRNet-w18 is used as the first original feature map, the output of Stage 3 is used as the second original feature map, and the output of Stage 4 is used as the third original feature map, in order to balance the detection capability for targets of different sizes and improve detection accuracy.

[0144] The first original feature map has a resolution of 1 / 4 of the original image resolution. This fine-grained feature map preserves more spatial details, facilitating accurate detection of small targets from the original image. The second original feature map has a resolution of 1 / 8 of the original image resolution. This feature map balances detail and semantic information, aiding in the detection of medium-sized targets from the original image. The third original feature map has a resolution of 1 / 16 of the original image resolution. This feature map contains rich semantic information, aiding in the detection of large targets from the original image.

[0145] Furthermore, after obtaining the multi-scale original feature maps generated by the backbone network, the channel dimension of each original feature map can be adjusted by introducing a 1×1 convolutional layer to match the input dimension of the downstream neck network.
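
As a rough illustration of the channel adjustment, a 1×1 convolution mixes channels independently at every pixel and leaves the spatial size unchanged. The sketch below uses NumPy only; the channel counts, the neck width of 64, and the 256 x 256 image size are hypothetical values chosen for the example, not values taken from this application.

```python
import numpy as np

def conv1x1(feature_map, weight, bias):
    """Apply a 1x1 convolution.

    feature_map: (C_in, H, W); weight: (C_out, C_in); bias: (C_out,).
    """
    return np.einsum("oc,chw->ohw", weight, feature_map) + bias[:, None, None]

rng = np.random.default_rng(0)
image_h, image_w = 256, 256
neck_channels = 64                      # hypothetical neck input width
in_channels = {4: 18, 8: 36, 16: 72}    # hypothetical channels per downsampling factor

aligned = {}
for stride, c_in in in_channels.items():
    # Original feature map at 1/4, 1/8 or 1/16 of the image resolution.
    fmap = rng.standard_normal((c_in, image_h // stride, image_w // stride))
    weight = rng.standard_normal((neck_channels, c_in)) * 0.01
    bias = np.zeros(neck_channels)
    aligned[stride] = conv1x1(fmap, weight, bias)
```

After this step all three feature maps share the same channel count while keeping their own resolutions, which is what a feature-fusion neck typically expects.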

[0146] In this embodiment, the target detection network further includes a neck network and a detection head. The neck network is used to fuse multi-scale original feature maps, and the detection head is used to generate target detection results based on the fused original feature maps. This embodiment does not limit the structure of the neck network and the detection head. For example, the neck network of YOLO-v8 can be used as the aforementioned neck network, and the detection head of YOLO-v8 can be used as the aforementioned detection head.

[0147] In this embodiment, a joint loss function is constructed for the target detection model, and the parameters of the target detection model are adjusted by segmented joint training to improve the target detection accuracy.

[0148] Specifically, the joint loss function can be expressed as $L = \lambda_1 L_{SR} + \lambda_2 L_{det} + \lambda_3 L_{conf}$, where $L_{SR}$ is the super-resolution reconstruction loss term, $L_{det}$ is the target detection loss term, $L_{conf}$ is the confidence loss term, $\lambda_1$ is the first weight, corresponding to the super-resolution reconstruction loss term, $\lambda_2$ is the second weight, corresponding to the target detection loss term, and $\lambda_3$ is the third weight, corresponding to the confidence loss term. This embodiment does not impose any special limitations on the values of the above weights; they can be set according to the contribution of the super-resolution reconstruction loss term, the target detection loss term, and the confidence loss term to the joint loss function.

[0149] The super-resolution enhancement module performs super-resolution processing on the target image patch based on the multi-scale original feature maps output by the backbone network, resulting in a super-resolution image patch. Using the Euclidean distance between the target image patch and the super-resolution image patch as the super-resolution reconstruction loss term allows the super-resolution reconstruction task to focus more on overall smoothness, avoiding noise enhancement and the introduction of false alarms. The super-resolution reconstruction loss term can be expressed as $L_{SR} = \lVert I_{SR} - I_{T} \rVert_2$, where $I_{T}$ is the target image patch and $I_{SR}$ is the super-resolution image patch.

[0150] Introducing super-resolution reconstruction loss into the joint loss function and adjusting the parameters of the backbone network can improve the feature extraction capability of the backbone network, especially the feature extraction capability of the target's neighborhood in the image. It is more sensitive to the target's features, thereby improving the target detection capability of the target detection model.

[0151] The aforementioned object detection loss term is determined based on the degree of overlap between the predicted bounding box and the ground truth bounding box, where the ground truth bounding box is a pre-annotated detection box in the original image. As an optional embodiment, the overlap area between the predicted and ground truth bounding boxes is calculated based on the coordinates of the four corner points of the predicted bounding box and the four corner points of the ground truth bounding box; the ratio obtained by dividing the overlap area by the total area (i.e., the sum of the areas of the predicted and ground truth bounding boxes) is used as the IoU value (Intersection over Union). The object detection loss term is then computed from the IoU value.
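
An overlap computation of this kind can be sketched as follows. Note the assumptions: the sketch uses the conventional IoU definition, in which the denominator is the union (the sum of the two areas minus the overlap area), boxes are axis-aligned and given as `(x0, y0, x1, y1)`, and the loss form `1 - IoU` is a common choice assumed here rather than a formula taken from this application.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def detection_loss(pred_box, gt_box):
    """Assumed loss form: 1 - IoU, so perfect overlap gives zero loss."""
    return 1.0 - iou(pred_box, gt_box)

# Overlap area 1, union area 7.
example_iou = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Under this assumed form, the loss decreases as the predicted bounding box overlaps the ground truth bounding box more closely.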

[0152] The confidence loss term described above is the difference between the confidence of the target detection result obtained in the current training iteration, denoted $C_t$, and the confidence of the target detection result obtained in the previous training iteration, denoted $C_{t-1}$.

[0153] Introducing the aforementioned object detection loss term and confidence loss term into the joint loss function allows the parameters of the object detection model to be adjusted during iterative training to improve its object detection capability, thereby enhancing the accuracy of object detection.
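
How the three terms combine can be sketched as follows. This is a minimal sketch under stated assumptions: the two patches passed to the reconstruction loss are assumed to have already been brought to the same shape, the confidence loss is taken as an absolute difference, and the weight values are arbitrary illustrations. The function names are hypothetical.

```python
import numpy as np

def sr_reconstruction_loss(sr_patch, reference_patch):
    """Euclidean distance between two patches of identical shape."""
    return float(np.linalg.norm(sr_patch - reference_patch))

def confidence_loss(conf_now, conf_prev):
    """Assumed form: absolute difference between current and previous confidence."""
    return abs(conf_now - conf_prev)

def joint_loss(l_sr, l_det, l_conf, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss terms."""
    w1, w2, w3 = weights
    return w1 * l_sr + w2 * l_det + w3 * l_conf

sr_patch = np.zeros((2, 2))
reference_patch = np.array([[3.0, 0.0], [0.0, 4.0]])
l_sr = sr_reconstruction_loss(sr_patch, reference_patch)   # sqrt(9 + 16)
l_conf = confidence_loss(0.9, 0.8)
total = joint_loss(l_sr, l_det=0.5, l_conf=l_conf, weights=(0.1, 1.0, 0.5))
```

In a real training loop these values would be differentiable tensors in a deep learning framework; plain NumPy is used here only to make the arithmetic explicit.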

[0154] Based on the aforementioned joint loss function, the object detection model is trained in segments. This means that the training of the object detection model is divided into multiple stages. At different stages, parts of the object detection model's network are frozen, and the parameters of the unfrozen network are adjusted.

[0155] As an optional implementation, in the first stage of training, the current object detection model is used to generate object detection results, and the object detection loss is determined based on the ground truth bounding boxes and the predicted bounding boxes in the object detection results; the parameters of the backbone network are fixed, and the parameters of the neck network and the detection head are adjusted only using the object detection loss to obtain the object detection model with updated parameters.

[0156] In the second stage of training, the super-resolution enhancement module performs super-resolution processing on the target image patch based on the original feature map generated by the current target detection model to obtain a super-resolution image patch; the Euclidean distance between the super-resolution image patch and the target image patch is used as the super-resolution reconstruction loss term; the parameters of the neck network and the detection head are fixed, and the parameters of the backbone network are adjusted only using the super-resolution reconstruction loss term to obtain the target detection model with updated parameters.

[0157] In the third stage of training, object detection results are generated using the current object detection model. The confidence loss is determined based on the difference between the confidence scores in the current object detection results and those in the previous training iteration. The parameters of the backbone network are fixed, and the parameters of the neck network and the detection head are adjusted using only the confidence loss to obtain the updated object detection model, at which point the training ends. Alternatively, the difference between the confidence scores in the third stage and those in the first stage can also be used as the confidence loss.

[0158] In summary, the training of the object detection model is divided into three stages, with the parameters of the backbone network and the detection network (i.e., the neck network and the detection head) being updated alternately until the iterative training is completed.
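
The alternating freeze pattern of the three stages can be written down as a small schedule. This sketch is illustrative only: the `Stage` record and the module names are hypothetical, and the third stage is read here as being driven by the confidence loss, which is an interpretation of the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    loss: str            # loss term that drives the update in this stage
    trainable: tuple     # sub-networks whose parameters are adjusted
    frozen: tuple        # sub-networks whose parameters are fixed

SCHEDULE = (
    Stage("stage 1", "detection", ("neck", "head"), ("backbone",)),
    Stage("stage 2", "super_resolution", ("backbone",), ("neck", "head")),
    Stage("stage 3", "confidence", ("neck", "head"), ("backbone",)),
)

def trainable_flags(stage, modules=("backbone", "neck", "head")):
    """Map each sub-network to True if its parameters are updated in this stage."""
    return {module: module in stage.trainable for module in modules}

flags_per_stage = [trainable_flags(stage) for stage in SCHEDULE]
```

In a deep learning framework, these flags would typically be applied by enabling or disabling gradient computation for each sub-network's parameters before running the stage.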

[0159] Understandably, once the joint loss function reaches its minimum, i.e., the final object detection model is obtained, the super-resolution enhancement module can be discarded, and the object detection model can be used for object detection.

[0160] In summary, in this embodiment, during the training phase of the target detection model, a backbone-dual-branch architecture (i.e., backbone network, target detection branch (neck network and detection head), and super-resolution branch (i.e., super-resolution enhancement module)) is constructed to improve the model's feature extraction capability of images in the vicinity of the target. The parameters of the target detection model are adjusted in the direction of improving the target detection capability to achieve accurate target detection.

[0161] The following describes the process by which the super-resolution enhancement module performs super-resolution processing on the target image block.

[0162] As an optional embodiment, the super-resolution enhancement module includes a neighbor integration module, a multi-scale global context module, and a super-resolution module; performing super-resolution processing on the target image patch using the target detection model to obtain the super-resolution image patch includes:

[0163] Obtain the original feature map; and crop out the candidate feature map corresponding to the candidate image patch and the target feature map corresponding to the target image patch from the original feature map;

[0164] The candidate feature map and the original feature map are input into the neighbor integration module to enhance the original feature map and obtain an enhanced feature map; local feature maps corresponding to the candidate image patches are cropped from the enhanced feature map; the local feature maps and the candidate feature maps are fused to obtain the first feature map;

[0165] The candidate feature map and the target feature map are input into the multi-scale global context module so that the candidate feature map and the target feature map can be fused to obtain the second feature map;

[0166] The first feature map and the second feature map are input into the super-resolution module so that the first feature map and the second feature map are fused to obtain the third feature map. The third feature map is then used for feature extraction and upsampling to obtain a super-resolution image block.

[0167] In this embodiment, the original feature map is first cropped to obtain candidate feature maps corresponding to candidate image blocks and target feature maps corresponding to target image blocks, thereby reducing computational load. As an optional embodiment, the coordinates of the four corner points of the candidate image block are first determined, and an image with corner coordinates consistent with those of the candidate image block is cropped from the original feature map as the candidate feature map corresponding to the candidate image block. Similarly, the coordinates of the four corner points of the target image block are determined, and an image with corner coordinates consistent with those of the target image block is cropped from the original feature map as the target feature map corresponding to the target image block. The original feature maps at each scale are cropped according to the above steps, which will not be elaborated upon in this embodiment.
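The corner-based cropping described above can be illustrated with a minimal sketch. It assumes that image-space corner coordinates are mapped onto a feature map by dividing by that scale's downsampling stride; the stride mapping, the function name `crop_feature_map`, and the toy data are illustrative assumptions rather than details stated in this application.

```python
def crop_feature_map(feature_map, box, stride):
    """Crop the region of a 2-D feature map covered by an image-space box.

    feature_map: list of rows (H x W).
    box: (x0, y0, x1, y1) in image pixels, with x1 and y1 exclusive.
    stride: assumed downsampling factor of this scale (hypothetical parameter).
    """
    x0, y0, x1, y1 = box
    # Map the corner coordinates from image space to feature-map space.
    fx0, fy0 = x0 // stride, y0 // stride
    fx1 = -(-x1 // stride)  # ceiling division keeps partially covered cells
    fy1 = -(-y1 // stride)
    height, width = len(feature_map), len(feature_map[0])
    fx0, fy0 = max(fx0, 0), max(fy0, 0)
    fx1, fy1 = min(fx1, width), min(fy1, height)
    return [row[fx0:fx1] for row in feature_map[fy0:fy1]]


# An 8x8 toy feature map whose cell (r, c) stores the value 10 * r + c.
original = [[10 * r + c for c in range(8)] for r in range(8)]
candidate = crop_feature_map(original, box=(8, 8, 24, 16), stride=4)
print(candidate)
```

For original feature maps at several scales, the same function would simply be called once per scale with that scale's stride.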

[0168] It should be noted that the backbone network generates original feature maps at multiple scales. Therefore, the original feature maps at each scale need to be cropped as described above to obtain candidate feature maps and target feature maps at multiple scales.

[0169] Alternatively, as shown in Figure 2, the candidate image patch can be aligned with the backbone network by a feature alignment module (RoI Align module, i.e., Region of Interest Align module) and then input into the backbone network to generate multi-scale candidate feature maps; likewise, the target image patch is aligned with the backbone network by the feature alignment module and then input into the backbone network to generate multi-scale target feature maps.

[0170] The Multiscale Global Context Module (MGCM) in the super-resolution enhancement module is used to fuse candidate feature maps and target feature maps to obtain a second feature map. The goal of the MGCM is to retrieve semantic information related to the target from the candidate feature map, without spatial or distance limitations, thereby compensating for the lack of detail caused by limited receptive fields, enhancing the super-resolution reconstruction effect of target image patches, and ensuring computational efficiency. This embodiment does not impose any particular limitation on the specific network structure of the MGCM; for example, it can consist of a convolutional neural network including multiple input heads and a spatial pyramid pooling layer.

[0171] As shown in Figure 2, the multi-scale global context module includes a first global context module C1, a second global context module C2, and a third global context module C3. Different global context modules are used to process and fuse feature maps of different scales. For example, the first global context module C1 is used to fuse a first candidate feature map and a first target feature map, where the first candidate feature map is the candidate feature map cropped from the first original feature map, and the first target feature map is the target feature map cropped from the first original feature map. Likewise, the second global context module C2 is used to fuse a second candidate feature map and a second target feature map, and the third global context module C3 is used to fuse a third candidate feature map and a third target feature map.

[0172] Please refer to Figure 3, which is a schematic diagram illustrating the principle of generating the second feature map using a global context module provided in this application. Taking the first global context module C1 as an example, the first candidate feature map is linearly projected to obtain the Q vector, and the first target feature map is linearly projected to obtain the K vector and the V vector; based on the cross-attention mechanism, context information is then aggregated and upsampled to obtain the second feature map G, which fuses the first candidate feature map and the first target feature map.
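The cross-attention step can be sketched as follows. This is a minimal illustration, assuming that each feature map has been flattened into a list of token vectors, that the linear projections are plain matrices (`w_q`, `w_k`, `w_v`, hypothetical names), and that scaled dot-product attention is used; the upsampling step is omitted, and none of these choices is confirmed by this application.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(candidate_tokens, target_tokens, w_q, w_k, w_v):
    """Q comes from the candidate features; K and V come from the target features."""
    q = matmul(candidate_tokens, w_q)
    k = matmul(target_tokens, w_k)
    v = matmul(target_tokens, w_v)
    scale = math.sqrt(len(q[0]))
    # scores[i][j] is the dot product of candidate token i and target token j.
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    # Each output row is a weighted sum of the target value vectors.
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]           # stand-in for learned projections
candidate_tokens = [[1.0, 0.0], [0.0, 1.0]]   # two candidate positions, two channels
target_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = cross_attention(candidate_tokens, target_tokens,
                                 identity, identity, identity)
print(fused)
```

Because every candidate position attends over all target positions, the aggregation has no spatial or distance limit, which matches the stated goal of the MGCM.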

[0173] The neighbor integration module (Proximity Integration Module, PIM) in the super-resolution enhancement module is used to enhance the original feature map to obtain an enhanced feature map, crop the local feature map corresponding to the candidate image patch from the enhanced feature map, and fuse the local feature map with the candidate feature map to obtain a first feature map. The first feature map integrates features from the regions neighboring the target to aid super-resolution processing.

[0174] As shown in Figure 2, the neighbor integration module includes a first neighbor integration submodule B1, a second neighbor integration submodule B2, and a third neighbor integration submodule B3. Different neighbor integration submodules are used to process original feature maps of different scales and candidate feature maps of different scales. Taking the first neighbor integration submodule B1 as an example, it is used to enhance the first original feature map to obtain an enhanced feature map, crop the local feature map corresponding to the candidate image patch from the enhanced feature map, and fuse the local feature map with the first candidate feature map.

[0175] Please refer to Figure 4, which is a schematic diagram of the principle of generating the first feature map using the neighbor integration module provided in this application. Taking the first neighbor integration submodule B1 as an example, the first original feature map output by the backbone network is obtained and enhanced to obtain an enhanced feature map p; a local feature map p' corresponding to the candidate image block is cropped from the enhanced feature map p; and the local feature map p' is fused with the first candidate feature map Z to obtain the first feature map Z'.
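The data flow from p to p' to Z' can be sketched as below. This application does not specify the enhancement operation or the fusion operation, so the sketch substitutes a 3x3 mean filter as a placeholder enhancement and an element-wise sum as a placeholder fusion; both, along with the function names, are assumptions made only to show the order of operations.

```python
def box_blur(feature_map):
    """Placeholder enhancement: average each cell with its in-bounds 3x3 neighbors."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            cells = [feature_map[rr][cc]
                     for rr in range(max(r - 1, 0), min(r + 2, h))
                     for cc in range(max(c - 1, 0), min(c + 2, w))]
            row.append(sum(cells) / len(cells))
        out.append(row)
    return out


def neighbor_integration(original, candidate, top, left):
    """Return Z': candidate map Z fused with the matching crop p' of enhanced map p."""
    p = box_blur(original)                                    # enhanced feature map p
    h, w = len(candidate), len(candidate[0])
    p_local = [row[left:left + w] for row in p[top:top + h]]  # local feature map p'
    # Assumed fusion: element-wise sum of p' and Z.
    return [[a + b for a, b in zip(p_row, z_row)]
            for p_row, z_row in zip(p_local, candidate)]


original = [[1.0] * 4 for _ in range(4)]     # toy first original feature map
candidate = [[1.0, 1.0], [1.0, 1.0]]         # toy first candidate feature map Z
fused = neighbor_integration(original, candidate, top=1, left=1)
print(fused)
```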

[0176] Finally, the first feature map and the second feature map are input into the super-resolution module, which fuses them into a third feature map; feature extraction and upsampling are then performed on the third feature map to obtain a super-resolution image patch. For example, as shown in Figure 2, the super-resolution module includes a first super-resolution module A1, a second super-resolution module A2, and a third super-resolution module A3. Different super-resolution modules are used to process first feature maps of different scales and second feature maps of different scales. This embodiment will not elaborate on these details.

[0177] In summary, the super-resolution enhancement module in this embodiment includes a neighbor integration module, a multi-scale global context module, and a super-resolution module. These modules work together to perform super-resolution processing on the target image patch using the multi-scale original feature map generated based on the backbone network, resulting in a super-resolution image patch. This super-resolution image patch is then used to iteratively train the target detection model, enhancing the model's feature extraction capability and improving the accuracy of target detection.

[0178] Please refer to Figure 5, which is a schematic structural diagram of a target detection device provided in this application. The device includes:

[0179] The dataset acquisition module 501 is used to acquire the training dataset for training the target detection model; the training dataset includes the original images.

[0180] The candidate image patch determination module 502 is used to detect targets from the original image using a pre-trained specified model and extract candidate image patches containing targets from the original image.

[0181] The feature map determination module 503 is used to determine the feature map of the candidate image block based on the pixel value of each pixel in the candidate image block; the feature map includes at least one of a gradient saliency map for characterizing the gradient difference between each pixel, a texture heterogeneity map for characterizing the texture difference between each pixel, and a local contrast map for characterizing the gray level difference between each pixel.

[0182] The target image block determination module 504 is used to select target sampling points within candidate image blocks based on feature maps; and to extract target image blocks containing target sampling points from the original image.

[0183] The super-resolution module 505 is used to perform super-resolution processing on the target image patch using the target detection model to obtain the super-resolution image patch.

[0184] The training module 506 is used to iteratively train the target detection model using super-resolution image patches until the training termination condition is met, thereby obtaining the final target detection model.

[0185] For a detailed description of the target detection device provided in this application, please refer to the embodiments of the target detection method described above; further details will not be repeated here.
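As an illustration of the feature maps computed by the feature map determination module 503, the sketch below derives two of the three listed maps from raw pixel values. It assumes central differences for the gradient saliency map and the deviation from a 3x3 neighborhood mean for the local contrast map; this application does not fix these formulas, so they and the function names are illustrative choices. The texture heterogeneity map is not shown.

```python
def gradient_saliency(img):
    """Gradient magnitude from central differences (borders use clamped neighbors)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            gx = img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]
            gy = img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out


def local_contrast(img):
    """Absolute difference between each pixel and the mean of its 3x3 neighborhood."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            cells = [img[rr][cc]
                     for rr in range(max(r - 1, 0), min(r + 2, h))
                     for cc in range(max(c - 1, 0), min(c + 2, w))]
            out[r][c] = abs(img[r][c] - sum(cells) / len(cells))
    return out


# A 5x5 dark patch with one bright pixel, similar to a small infrared target.
patch = [[0] * 5 for _ in range(5)]
patch[2][2] = 9
grad = gradient_saliency(patch)
contrast = local_contrast(patch)
print(grad[2][1], contrast[2][2])
```

In this toy patch the local contrast peaks on the bright pixel itself, while the gradient magnitude peaks on its immediate neighbors, so the two maps highlight complementary parts of the target's neighborhood.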

[0186] Based on the above embodiments:

[0187] As an optional embodiment, the target image patch determination module 504 includes:

[0188] The initial sampling point determination module is used to construct a sampling grid on the candidate image patch and use the grid points of the sampling grid as initial sampling points; the spacing between adjacent grid points in the sampling grid is a preset spacing;

[0189] The probability map determination module is used to perform specified processing on the feature map to obtain the probability map corresponding to the candidate image patch; the specified processing includes at least normalization processing; the probability map is composed of probability values in one-to-one correspondence with the pixels;

[0190] The target probability value determination module is used to take the probability value of the probability map at the initial sampling point as the target probability value;

[0191] The offset coefficient determination module is used to generate standard Gaussian random numbers as offset coefficients for the initial sampling points;

[0192] The target distance determination module is used to take the product of the target probability value, the offset coefficient, and the preset unit offset distance as the target offset distance of the initial sampling point; the unit offset distance is less than the preset spacing;

[0193] The offset module is used to move the initial sampling point by a target offset distance along a specified direction to obtain the target sampling point;

[0194] The target image patch extraction module is used to extract target image patches containing target sampling points from the original image.
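The sampling-point selection performed by these submodules can be sketched as follows. The sketch assumes min-max normalization as the specified processing, a fixed unit vector as the specified direction, and clamping of moved points to the patch bounds; these choices and all identifiers are assumptions for illustration, not details given in this application.

```python
import random


def normalize(feature_map):
    """Min-max normalize a feature map into a per-pixel probability map in [0, 1]."""
    flat = [v for row in feature_map for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0
    return [[(v - lo) / span for v in row] for row in feature_map]


def offset_grid_points(feature_map, spacing, unit_offset, direction=(1.0, 0.0), seed=0):
    """Return (initial_point, target_point, probability) for every grid point."""
    assert unit_offset < spacing, "unit offset distance must be below the grid spacing"
    rng = random.Random(seed)
    prob = normalize(feature_map)
    h, w = len(prob), len(prob[0])
    results = []
    for y in range(0, h, spacing):
        for x in range(0, w, spacing):
            p = prob[y][x]                   # target probability value
            coeff = rng.gauss(0.0, 1.0)      # standard Gaussian offset coefficient
            dist = p * coeff * unit_offset   # target offset distance
            # Move along the assumed direction and clamp to the patch (assumption).
            tx = min(max(x + dist * direction[0], 0.0), w - 1)
            ty = min(max(y + dist * direction[1], 0.0), h - 1)
            results.append(((x, y), (tx, ty), p))
    return results


feature_map = [[r + c for c in range(8)] for r in range(8)]   # toy 8x8 feature map
results = offset_grid_points(feature_map, spacing=4, unit_offset=2.0)
for initial, target, p in results:
    print(initial, target, round(p, 3))
```

Because the offset distance is scaled by the probability value, grid points in low-probability regions stay close to the regular grid, while points in high-probability regions are spread more widely around their grid positions.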

[0195] As an optional embodiment, the device further includes:

[0196] The sampling point total determination module is used to determine the total number of target sampling points after the initial sampling points have been moved along the specified direction by the target offset distance to obtain the target sampling points;

[0197] The supplementary sampling point determination module is used to take the product of the preset coefficient and the total number of sampling points as the supplementary sampling point number;

[0198] The supplementary sampling point selection module is used to select several first supplementary sampling points from the candidate image block;

[0199] The supplementary probability value determination module is used to take the probability value of the probability map at the first supplementary sampling point as the first supplementary probability value;

[0200] The first supplementary module is used to select supplementary sampling points among the first supplementary sampling points whose first supplementary probability value is greater than a preset probability value as additional target sampling points.
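A minimal sketch of this supplementary selection is given below. It assumes that the first supplementary sampling points are drawn uniformly at random from the candidate image block and that their number equals the supplementary sampling point number; this application only states that several points are selected, so both points are assumptions, as are the identifiers.

```python
import random


def supplementary_points(prob_map, target_points, coeff, min_prob, seed=0):
    """Draw round(coeff * len(target_points)) random pixels and keep the likely ones."""
    rng = random.Random(seed)
    h, w = len(prob_map), len(prob_map[0])
    count = round(coeff * len(target_points))   # supplementary sampling point number
    drawn = [(rng.randrange(w), rng.randrange(h)) for _ in range(count)]
    # Keep only draws whose probability exceeds the preset probability value.
    return [(x, y) for x, y in drawn if prob_map[y][x] > min_prob]


# Toy probability map: left half unlikely (0.1), right half likely (0.9).
prob_map = [[0.1] * 4 + [0.9] * 4 for _ in range(8)]
targets = [(0, 0)] * 10
extra = supplementary_points(prob_map, targets, coeff=0.5, min_prob=0.5)
print(extra)
```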

[0201] As an optional embodiment, the device further includes:

[0202] The mapping relationship acquisition module is used to acquire a pre-built mapping relationship after moving the initial sampling point by a target offset distance along a specified direction to obtain the target sampling point. The mapping relationship is the correspondence between the probability value corresponding to the sampling point and the sampling density. The probability value and the sampling density are positively correlated. The sampling density is the density of sampling points in an area centered on the sampling point with a preset length and a preset width.

[0203] The target density determination module is used to determine the target sampling density corresponding to the target probability value based on the mapping relationship;

[0204] The actual density determination module is used to determine the actual sampling density of the target area where the target sampling point is located; if the actual sampling density is less than the target sampling density, the second supplementary module is triggered.

[0205] The second supplementary module is used to determine the sampling interval based on the target sampling density; and to select a second supplementary sampling point within the target area according to the sampling interval, as an additional target sampling point.
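The density check can be sketched as follows. The sketch assumes a linear, increasing mapping from probability to target density, a rectangular window centered on the sampling point, and a sampling interval of one over the square root of the target density; this application requires only a positive correlation, so the concrete mapping, the interval rule, and all names are assumptions.

```python
import math


def points_in_window(points, center, length, width):
    """Return the points inside a length x width window centered on `center`."""
    cx, cy = center
    return [(x, y) for x, y in points
            if abs(x - cx) <= length / 2 and abs(y - cy) <= width / 2]


def densify(points, center, prob, length, width, max_density):
    """Add grid points inside the window when it is sparser than its target density."""
    target_density = max_density * prob   # assumed linear, increasing mapping
    actual_density = len(points_in_window(points, center, length, width)) / (length * width)
    if target_density <= 0 or actual_density >= target_density:
        return []
    interval = 1.0 / math.sqrt(target_density)   # spacing that yields the target density
    cx, cy = center
    extra = []
    y = cy - width / 2
    while y <= cy + width / 2:
        x = cx - length / 2
        while x <= cx + length / 2:
            if (x, y) not in points:              # skip points that already exist
                extra.append((x, y))
            x += interval
        y += interval
    return extra


points = [(10, 10)]
dense = densify(points, center=(10, 10), prob=1.0, length=4, width=4, max_density=0.25)
sparse = densify(points, center=(10, 10), prob=0.2, length=4, width=4, max_density=0.25)
print(len(dense), len(sparse))
```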

[0206] As an optional embodiment, the device further includes:

[0207] The perturbation coefficient determination module is used to determine the perturbation coefficient based on the target probability value after moving the initial sampling point by the target offset distance along a specified direction to obtain the target sampling point; the perturbation coefficient is negatively correlated with the target probability value;

[0208] The target perturbation distance determination module is used to take the product of the perturbation coefficient and the preset unit perturbation distance as the target perturbation distance; the unit perturbation distance is less than the preset spacing;

[0209] The perturbation module is used to move the target sampling point by the target perturbation distance along a specified direction to obtain the final target sampling point.
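The perturbation step can be illustrated with a short sketch. It assumes the perturbation coefficient is `1 - p`, which is one simple choice satisfying the negative correlation with the target probability value, and a fixed unit vector as the specified direction; neither the formula nor the identifiers are stated in this application.

```python
def perturb(point, prob, unit_perturbation, spacing, direction=(0.0, 1.0)):
    """Shift a target sampling point by a distance that shrinks as probability grows."""
    assert unit_perturbation < spacing, "unit perturbation distance must be below the spacing"
    coeff = 1.0 - prob                   # assumed decreasing function of probability
    dist = coeff * unit_perturbation     # target perturbation distance
    return (point[0] + dist * direction[0], point[1] + dist * direction[1])


confident = perturb((4.0, 4.0), prob=1.0, unit_perturbation=2.0, spacing=4)
uncertain = perturb((4.0, 4.0), prob=0.25, unit_perturbation=2.0, spacing=4)
print(confident, uncertain)
```

Under this assumption, points that already sit on high-probability pixels are left almost in place, while points on low-probability pixels are moved further.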

[0210] As an optional embodiment, the object detection model includes a backbone network, a neck network, and a detection head; the backbone network is used to extract features from the original image to generate original feature maps at different scales; the neck network is used to perform feature fusion on the original feature maps; the detection head is used to generate object detection results based on the feature-fused original feature maps; the object detection results include predicted bounding boxes and confidence scores for the presence of objects within the predicted bounding boxes; the training module 506 includes:

[0211] The loss function acquisition module is used to acquire a pre-constructed joint loss function. The joint loss function includes an object detection loss term, a confidence loss term, and a super-resolution reconstruction loss term. Among them, the object detection loss term is determined based on the degree of overlap between the predicted bounding box and the pre-labeled ground truth bounding box; the confidence loss term is the difference between the confidence of this training and the confidence of the previous training; and the super-resolution reconstruction loss term is the Euclidean distance between the super-resolution image patch and the target image patch.

[0212] The segmented joint training module is used to perform segmented joint training on the object detection model based on the joint loss function until the joint loss function reaches its minimum value, thus obtaining the final object detection model.
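The three terms of the joint loss function can be sketched as below. The sketch assumes the detection term is one minus the intersection over union, the confidence term is an absolute difference, the two patches are flattened to equal length before the Euclidean distance is taken, and the terms are combined as a weighted sum; the weights and all identifiers are hypothetical, and this application does not state how the terms are combined.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def joint_loss(pred_box, true_box, conf_now, conf_prev, sr_patch, target_patch,
               weights=(1.0, 1.0, 1.0)):
    """Weighted sum of detection, confidence, and reconstruction terms (assumed form)."""
    detection = 1.0 - iou(pred_box, true_box)    # overlap-based detection term
    confidence = abs(conf_now - conf_prev)       # confidence change between rounds
    # Euclidean distance between the flattened, equal-length patches.
    reconstruction = math.sqrt(sum((s - t) ** 2 for s, t in zip(sr_patch, target_patch)))
    w_det, w_conf, w_sr = weights
    return w_det * detection + w_conf * confidence + w_sr * reconstruction


perfect = joint_loss((0, 0, 2, 2), (0, 0, 2, 2), 0.8, 0.8, [1.0, 2.0], [1.0, 2.0])
shifted = joint_loss((0, 0, 2, 2), (1, 0, 3, 2), 0.9, 0.7, [3.0, 4.0], [0.0, 0.0])
print(perfect, shifted)
```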

[0213] As an optional embodiment, the super-resolution enhancement module includes a neighbor integration module, a multi-scale global context module, and a super-resolution module; the step of using the target detection model to perform super-resolution processing on the target image patch to obtain a super-resolution image patch includes:

[0214] Obtain the original feature map; and crop out the candidate feature map corresponding to the candidate image patch and the target feature map corresponding to the target image patch from the original feature map;

[0215] The candidate feature map and the original feature map are input into the neighbor integration module to enhance the original feature map and obtain an enhanced feature map; local feature maps corresponding to the candidate image patches are cropped from the enhanced feature map; the local feature maps and the candidate feature maps are fused to obtain the first feature map;

[0216] The candidate feature map and the target feature map are input into the multi-scale global context module so that the candidate feature map and the target feature map can be fused to obtain the second feature map;

[0217] The first feature map and the second feature map are input into the super-resolution module so that the first feature map and the second feature map are fused to obtain the third feature map. The third feature map is then used for feature extraction and upsampling to obtain a super-resolution image block.

[0218] This application also provides an electronic device, which includes:

[0219] Memory, used to store computer programs;

[0220] A processor is used to implement the steps of any of the above target detection methods when executing a computer program.

[0221] For a detailed description of the electronic device provided in this application, please refer to the embodiments of the target detection method described above; further details will not be repeated here.

[0222] This application also provides a storage medium storing a computer program, which, when executed by a processor, implements the steps of any of the above-described target detection methods.

[0223] The aforementioned storage media include all forms of non-volatile memory, media, and memory devices, such as semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROMs and DVD-ROMs. Processors and memory may be supplemented by or integrated into dedicated logic circuitry.

[0224] For a detailed description of the storage medium provided in this application, please refer to the embodiments of the target detection method described above; further details will not be repeated here.

[0225] While this specification contains numerous specific implementation details, these should not be construed as limiting the scope of any invention or the scope of the claims, but rather are primarily intended to describe features of specific embodiments of a particular invention. Certain features described in the various embodiments herein may also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment may also be implemented separately in various embodiments or in any suitable sub-combination. Furthermore, while features may function in certain combinations as described above and even initially claimed in this way, one or more features from a claimed combination may be removed from that combination in some cases, and a claimed combination may refer to a sub-combination or a variation thereof.

[0226] Similarly, although the operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring these operations to be performed in the specific order shown or sequentially, or requiring all illustrated operations to be performed to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Furthermore, the separation of various system modules and components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Claims

1. A target detection method, characterized in that the method includes: obtaining a training dataset for training a target detection model, the training dataset including original images; detecting a target from the original image using a pre-trained specified model, and extracting a candidate image block containing the target from the original image; determining a feature map of the candidate image block based on the pixel value of each pixel in the candidate image block, the feature map including at least one of a gradient saliency map for characterizing gradient differences between the pixels, a texture heterogeneity map for characterizing texture differences between the pixels, and a local contrast map for characterizing grayscale differences between the pixels; selecting target sampling points within the candidate image block based on the feature map, and extracting a target image block containing the target sampling points from the original image; performing super-resolution processing on the target image block using the target detection model to obtain a super-resolution image block; and iteratively training the target detection model using the super-resolution image block until a training termination condition is met, to obtain the final target detection model; wherein the step of selecting target sampling points within the candidate image block based on the feature map includes: constructing a sampling grid on the candidate image block and using the grid points of the sampling grid as initial sampling points, the spacing between adjacent grid points in the sampling grid being a preset spacing; performing specified processing on the feature map to obtain a probability map corresponding to the candidate image block, the specified processing including at least normalization processing, and the probability map being composed of probability values in one-to-one correspondence with the pixels; taking the probability value of the probability map at the initial sampling point as a target probability value; generating a standard Gaussian random number as an offset coefficient for the initial sampling point; taking the product of the target probability value, the offset coefficient, and a preset unit offset distance as a target offset distance of the initial sampling point, the unit offset distance being less than the preset spacing; and moving the initial sampling point by the target offset distance along a specified direction to obtain the target sampling point.

2. The target detection method as described in claim 1, characterized in that, after moving the initial sampling point along the specified direction by the target offset distance to obtain the target sampling point, the method further includes: determining the total number of target sampling points; taking the product of a preset coefficient and the total number of sampling points as a supplementary sampling point number; selecting several first supplementary sampling points from the candidate image block; taking the probability value of the probability map at each first supplementary sampling point as a first supplementary probability value; and taking, among the first supplementary sampling points, those whose first supplementary probability value is greater than a preset probability value as additional target sampling points.

3. The target detection method as described in claim 1, characterized in that, after moving the initial sampling point along the specified direction by the target offset distance to obtain the target sampling point, the method further includes: obtaining a pre-constructed mapping relationship, the mapping relationship being the correspondence between the probability value corresponding to a sampling point and a sampling density, the probability value being positively correlated with the sampling density, and the sampling density being the density of sampling points in an area centered on the sampling point with a preset length and a preset width; determining, based on the mapping relationship, the target sampling density corresponding to the target probability value; determining the actual sampling density of the target area where the target sampling point is located; and, if the actual sampling density is less than the target sampling density, determining a sampling interval based on the target sampling density and selecting, according to the sampling interval, a second supplementary sampling point in the target area as an additional target sampling point.

4. The target detection method as described in claim 1, characterized in that, after moving the initial sampling point along the specified direction by the target offset distance to obtain the target sampling point, the method further includes: determining a perturbation coefficient based on the target probability value, the perturbation coefficient being negatively correlated with the target probability value; taking the product of the perturbation coefficient and a preset unit perturbation distance as a target perturbation distance, the unit perturbation distance being less than the preset spacing; and moving the target sampling point by the target perturbation distance along the specified direction to obtain the final target sampling point.

5. The target detection method as described in claim 1, characterized in that the target detection model includes a backbone network, a neck network, and a detection head; the backbone network is used to extract features from the original image to generate original feature maps at different scales; the neck network is used to fuse features from the original feature maps; the detection head is used to generate target detection results based on the fused original feature maps; the target detection results include a predicted bounding box and a confidence score indicating the presence of a target within the predicted bounding box; and the step of iteratively training the target detection model using the super-resolution image patch until the training termination condition is met to obtain the final target detection model includes: obtaining a pre-constructed joint loss function, the joint loss function including an object detection loss term, a confidence loss term, and a super-resolution reconstruction loss term, wherein the object detection loss term is determined based on the degree of overlap between the predicted bounding box and the pre-labeled ground truth bounding box, the confidence loss term is the difference between the confidence score of the current training and the confidence score of the previous training, and the super-resolution reconstruction loss term is the Euclidean distance between the super-resolution image patch and the target image patch; and training the target detection model in segments based on the joint loss function until the joint loss function reaches its minimum value, thereby obtaining the final target detection model.

6. The target detection method as described in claim 5, characterized in that the super-resolution enhancement module includes a neighbor integration module, a multi-scale global context module, and a super-resolution module; and the step of using the target detection model to perform super-resolution processing on the target image patch to obtain a super-resolution image patch includes: obtaining the original feature map, and cropping from the original feature map the candidate feature map corresponding to the candidate image patch and the target feature map corresponding to the target image patch; inputting the candidate feature map and the original feature map into the neighbor integration module to enhance the original feature map and obtain an enhanced feature map, cropping a local feature map corresponding to the candidate image patch from the enhanced feature map, and fusing the local feature map and the candidate feature map to obtain a first feature map; inputting the candidate feature map and the target feature map into the multi-scale global context module so that the candidate feature map and the target feature map are fused to obtain a second feature map; and inputting the first feature map and the second feature map into the super-resolution module so that the first feature map and the second feature map are fused to obtain a third feature map, and performing feature extraction and upsampling on the third feature map to obtain the super-resolution image patch.

7. A target detection device, characterized in that the device includes: a dataset acquisition module, used to acquire a training dataset for training a target detection model, the training dataset including original images; a candidate image patch determination module, used to detect a target from the original image using a pre-trained specified model and to extract a candidate image patch containing the target from the original image; a feature map determination module, used to determine a feature map of the candidate image patch based on the pixel value of each pixel in the candidate image patch, the feature map including at least one of a gradient saliency map for characterizing gradient differences between the pixels, a texture heterogeneity map for characterizing texture differences between the pixels, and a local contrast map for characterizing grayscale differences between the pixels; a target image patch determination module, used to select target sampling points within the candidate image patch based on the feature map and to extract a target image patch containing the target sampling points from the original image; a super-resolution module, used to perform super-resolution processing on the target image patch using the target detection model to obtain a super-resolution image patch; and a training module, used to iteratively train the target detection model using the super-resolution image patch until a training termination condition is met, to obtain the final target detection model; wherein the target image patch determination module includes: an initial sampling point determination module, used to construct a sampling grid on the candidate image patch and to use the grid points of the sampling grid as initial sampling points, the spacing between adjacent grid points in the sampling grid being a preset spacing; a probability map determination module, used to perform specified processing on the feature map to obtain a probability map corresponding to the candidate image patch, the specified processing including at least normalization processing, and the probability map being composed of probability values in one-to-one correspondence with the pixels; a target probability value determination module, used to take the probability value of the probability map at the initial sampling point as a target probability value; an offset coefficient determination module, used to generate a standard Gaussian random number as an offset coefficient for the initial sampling point; a target distance determination module, used to take the product of the target probability value, the offset coefficient, and a preset unit offset distance as a target offset distance of the initial sampling point, the unit offset distance being less than the preset spacing; an offset module, used to move the initial sampling point by the target offset distance along a specified direction to obtain the target sampling point; and a target image patch extraction module, used to extract the target image patch containing the target sampling points from the original image.

8. An electronic device, characterized in that it includes: a memory, used to store a computer program; and a processor, configured to implement the steps of the target detection method as described in any one of claims 1 to 6 when executing the computer program.

9. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the steps of the target detection method as described in any one of claims 1 to 6.