A weakly supervised data set labeling method, device, labeling equipment and storage medium

By converting the point annotation information of the original image sequence into a spatial response heatmap and constructing a spatial weight matrix, and combining it with the C2f module of the YOLOv8 network to extract features, the problem of low quality of small target pseudo-box generation is solved, and efficient weakly supervised dataset annotation is achieved.

CN122244622A · Pending · Publication Date: 2026-06-19 · WUHAN INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
WUHAN INST OF TECH
Filing Date
2026-05-14
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing weakly supervised object detection methods generate low-quality pseudo bounding boxes for small objects, so the boxes must be corrected manually and annotation efficiency remains low.

Method used

By acquiring the point annotation information of the original image sequence, converting it into a spatial response heatmap and constructing a spatial weight matrix, combining it with the C2f module of the YOLOv8 network to extract features, performing feature fusion and weighted summation to generate an enhanced composite feature map, and finally performing spatially aware pooling to generate pseudo-box annotation results.

Benefits of technology

It improves the generation quality of pseudo-boxes for small targets, reduces the need for manual annotation, and improves annotation efficiency for weakly supervised datasets.

✦ Generated by Eureka AI based on patent content.

Figure CN122244622A_ABST

Abstract

This invention relates to a method, apparatus, annotation device, and storage medium for labeling weakly supervised datasets, belonging to the field of computer vision technology. The method includes: acquiring an original image sequence; converting point annotation information in the original image sequence into a spatial response heatmap and constructing a spatial weight matrix based on the spatial response heatmap; extracting P3, P4, and P5 layer features from the original image sequence and performing feature fusion on the aligned P3 layer features; determining dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map and performing a weighted summation of the P3, P4, and P5 layer features based on the dynamic weight coefficients; performing spatially perceptual pooling on the enhanced composite feature map to obtain a global perceptual feature map, and using the global perceptual feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet. This invention improves the efficiency of labeling weakly supervised datasets.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a weakly supervised dataset annotation method, apparatus, annotation device, and storage medium.

Background Technology

[0002] With the rapid development of artificial intelligence technology, object detection has been widely used in fields such as intelligent monitoring, industrial inspection, and remote sensing image analysis. However, high-performance object detection models usually rely on large-scale, high-precision bounding box annotation. This fully supervised annotation method is not only time-consuming and labor-intensive, but also extremely costly for massive amounts of data.

[0003] To reduce annotation costs, weakly supervised object detection methods based on point annotation, such as P2BNet, have become a research hotspot. These methods only require an annotator to click on the center of each target; the algorithm then automatically generates pseudo-boxes for model training.

[0004] However, existing point annotation methods (native P2BNet) typically use ResNet as the backbone network for feature extraction. Because ResNet suffers from severe feature information loss during deep downsampling, especially when dealing with small objects where edge textures become blurred, the generated pseudo-boxes exhibit poor localization accuracy and inaccurate boundary fitting. This necessitates extensive manual correction of the generated pseudo-boxes in practical industrial applications, failing to fundamentally solve the problem of low annotation efficiency.

[0005] Therefore, how to improve the quality of small target pseudo-box generation, reduce manual annotation costs, and thus improve annotation efficiency in the process of weakly supervised target detection has become an urgent technical problem to be solved.

Summary of the Invention

[0006] In view of this, it is necessary to provide a weakly supervised dataset annotation method, apparatus, annotation device and storage medium to solve the problem that the generation quality of small target pseudo-boundaries in the existing weakly supervised target detection process is low, requiring manual annotation and thus resulting in low efficiency.

[0007] To address the aforementioned problems, in a first aspect, this invention provides a weakly supervised dataset annotation method, comprising:
obtaining an original image sequence containing point annotation information, converting the point annotation information in the original image sequence into a spatial response heatmap based on a Gaussian kernel function, and constructing a spatial weight matrix based on the spatial response heatmap;
extracting P3, P4 and P5 layer features from the original image sequence based on the C2f module of the YOLOv8 network, and performing feature fusion on the feature-aligned P3 layer features based on the spatial weight matrix to obtain a recalibrated feature map;
determining dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, and performing a weighted summation of the P3 layer features, P4 layer features and P5 layer features based on the dynamic weight coefficients to obtain an enhanced composite feature map; and
performing spatially aware pooling on the enhanced composite feature map to obtain a globally aware feature map, and using the globally aware feature map as input to P2Bnet to obtain the pseudo-box annotation results output by P2Bnet.

[0008] In one possible implementation, the conversion of point annotation information in the original image sequence into a spatial response heatmap based on a Gaussian kernel function includes: The coordinates of point labels in the original image sequence are converted into a spatial response heatmap with the same resolution as the original image sequence based on the Gaussian kernel function.

[0009] In one possible implementation, constructing the spatial weight matrix based on the spatial response heatmap includes: The spatial response heatmap is convolved and downsampled sequentially to obtain a spatial weight matrix with the same feature scale as the features of layer P3.

[0010] In one possible implementation, the feature fusion on the feature-aligned P3 layer features based on the spatial weight matrix includes: performing convolution and normalization on the P3 layer features to obtain the feature-aligned P3 layer features; and multiplying the spatial weight matrix pixel-by-pixel with the feature-aligned P3 layer features, then applying convolution and non-linear activation to the product to obtain the recalibrated feature map.

[0011] In one possible implementation, determining the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map includes: Global average pooling is performed on the recalibrated feature map to determine the statistical energy of each channel dimension in the recalibrated feature map; Energy feature vectors are constructed based on the statistical energy of each channel dimension in the recalibrated feature map. The energy feature vectors are then subjected to fully connected processing and normalization to determine the dynamic weight coefficients.

[0012] In one possible implementation, the weighted summation of the P3 layer features, P4 layer features, and P5 layer features based on the dynamic weight coefficients includes: upsampling the P4 layer features and the P5 layer features respectively, so that the upsampled P4 and P5 layer features have the same resolution as the P3 layer features; and, based on the dynamic weight coefficients, performing a weighted summation of the P3 layer features, the upsampled P4 layer features, and the upsampled P5 layer features to obtain the enhanced composite feature map.

[0013] In one possible implementation, the step of using the globally perceived feature map as input to P2Bnet to obtain the pseudo-boundary annotation results output by P2Bnet includes: The global awareness feature map is used as input to P2Bnet, and multiple iterations of optimization are performed within P2Bnet. The candidate bounding box with the highest score for each target is used as the pseudo-boundary annotation result.

[0014] In a second aspect, the present invention also provides a weakly supervised dataset annotation apparatus, comprising:
a module configured to obtain the original image sequence containing point annotation information, convert the point annotation information in the original image sequence into a spatial response heatmap based on a Gaussian kernel function, and construct a spatial weight matrix based on the spatial response heatmap;
a fusion module configured to extract P3, P4 and P5 layer features from the original image sequence based on the C2f module of the YOLOv8 network, and to fuse the feature-aligned P3 layer features based on the spatial weight matrix to obtain the recalibrated feature map;
a summation module configured to determine the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, and to perform a weighted summation of the P3 layer features, P4 layer features and P5 layer features based on the dynamic weight coefficients to obtain the enhanced composite feature map; and
an annotation module configured to perform spatially aware pooling on the enhanced composite feature map to obtain a globally aware feature map, and to use the globally aware feature map as input to P2Bnet to obtain the pseudo-box annotation results output by P2Bnet.

[0015] In a third aspect, the present invention also provides an annotation device, including a memory and a processor, wherein the memory is used to store programs, and the processor, coupled to the memory, is used to execute the programs stored in the memory to implement the steps of the weakly supervised dataset annotation method described in any of the above implementations.

[0016] In a fourth aspect, the present invention also provides a computer-readable storage medium for storing a computer-readable program or instructions which, when executed by a processor, implement the steps of the weakly supervised dataset annotation method described in any of the above implementations.

[0017] The beneficial effects of this invention are as follows: The weakly supervised dataset annotation method, apparatus, annotation device, and storage medium provided by this invention first convert the point annotation information in the original image sequence into a spatial response heatmap, and construct a spatial weight matrix from the spatial response heatmap to provide a basis for subsequent spatial recalibration of features. Then, the C2f module extracts P3, P4, and P5 layer features from the original image sequence, and the spatial weight matrix is used to fuse the feature-aligned P3 layer features into a recalibrated feature map, achieving spatial recalibration of features and thus improving feature quality. Next, the dynamic weight coefficients are determined from the statistical energy of each channel dimension in the recalibrated feature map, and the dynamic weight coefficients are used to perform a weighted summation of the P3, P4, and P5 layer features to obtain an enhanced composite feature map, achieving cross-scale feature fusion, enhancing the geometric information of the feature map, and preserving more details of small objects. Finally, spatially aware pooling is performed on the enhanced composite feature map to suppress background noise, resulting in a globally aware feature map. This globally aware feature map is then used as input to P2Bnet to obtain the pseudo-box annotation results output by P2Bnet, realizing the annotation of weakly supervised datasets. By preserving more details at the feature level, this invention improves feature quality and thereby the generation quality of small object pseudo-boxes, avoiding subsequent manual annotation and improving the efficiency of weakly supervised dataset annotation.

Description of the Drawings

[0018] Figure 1 is a schematic diagram of an embodiment of the weakly supervised dataset annotation method provided by the present invention;
Figure 2 is a schematic diagram of an embodiment of the overall architecture for weakly supervised dataset annotation provided by the present invention;
Figure 3 is a schematic diagram of an embodiment of the weakly supervised dataset annotation network model architecture provided by the present invention;
Figure 4 is a schematic diagram of an embodiment of the bridging module architecture provided by the present invention;
Figure 5 is a schematic diagram of an embodiment of the weakly supervised dataset annotation apparatus provided by the present invention;
Figure 6 is a schematic diagram of an embodiment of the annotation device provided by the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0020] In the description of the embodiments of the present invention, unless otherwise stated, "multiple" means two or more. "And / or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone.

[0021] The terms "first," "second," etc., used in the embodiments of this invention are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a technical feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature.

[0022] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0023] This invention provides a weakly supervised dataset annotation method, apparatus, annotation device, and storage medium, which are described below.

[0024] Figure 1 is a schematic diagram of an embodiment of the weakly supervised dataset annotation method provided by the present invention. As shown in Figure 1, the weakly supervised dataset annotation method includes: S101. Obtain the original image sequence containing point annotation information, convert the point annotation information in the original image sequence into a spatial response heatmap based on the Gaussian kernel function, and construct a spatial weight matrix based on the spatial response heatmap.

[0025] It should be noted that the weakly supervised dataset annotation method provided by this invention can be applied to image target detection scenarios, especially image target detection scenarios containing weakly supervised (e.g., point annotation) information.

[0026] When labeling weakly supervised datasets, the labeling device (such as a desktop or portable computer) can first acquire the original image sequence containing point labeling information, then convert the point labeling information in the original image sequence into a spatial response heatmap using a Gaussian kernel function, and construct a spatial weight matrix using the spatial response heatmap to provide a basis for subsequent feature space recalibration.

[0027] S102. The C2f module based on the YOLOv8 network extracts P3, P4 and P5 features from the original image sequence, and performs feature fusion on the P3 features after feature alignment based on the spatial weight matrix to obtain the recalibrated feature map.

[0028] It should be noted that after constructing the spatial weight matrix, the P3, P4, and P5 layer features can be extracted from the original image sequence using the C2f module of the YOLOv8 network. Then, the P3 layer features after feature alignment are fused using the spatial weight matrix to obtain a recalibrated feature map, thereby achieving spatial recalibration of the features and improving feature quality.

[0029] S103. Based on the statistical energy of each channel dimension in the recalibrated feature map, determine the dynamic weight coefficients, and then perform a weighted summation of the P3 layer features, P4 layer features, and P5 layer features based on the dynamic weight coefficients to obtain the enhanced composite feature map.

[0030] It should be noted that after obtaining the recalibrated feature map, the dynamic weight coefficients can be determined by the statistical energy of each channel dimension in the recalibrated feature map. Then, the P3 layer features, P4 layer features and P5 layer features are weighted and summed using the dynamic weight coefficients to obtain the enhanced composite feature map, thereby achieving cross-scale fusion of features and enhancing the geometric information of the feature map.

[0031] S104. Perform spatially perceptual pooling on the enhanced composite feature map to obtain a global perceptual feature map, and use the global perceptual feature map as input to P2Bnet to obtain the pseudo-boundary annotation results output by P2Bnet.

[0032] It should be noted that, finally, spatially aware pooling can be performed on the enhanced composite feature map to suppress background noise and obtain a globally aware feature map. The globally aware feature map is then used as input to P2Bnet to obtain the pseudo-box annotation results output by P2Bnet, thus realizing the annotation of the weakly supervised dataset.

[0033] In summary, the weakly supervised dataset annotation method provided in this embodiment of the invention first converts the point annotation information in the original image sequence into a spatial response heatmap, and constructs a spatial weight matrix from the spatial response heatmap to provide a basis for subsequent spatial recalibration of features. Then, the C2f module extracts P3, P4, and P5 layer features from the original image sequence, and the spatial weight matrix is used to fuse the feature-aligned P3 layer features into a recalibrated feature map, achieving spatial recalibration of features and thus improving feature quality. Next, the dynamic weight coefficients are determined from the statistical energy of each channel dimension in the recalibrated feature map, and the dynamic weight coefficients are used to perform a weighted summation of the P3, P4, and P5 layer features to obtain an enhanced composite feature map, achieving cross-scale feature fusion, enhancing the geometric information of the feature map, and preserving more details of small objects. Finally, spatially aware pooling is performed on the enhanced composite feature map to suppress background noise, resulting in a globally aware feature map. This globally aware feature map is then used as input to P2Bnet to obtain the pseudo-box annotation results output by P2Bnet, realizing the annotation of weakly supervised datasets. By preserving more details at the feature level, this invention improves feature quality and thereby the generation quality of pseudo-boxes for small objects, avoiding subsequent manual annotation and improving the efficiency of weakly supervised dataset annotation.

[0034] In some embodiments of the present invention, the conversion of point annotation information in the original image sequence into a spatial response heatmap based on a Gaussian kernel function includes: The coordinates of point labels in the original image sequence are converted into a spatial response heatmap with the same resolution as the original image sequence based on the Gaussian kernel function.

[0035] It should be noted that when converting the point annotation information in the original image sequence into a spatial response heatmap using the Gaussian kernel function, the point annotation coordinates in the original image sequence can be converted into a spatial response heatmap with the same resolution as the original image sequence.

[0036] In some embodiments of the present invention, the construction of a spatial weight matrix based on a spatial response heatmap includes: The spatial response heatmap is convolved and downsampled sequentially to obtain a spatial weight matrix with the same feature scale as the features of layer P3.

[0037] It should be noted that when constructing the spatial weight matrix based on the spatial response heatmap, the spatial response heatmap can be convolved and downsampled sequentially to obtain a spatial weight matrix with the same feature scale as the features of layer P3.

[0038] In some embodiments of the present invention, the feature fusion on the feature-aligned P3 layer features based on the spatial weight matrix includes: performing convolution and normalization on the P3 layer features to obtain the feature-aligned P3 layer features; and multiplying the spatial weight matrix pixel-by-pixel with the feature-aligned P3 layer features, then applying convolution and non-linear activation to the product to obtain the recalibrated feature map.

[0039] It should be noted that when performing feature fusion on the feature-aligned P3 layer features based on the spatial weight matrix, the P3 layer features can first be convolved and normalized to obtain the feature-aligned P3 layer features. The spatial weight matrix is then multiplied pixel by pixel with the feature-aligned P3 layer features, and the product is convolved and non-linearly activated to obtain the recalibrated feature map.
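The gating step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `silu` and `recalibrate` are hypothetical, the convolution is simplified to a 1×1 channel mix with caller-supplied weights `mix`, and SiLU is chosen as the non-linear activation because the text does not name one.

```python
import math

def silu(v):
    """SiLU activation, assumed here; the source only says 'non-linear activation'."""
    return v / (1.0 + math.exp(-v))

def recalibrate(aligned, weight, mix):
    """Gate aligned features with a spatial weight matrix, then mix and activate.

    aligned: C x h x w nested lists (feature-aligned P3 features)
    weight:  h x w nested lists (spatial weight matrix)
    mix:     C_out x C weights of a hypothetical 1x1 convolution
    """
    channels, h, w = len(aligned), len(weight), len(weight[0])
    # Pixel-by-pixel multiplication, broadcast over channels.
    gated = [[[aligned[c][y][x] * weight[y][x] for x in range(w)]
              for y in range(h)] for c in range(channels)]
    # 1x1 convolution followed by the activation.
    return [[[silu(sum(row[c] * gated[c][y][x] for c in range(channels)))
              for x in range(w)] for y in range(h)] for row in mix]
```

Positions where the weight matrix is zero are suppressed to zero after activation, which is the intended effect of using the point-derived heatmap as an attention signal.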

[0040] In some embodiments of the present invention, determining the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map includes: Global average pooling is performed on the recalibrated feature map to determine the statistical energy of each channel dimension in the recalibrated feature map; Energy feature vectors are constructed based on the statistical energy of each channel dimension in the recalibrated feature map. The energy feature vectors are then subjected to fully connected processing and normalization to determine the dynamic weight coefficients.

[0041] It should be noted that when determining the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, global average pooling can be performed on the recalibrated feature map first to determine the statistical energy of each channel dimension in the recalibrated feature map. Then, an energy feature vector can be constructed based on the statistical energy of each channel dimension in the recalibrated feature map. Finally, the energy feature vector can be processed by full connection and normalization to determine the dynamic weight coefficients.
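A minimal sketch of this computation follows, assuming that the normalization is a softmax and that the fully connected layer outputs one coefficient per pyramid level; the text specifies neither. The function names and the parameters `fc_weights` and `fc_bias` are hypothetical.

```python
import math

def global_average_pool(feature_map):
    """Mean of each channel of a C x h x w nested list: the per-channel 'statistical energy'."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

def fully_connected(vec, weights, bias):
    """Dense layer: weights is n_out x n_in, bias has n_out entries."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax, used here as the assumed normalization."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_weights(recalibrated, fc_weights, fc_bias):
    """Energy vector -> fully connected layer -> normalized coefficients."""
    energy = global_average_pool(recalibrated)
    return softmax(fully_connected(energy, fc_weights, fc_bias))
```

With a softmax, the coefficients are positive and sum to one, so the later weighted summation stays on the same scale as its inputs.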

[0042] In some embodiments of the present invention, the weighted summation of the P3 layer features, P4 layer features, and P5 layer features based on the dynamic weight coefficients includes: upsampling the P4 layer features and the P5 layer features respectively, so that the upsampled P4 and P5 layer features have the same resolution as the P3 layer features; and, based on the dynamic weight coefficients, performing a weighted summation of the P3 layer features, the upsampled P4 layer features, and the upsampled P5 layer features to obtain the enhanced composite feature map.

[0043] It should be noted that when performing a weighted summation of the P3, P4, and P5 layer features based on the dynamic weight coefficients, the P4 and P5 layer features can first be upsampled so that they have the same resolution as the P3 layer features. The P3 layer features, the upsampled P4 layer features, and the upsampled P5 layer features are then weighted and summed using the dynamic weight coefficients to obtain the enhanced composite feature map.
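The upsample-then-sum step can be sketched as follows. The sketch assumes nearest-neighbor upsampling and equal channel counts on all three levels, neither of which is stated in the text. The factors 2 and 4 follow from the P3, P4, and P5 strides of 8, 16, and 32. The names `upsample_nearest` and `fuse` are hypothetical.

```python
def upsample_nearest(feature, factor):
    """Nearest-neighbor upsampling of a C x h x w nested list by an integer factor."""
    return [[[v for v in row for _ in range(factor)]
             for row in ch for _ in range(factor)] for ch in feature]

def fuse(p3, p4, p5, coeffs):
    """Weighted sum of P3 with P4 and P5 brought to the P3 resolution.

    coeffs: (a3, a4, a5) dynamic weight coefficients, one per level (assumed).
    """
    a3, a4, a5 = coeffs
    p4_up = upsample_nearest(p4, 2)   # stride 16 -> stride 8
    p5_up = upsample_nearest(p5, 4)   # stride 32 -> stride 8
    return [[[a3 * x3 + a4 * x4 + a5 * x5 for x3, x4, x5 in zip(r3, r4, r5)]
             for r3, r4, r5 in zip(c3, c4, c5)]
            for c3, c4, c5 in zip(p3, p4_up, p5_up)]
```

In a real network the three levels usually have different channel counts, so a projection to a common width would be needed before the summation.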

[0044] In some embodiments of the present invention, the step of using the globally perceived feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet includes: The global awareness feature map is used as input to P2Bnet, and multiple iterations of optimization are performed within P2Bnet. The candidate bounding box with the highest score for each target is used as the pseudo-boundary annotation result.

[0045] It should be noted that when using the global awareness feature map as input to P2Bnet to obtain the pseudo-boundary annotation results output by P2Bnet, multiple iterations of optimization can be performed within P2Bnet, and then the candidate annotation box with the highest score corresponding to each target can be used as the pseudo-boundary annotation result.
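The final selection step amounts to taking, for each target, the highest-scoring candidate. A minimal sketch follows, assuming the candidates have already been scored by the iterative optimization; the function name and the data layout (a dictionary of `(box, score)` pairs) are hypothetical.

```python
def select_pseudo_boxes(candidates):
    """Pick the highest-scoring candidate box for each target.

    candidates: dict mapping a target id to a non-empty list of (box, score) pairs,
                where box is any hashable description such as (x1, y1, x2, y2).
    """
    return {target: max(scored, key=lambda pair: pair[1])[0]
            for target, scored in candidates.items()}
```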

[0046] To address the shortcomings of existing technologies, this invention provides a weakly supervised target detection method based on feature adaptation and point guidance, which significantly improves the generation quality of small target pseudo-boundaries and reduces the secondary correction rate of manual annotation.

[0047] The specific steps of this method include: acquiring the original image sequence containing point annotation information and inputting it into a pre-built feature adaptation and point-guided weakly supervised model; generating high-fidelity pseudo-box annotations through an iterative regression mechanism, thereby constructing a candidate dataset under weak supervision guidance for manual consistency verification and fine-tuning.

[0048] Referring to Figure 2, the feature adaptation and point-guided weakly supervised model includes a feature adaptation bridging module (FASE), a backbone network feature extraction module, a hierarchical feature pyramid enhancement module (HFP), a spatially aware pooling module (SPPF), and a pseudo-box iterative regression module.

[0049] Referring to Figure 3, the feature adaptation bridging module is used to convert manually labeled discrete point coordinates into spatial response heatmaps and inject them as attention weights into the backbone network to achieve spatial recalibration of features, thereby guiding the model to accurately extract target edge information.

[0050] The backbone network feature extraction module is used to extract deep semantic features and fine-grained edge textures of images using multi-branch gradient flow with an integrated C2f structure.

[0051] The hierarchical feature pyramid enhancement module is used to achieve adaptive alignment and geometric information compensation of cross-scale features through bidirectional path aggregation and dynamic weight allocation.

[0052] The spatially perceptive pooling module is used to extract long-range correlation constraints between the target subject and local boundaries through multi-receptive field cascaded pooling in order to suppress background noise.
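Cascaded pooling of this kind can be illustrated on a single channel. The sketch below assumes the arrangement commonly used in SPPF blocks: a stride-1 max pool with a 5×5 window applied three times in sequence, with every intermediate result kept for later concatenation. The kernel size, the number of stages, and the function names are assumptions, not values taken from the text.

```python
def max_pool_same(channel, k=5):
    """Stride-1 max pooling with a k x k window; borders are clipped so the size is kept."""
    h, w, r = len(channel), len(channel[0]), k // 2
    return [[max(channel[yy][xx]
                 for yy in range(max(0, y - r), min(h, y + r + 1))
                 for xx in range(max(0, x - r), min(w, x + r + 1)))
             for x in range(w)] for y in range(h)]

def cascaded_pooling(channel, k=5, stages=3):
    """Return the input plus each successive pooling result (to be concatenated)."""
    outputs = [channel]
    for _ in range(stages):
        outputs.append(max_pool_same(outputs[-1], k))
    return outputs
```

Because each stage pools the previous stage's output, the effective window grows from 5 to 9 to 13 pixels, which gives several receptive fields at the cost of one small kernel.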

[0053] The pseudo-box iterative regression module is used to transform the enhanced composite features into boundary regression parameters and perform coarse-to-fine boundary refinement fitting with point labels as the core.

[0054] Referring to Figure 4, the Feature Adaptation Bridging Module (FASE) consists of a feature alignment branch, a spatial mask branch, and a feature fusion layer, and is used to achieve deep coupling between point annotations and convolutional features.

[0055] The spatial mask branch consists of a Gaussian mapping layer and a spatial downsampling layer. This branch transforms the original point annotations into a weight matrix adapted to the feature map size. The Gaussian mapping layer receives the coordinates of the manually annotated points from the original image sequence and uses a Gaussian kernel function to generate a two-dimensional spatial response heatmap G with the same resolution as the original image. Heatmap G is then fed into the spatial downsampling layer, where a convolutional layer with a stride of s downsamples G, reducing its spatial size from (H, W) to the size (h, w) of the backbone network features, which yields the spatial weight matrix. The stride s is chosen as follows: if the backbone network output is a P3 layer feature, s is set to 8; if the output is a P4 layer feature, s is set to 16; if the output is a P5 layer feature, s is set to 32. Downsampling with this stride ensures that the spatial dimensions of the weight matrix match those of the feature map, achieving a pixel-level point-to-point mapping.

[0056]

[0057] Here, G represents the generated two-dimensional spatial response heatmap. It is a single-channel matrix with the same resolution as the original input image. The value of each pixel in the matrix represents the intensity of the signal radiation from the labeled point at that location, and its value ranges between [0,1].

[0058] The point set in the formula is the set of manually labeled points. It contains the coordinates of the center or key points of all objects in the image and serves as the sole prior signal guiding the model in weakly supervised learning.

[0059] The i-th element of this set gives the pixel coordinates of the i-th labeled point. During the calculation, it is the central reference of the Gaussian distribution.

[0060] (x, y) represents the coordinates of any pixel in the original image space. It is the independent variable in the formula: all pixels in the image are traversed, and the spatial distance between each pixel and the center point is calculated.

[0061] exp(): Represents the exponential operator. Utilizing its nonlinear decay property, the response intensity decreases rapidly with increasing distance, thus creating a precise edge-guided gradient within a local region.

[0062] The scaling factor controls the response radius. Its value directly affects the model's tolerance to target size. In this invention, for the task of small target detection, it is usually set to a fixed constant of 3, but it can also be adaptively adjusted according to the average distribution density of targets in the image.
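The heatmap construction described by these definitions can be sketched as follows. The sketch assumes the common Gaussian form exp(-d²/(2σ²)), where d is the distance from a pixel to a labeled point and σ is the scaling factor (default 3, as stated above). Responses from several points are combined with a maximum so that values stay in [0, 1]; that combination rule is an assumption, as is the function name.

```python
import math

def gaussian_heatmap(points, height, width, sigma=3.0):
    """Single-channel heatmap with the same resolution as the image.

    points: list of (x, y) pixel coordinates of the labeled points.
    Overlapping responses are merged with max (an assumption) to stay within [0, 1].
    """
    heatmap = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for px, py in points:
                d2 = (x - px) ** 2 + (y - py) ** 2           # squared distance to the point
                response = math.exp(-d2 / (2.0 * sigma ** 2))
                if response > heatmap[y][x]:
                    heatmap[y][x] = response
    return heatmap

# Example: two labeled points in a 16 x 16 image.
G = gaussian_heatmap([(4, 3), (12, 9)], height=16, width=16)
```

Each labeled point produces a peak of exactly 1 at its own pixel, and the response falls off smoothly with distance.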

[0063]

[0064] Here, the spatial weight matrix is a scale-aligned guiding signal whose spatial dimensions (h×w) are completely consistent with the convolutional feature maps extracted by the backbone network.

[0065] The masked convolution operator applies a set of 3×3 convolution kernels to the original heatmap. This feature extraction smooths the Gaussian response and extracts preliminary spatial features in preparation for subsequent attention injection.

[0066] The spatial downsampling operator scales the heatmap so that its size is completely consistent with that of the feature map.
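The stride selection and downsampling can be illustrated with a short sketch. The learned 3×3 mask convolution and the strided convolution are not specified numerically in the text, so this example substitutes a fixed 3×3 box filter and s×s block averaging as stand-ins; both substitutions, and the names `STRIDE_BY_LEVEL` and `spatial_weight_matrix`, are assumptions made for illustration only.

```python
import numpy as np

# Stride per backbone level, as stated in the description (P3: 8, P4: 16, P5: 32).
STRIDE_BY_LEVEL = {"P3": 8, "P4": 16, "P5": 32}

def spatial_weight_matrix(heatmap, level="P3"):
    """Reduce an (H, W) heatmap to the (h, w) grid of the chosen feature level."""
    s = STRIDE_BY_LEVEL[level]
    H, W = heatmap.shape
    if H % s or W % s:
        raise ValueError("heatmap size must be divisible by the stride")
    # Stand-in for the learned 3x3 mask convolution: a fixed 3x3 box filter.
    padded = np.pad(heatmap, 1, mode="edge")
    smoothed = sum(
        padded[dy:dy + H, dx:dx + W] for dy in range(3) for dx in range(3)
    ) / 9.0
    # Stand-in for the stride-s convolution: average each s x s block.
    h, w = H // s, W // s
    return smoothed.reshape(h, s, w, s).mean(axis=(1, 3))
```

For a 64×64 input this yields an 8×8 matrix at P3, 4×4 at P4, and 2×2 at P5, matching the 1/8, 1/16, and 1/32 resolutions of the feature layers.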

[0067] The feature alignment branch consists of a channel reduction layer as its first layer and a normalization layer as its second layer. This branch receives the initial feature stream extracted by the backbone network (YOLOv8 Backbone). The feature stream includes the output of the C2f module. Because the C2f module possesses rich gradient diversity, the FASE module can use spatial gating to identify, among the complex feature channels, the edge signals with the highest spatial correlation to the point coordinates, and enhance them.

[0068] The channel reduction layer uses a 1×1 two-dimensional convolutional unit to process the feature map output by the backbone network. Channel dimensionality reduction forces the model to concentrate the most critical edge signals into 64 highly representative channels after convolution. These channels are trained to carry the most salient edge signals related to the point label location (i.e., the target center). Injecting the Gaussian mask at this point further improves the fit between weights and features, resulting in more accurate guidance and reduced computational redundancy in MIL (multiple instance learning). The normalization layer in the second layer uses LayerNorm to normalize the dimensionality-reduced features, improving feature stability.

[0069]

[0070] Here, the aligned feature map is the output of the feature alignment branch and provides a low-dimensional, highly stable feature representation.

[0071] The initial convolutional feature stream is extracted by the backbone network and contains the original high-dimensional semantic information.

[0072] The channel reduction operator is a 1×1 two-dimensional convolutional unit.

[0073] The normalization operator is preferably LayerNorm in this invention. It performs numerical normalization on the channel dimension within a single sample, aiming to enhance the numerical stability of features under different illumination or target scales.
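The feature alignment branch can be sketched as below. A 1×1 convolution is equivalent to applying the same linear map to the channel vector at every pixel, which is what the `einsum` call does. The sketch assumes a 256-channel input, random untrained weights, and a LayerNorm taken over the channel axis at each pixel without learnable scale or shift; these details are illustrative assumptions, as is the name `align_features`.

```python
import numpy as np

def align_features(feat, weight, eps=1e-5):
    """feat: (C_in, h, w); weight: (C_out, C_in) acting as a 1x1 convolution."""
    # A 1x1 convolution is a per-pixel linear map across channels.
    reduced = np.einsum("oc,chw->ohw", weight, feat)
    # Normalisation over the channel axis (assumed LayerNorm variant, no affine terms).
    mean = reduced.mean(axis=0, keepdims=True)
    var = reduced.var(axis=0, keepdims=True)
    return (reduced - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
feat = rng.standard_normal((256, 8, 8))          # hypothetical backbone feature
weight = rng.standard_normal((64, 256)) * 0.05   # untrained 1x1 kernel, 64 outputs
aligned = align_features(feat, weight)
```

The output has 64 channels with approximately zero mean and unit variance across channels at each spatial location.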

[0074] The feature fusion layer consists of a multiplication operator layer and an output transformation layer. This layer merges the information from the two branches described above. The multiplication operator layer multiplies the spatial weight matrix output by the spatial mask branch, pixel by pixel, with the feature map output by the feature alignment branch, achieving attention injection in the spatial dimension. The output transformation layer uses a set of cascaded 3×3 convolutional layers and the SiLU activation function to perform nonlinear reconstruction of the injected features, outputting the final recalibrated feature map.

[0075]

[0076] Here, the pixel-wise multiplication operator physically implements spatial gating of the gradient flow, thereby accurately extracting the physical edges of small targets from the C2f gradient flow.

[0077] The nonlinear activation function provides nonlinear mapping capability, making the edge contrast of the recalibrated feature map steeper and providing high-quality discriminative input for subsequent pseudo-box regression.
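The gating step can be sketched in a few lines. The example shows only the pixel-wise multiplication followed by the SiLU activation; the cascaded 3×3 convolutions of the output transformation layer are deliberately omitted, and the function names are hypothetical.

```python
import numpy as np

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def spatial_gate(aligned, weight_matrix):
    """aligned: (C, h, w); weight_matrix: (h, w). Returns SiLU(aligned * weight)."""
    if aligned.shape[1:] != weight_matrix.shape:
        raise ValueError("weight matrix must match the feature map's spatial size")
    # Broadcast the single-channel weights over every feature channel.
    gated = aligned * weight_matrix[None, :, :]
    return silu(gated)
```

Locations where the weight matrix is zero are driven to zero in every channel, which is how responses far from any annotated point are suppressed.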

[0078] The Hierarchical Feature Pyramid Enhancement (HFP) module consists of two parts: dynamic factor prediction and weighted aggregation of cross-scale features. The HFP module receives the recalibrated feature map output by the FASE module and combines it with the multi-scale feature layers P3, P4, and P5 output by the backbone network to perform collaborative enhancement. Through an adaptive weight allocation mechanism, this module solves the technical problem of loss of geometric information of small targets under weak supervision.

[0079] Dynamic factor prediction means that the system feeds the recalibrated feature map into the weight-aware branch built into the HFP module. This branch uses global average pooling (GAP) to extract the salient edge energy E from the recalibrated feature map. The vector E is then linearly weighted and activated by a fully connected layer (MLP). Subsequently, normalization is applied to obtain three nonlinear weight coefficients, one each for the P3, P4, and P5 layer features.

[0080] Global average pooling, as a compression operator for spatial features, averages the recalibrated feature map over each channel. The 3D feature map is thereby compressed into a one-dimensional channel description vector, which is the salient edge energy E.

[0081] Its mathematical expression is:

[0082] This formula describes the process by which the system extracts channel descriptors using global average pooling. The operator traverses all pixels (i, j) of the feature map in the spatial dimension, sums their activation values, and divides by the total area H·W to obtain the mean. This process effectively eliminates interference from the target's spatial location, converging the scattered edge responses in space into statistical energy in the channel dimension, thereby extracting the edge energy vector E that can represent the global saliency of the target.

[0083] Global average pooling is performed because, after spatial gating in the FASE module, non-target noise in the feature map is greatly suppressed, allowing the weak geometric gradients of small targets to converge numerically in specific high-dimensional channels. Therefore, the energy E is in effect a set of physical representations reflecting the signal intensity of target edges in the feature map. It provides the most original, high signal-to-noise-ratio input data for subsequent decisions.
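As a minimal sketch of the pooling described above, the edge energy vector can be computed by summing each channel over all pixels and dividing by H·W. The function name `edge_energy` and the (C, H, W) layout are assumptions for illustration.

```python
import numpy as np

def edge_energy(feature_map):
    """feature_map: (C, H, W) -> length-C vector of per-channel means."""
    C, H, W = feature_map.shape
    # Sum activations over all pixels (i, j) and divide by the area H * W.
    return feature_map.sum(axis=(1, 2)) / (H * W)
```

Because the result depends only on the per-channel total, shifting the activations spatially leaves E unchanged, which is the location invariance noted in the description.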

[0084] The fully connected MLP layer receives the aforementioned edge energy E as input. It learns the nonlinear mapping between the edge energy distribution in the training data and the optimal fusion ratio, and then performs linear weighting and activation processing on the vector E. The calculation process is as follows: $Z = \mathrm{ReLU}(W \cdot E + b)$

[0085] The fully connected layer uses a learnable weight matrix W to extract deep features and perform nonlinear mapping on the channel energy E. In the formula, W·E realizes the feature interaction between channels, discovering which channels play a key role in the geometric restoration of small targets; the bias term b provides the basic mapping distribution; and the ReLU activation function enhances the nonlinear expressiveness of the model and filters out invalid negative correlation signals, finally outputting a preliminary score Z reflecting the importance of features at each scale. Here, Z is a vector containing three values, which represent the system's initial assessment scores of the importance of the three feature layers P3, P4, and P5, respectively. When the edge energy E indicates that the underlying edge information (P3 layer) is relatively weak or urgently needs enhancement, the fully connected layer automatically and significantly increases the score corresponding to the P3 layer through its internally learned mapping rules. This mechanism ensures that the underlying fine-grained geometric information dominates in subsequent cross-scale fusion, thereby accurately repairing the boundary details of the target under weak supervision.

[0086] Normalization is performed because the values of the Z vector output by the fully connected layer have varying numerical ranges. To achieve stable feature fusion, the system uses the Softmax operator to normalize the initial scores, generating the final dynamic weight coefficients, one for each of the P3, P4, and P5 layers.

[0087]

[0088] The normalized weights sum to 1, which ensures that the fusion process is in effect an energy distribution among information at different scales. The bottom-layer weight corresponds to the P3 layer features, which have the highest resolution and the richest physical texture. The high-level weights correspond to the P4 and P5 layer features, which have stronger semantic abstraction capabilities.

[0089] For small target detection, when the edge response of E is weak, the system automatically increases the value of the bottom-layer (P3) weight. This on-demand compensation mechanism ensures that the fine-grained geometric information at the bottom layer can be accurately injected into the deep semantic layer, thus correcting the boundary ambiguity under weak supervision.
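The weight prediction step (a fully connected layer with ReLU, followed by Softmax) can be sketched as follows. The weight matrix and bias in the usage lines are arbitrary hand-picked values, not trained parameters, and the function name `dynamic_weights` is hypothetical.

```python
import numpy as np

def dynamic_weights(E, W, b):
    """Map an edge energy vector E to three fusion weights that sum to 1.

    E: (C,) energy vector; W: (3, C) weight matrix; b: (3,) bias.
    """
    z = np.maximum(W @ E + b, 0.0)   # preliminary scores: ReLU(W E + b)
    z = z - z.max()                  # shift for numerical stability of exp
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()       # Softmax normalisation

# Illustrative values only (not learned parameters).
E = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
alphas = dynamic_weights(E, W, b)
```

Here the third raw score is negative and is clipped to zero by ReLU before Softmax, so it still receives a small but nonzero weight.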

[0090] Weighted aggregation of cross-scale features is achieved through a scale alignment layer and a weighted summation operator. Dynamically predicted weight coefficients are used to physically stitch features from different abstraction levels.

[0091] In the scale alignment layer, since the feature layers P3, P4, and P5 output by the backbone network have different spatial resolutions (1 / 8, 1 / 16, and 1 / 32 of the original image, respectively), the system first uses the highest resolution layer P3 as a benchmark, and performs a 2x upsampling on P4 and a 4x upsampling on P5 using bilinear interpolation operators. By upsampling, the size of the deep features is restored, ensuring that the high-order semantic information can be aligned pixel-to-pixel with the high-resolution details of the bottom layer P3.

[0092] In the weighted summation operator, the system uses the dynamic weight coefficients output by the decision branch to perform element-wise weighted summation on the aligned feature tensors, generating an enhanced composite feature map. Its expression is:

[0093] This invention achieves feature information extraction through this dynamic weighted aggregation. For small targets that are difficult to capture in weakly supervised environments, the system increases the bottom-layer weight, which forces the fine-grained edge gradients refined in the P3 layer to be injected into the global semantics. This not only compensates for the loss of geometric information caused by multiple downsampling in deep networks, but also utilizes the rich texture of P3 to repair the boundary details required for pseudo-box generation. P3 is the spatial baseline layer and is not upsampled.
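The scale alignment and weighted summation can be sketched as follows. For brevity the example uses nearest-neighbour upsampling via `np.repeat` in place of the bilinear interpolation named in the description, and it assumes the three layers already share the same channel count; both are simplifications, and the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, h, w) tensor (stand-in for bilinear)."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_scales(p3, p4, p5, weights):
    """Weighted sum of P3 with P4 (2x upsampled) and P5 (4x upsampled)."""
    a3, a4, a5 = weights
    p4_up = upsample_nearest(p4, 2)   # 1/16 -> 1/8 resolution
    p5_up = upsample_nearest(p5, 4)   # 1/32 -> 1/8 resolution
    return a3 * p3 + a4 * p4_up + a5 * p5_up
```

P3 is passed through unchanged as the spatial baseline, so the composite map keeps the 1/8 resolution of the P3 layer.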

[0094] The Spatial Aware Pooling Module (SPPF) follows the HFP module. Leveraging the boundary details already repaired by the HFP module, SPPF further enhances the spatial distribution constraints of these details. It is used to perform spatial constraint modeling in the horizontal dimension. The SPPF module consists of two layers: multi-receptive-field cascaded pooling and lateral aggregation of feature information.

[0095] Multi-receptive-field pooling refers to the SPPF module receiving the enhanced feature map output by the HFP and inputting it into a set of cascaded 5×5 max-pooling operators. The feature flow sequentially passes through three identical pooling layers, each further expanding the receptive field based on the previous layer. The model can capture the contrast relationship between the local edges of the target and the surrounding background, layer by layer, with point annotations as the core. This inside-out spatial scanning process provides a macroscopic geometric reference for subsequent prediction of the target's aspect ratio.

[0096] Lateral aggregation of feature information means that the system concatenates the intermediate feature maps output by the three pooling layers with the original feature map along the channel dimension, and then compresses and integrates them with a 1×1 convolutional layer to output the final globally perceptual feature map. The concatenation preserves the feature responses under different receptive fields. The original feature map provides precise, fine-grained details, while the features after triple pooling provide the overall outline of the target. Combining the two ensures that when the model generates pseudo-boxes, it can accurately attach to the edge gradients purified by FASE while also being constrained by the global context in the spatial dimension, effectively suppressing pseudo-box coordinate drift or size distortion caused by background noise.
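The cascaded pooling and concatenation can be sketched as follows. The example assumes stride-1 pooling with padding 2 so that the spatial size is preserved, and it stops at the concatenation; the final 1×1 convolution is omitted. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_5x5(feat):
    """Stride-1 5x5 max pooling with padding 2 on a float (C, H, W) tensor."""
    padded = np.pad(feat, ((0, 0), (2, 2), (2, 2)), constant_values=-np.inf)
    windows = sliding_window_view(padded, (5, 5), axis=(1, 2))  # (C, H, W, 5, 5)
    return windows.max(axis=(-2, -1))

def sppf_concat(feat):
    """Three cascaded poolings, concatenated with the input along channels."""
    y1 = max_pool_5x5(feat)   # 5x5 receptive field
    y2 = max_pool_5x5(y1)     # effectively 9x9
    y3 = max_pool_5x5(y2)     # effectively 13x13
    return np.concatenate([feat, y1, y2, y3], axis=0)
```

Feeding a single active pixel through the cascade shows the receptive field growing from 5×5 to 9×9 to 13×13, which is the inside-out expansion the description refers to.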

[0097] The pseudo-boundary iterative regression module receives the globally perceptual feature map output by the spatially perceptual pooling module (SPPF). Although this module references the iterative idea of ​​P2BNet in its regression mechanism, in this invention, its execution efficiency and accuracy are highly dependent on the saliency feature guidance provided by the aforementioned module.

[0098] The pseudo-boundary iterative regression module comprises three sub-modules: sub-module 1 grows initial pseudo-boundaries from points to surfaces; sub-module 2 performs coarse-to-fine iterative fitting based on enhanced features; and sub-module 3 outputs high-fidelity pseudo-boundary annotations. This ultimately generates high-precision, high-fidelity annotations, significantly reducing the workload of manual correction.

[0099] The initial pseudo-box growth module, which grows from points to surfaces, initializes a set of candidate boxes with different aspect ratios based on the saliency intensity of features within the neighborhood of a point. The system first performs maximum projection on the enhanced feature map along the channel dimension and combines it with local Gaussian smoothing to generate an activation response map that reflects the geometric contours of the target.

[0100] Within a predefined search neighborhood centered on the seed point, a response threshold is defined, and the pixels selected by this threshold form the set of salient pixels S. Based on this set, the system directly extracts the physical envelope span of the target using a spatial extremum search algorithm. The calculation formula is as follows:
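One possible reading of the envelope extraction is sketched below: project the feature map along channels, keep the pixels in a square neighborhood of the seed whose response reaches the threshold, and take the extent of that set. The Gaussian smoothing step is omitted, and the square neighborhood, the `>=` comparison, and the name `envelope_span` are assumptions for illustration.

```python
import numpy as np

def envelope_span(feature_map, seed, radius, threshold):
    """Return (width, height) of the salient set around seed, or None if empty.

    feature_map: (C, H, W); seed: (x, y) pixel coordinates.
    """
    response = feature_map.max(axis=0)   # maximum projection along channels
    x0, y0 = seed
    H, W = response.shape
    top, bottom = max(y0 - radius, 0), min(y0 + radius + 1, H)
    left, right = max(x0 - radius, 0), min(x0 + radius + 1, W)
    window = response[top:bottom, left:right]
    ys, xs = np.nonzero(window >= threshold)   # salient pixel set S
    if ys.size == 0:
        return None
    # Extremum search: span between the outermost salient pixels.
    return int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1)
```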

[0101] To ensure that the initial bounding box set can automatically adjust its shape based on geometric cues in the feature map, the system establishes a mapping function from preset scale templates to the actual physical scale. For the N preset sets of standard scale templates, the size of each final generated initial candidate box is determined by the following formula:

[0102] Here, the formula involves a preset baseline scaling factor, and the two ratios constitute the scale correction operators in the horizontal and vertical directions, respectively.

[0103] The coarse-to-fine iterative fitting module based on enhanced features refines the candidate box set generated by submodule 1. The system treats each candidate box set as a bag, with each candidate box serving as an instance. The system uses the RoI Align operator to extract the region features of each candidate box within the bag, and the MIL classification head uses these features to calculate the instance scores.

[0104]

[0105] This score accurately reflects the coverage of the high-energy core region of the target by the candidate bounding box. The candidate box with the highest score is selected as the iteration seed box. After entering the iteration, the coordinate optimization quantity E is calculated first for the current pseudo-box.

[0106]

[0107] The edge gradient field of the composite feature map benefits from the P3 layer texture injected by the HFP module, so this gradient field has extremely strong extremum characteristics at the target contour. The cosine term represents the cosine of the angle between the bounding box boundary normal and the feature gradient direction. The regression operator achieves decoupled optimization of scale scaling and center offset by performing independent gradient field response calculations on the four boundary branches of the box. When the initial box is located inside the target, each boundary, under the influence of the geometric guidance field, experiences an outward radial tension that drives the prediction box to expand spontaneously. By calculating the partial derivatives of the energy function with respect to the four coordinate parameters, the system can simultaneously correct the size deviation and positional offset of the initial bounding box, ensuring that the regression trajectory converges efficiently and stably to the physical boundary of the target.

[0108] The system then calculates, in the composite feature map, the spatial consistency partial derivatives of the gradients at the box boundaries.

[0109]

[0110] Due to the texture injection from layer P3, the gradient energy outside the boundary is higher, so the partial derivatives generate an outward positive pulling force that drives the box to expand. The system calculates the increment and performs the parameter update for this round:

[0111] The iteration ends when the number of iterations exceeds 12 or the iteration increment falls below the preset value.
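The stopping rule can be illustrated with a generic refinement loop. The update rule itself is not reproduced here: `step_fn` is a hypothetical callable standing in for the gradient-based increment, and the box layout (x1, y1, x2, y2), the default `eps`, and the largest-absolute-change convergence measure are all assumptions.

```python
def refine_box(box, step_fn, max_iters=12, eps=1e-3):
    """Iteratively update a box until the increment is small or 12 rounds pass.

    box:     tuple (x1, y1, x2, y2).
    step_fn: hypothetical callable returning a per-coordinate increment.
    Returns the refined box and the number of rounds actually run.
    """
    iterations = 0
    for _ in range(max_iters):
        delta = step_fn(box)
        box = tuple(c + d for c, d in zip(box, delta))
        iterations += 1
        # Stop early once the largest coordinate change drops below eps.
        if max(abs(d) for d in delta) < eps:
            break
    return box, iterations

# Toy increment: move each coordinate halfway toward a fixed target box.
target = (0.0, 0.0, 10.0, 10.0)
halfway = lambda b: tuple(0.5 * (t - c) for t, c in zip(target, b))
refined, rounds = refine_box((4.0, 4.0, 6.0, 6.0), halfway, eps=0.1)
```

With this toy increment the box, which starts inside the target, expands outward each round and the loop exits early once the change falls below `eps`; an increment that never shrinks would instead run the full 12 rounds.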

[0112]

[0113] After 12 rounds of iterative optimization, the output module for high-fidelity pseudo-boundary annotation selects the candidate bounding box with the highest score as the final generated pseudo-boundary annotation. The features are then stored in the candidate dataset. Through this closed loop from feature purification to iterative regression, this invention generates high-fidelity annotations with fully supervised accuracy under weak supervision, significantly reducing the workload of manual correction.

[0114] This invention preserves more details of small targets at the feature level, and the generated pseudo-boundaries are an improvement over the native P2BNet, alleviating the pain points of missed detection of small targets and bounding box drift. This invention also significantly reduces the cost of manual annotation; since the generated initial pseudo-boundaries are close to the quality of fully supervised annotation, the frequency of manual secondary correction is reduced by more than 50%. When processing datasets of the same size, annotation efficiency is improved by approximately 3 times. With the help of multi-scale pooling in the SPPF module, the model can still accurately locate targets in complex noisy environments, making it highly valuable for industrial applications.

[0115] To better implement the weakly supervised dataset annotation method of this embodiment of the invention, and on the basis of that method, this embodiment of the invention also provides, as shown in Figure 5, a weakly supervised dataset annotation device 500 comprising: a module 501, used to obtain the original image sequence containing point annotation information, convert the point annotation information in the original image sequence into a spatial response heatmap based on the Gaussian kernel function, and construct a spatial weight matrix based on the spatial response heatmap; a fusion module 502, used to extract P3, P4, and P5 layer features from the original image sequence with the C2f module of the YOLOv8 network, and to perform feature fusion on the P3 layer features after feature alignment based on the spatial weight matrix to obtain a recalibrated feature map; a summation module 503, used to determine the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, and to perform weighted summation on the P3, P4, and P5 layer features based on the dynamic weight coefficients to obtain the enhanced composite feature map; and an annotation module 504, used to perform spatially aware pooling on the enhanced composite feature map to obtain a global aware feature map, and to use the global aware feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet.

[0116] The weakly supervised dataset annotation device 500 provided in the above embodiments can implement the technical solutions described in the above weakly supervised dataset annotation method embodiments. The specific implementation principles of each module or unit can be found in the corresponding content in the above weakly supervised dataset annotation method embodiments, and will not be repeated here.

[0117] As shown in Figure 6, the present invention also provides a labeling device 600. The labeling device 600 includes a processor 601, a memory 602, and a display 603. Figure 6 shows only some of the components of the labeling device 600, but it should be understood that it is not required to implement all of the components shown, and more or fewer components may be implemented instead.

[0118] In some embodiments, processor 601 may be a central processing unit (CPU), microprocessor, or other data processing chip, used to run program code stored in memory 602 or process data, such as the weakly supervised dataset annotation method of the present invention.

[0119] In some embodiments, processor 601 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, processor 601 may be local or remote. In some embodiments, processor 601 may be implemented on a cloud platform. In one embodiment, the cloud platform may include a private cloud, public cloud, hybrid cloud, community cloud, distributed cloud, internal cloud, multi-cloud, etc., or any combination thereof.

[0120] In some embodiments, memory 602 may be an internal storage unit of labeling device 600, such as a hard disk or memory of labeling device 600. In other embodiments, memory 602 may also be an external storage device of labeling device 600, such as a pluggable hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., provided on labeling device 600.

[0121] Furthermore, the memory 602 may include both internal storage units of the annotation device 600 and external storage devices. The memory 602 is used to store the application software and various types of data installed on the annotation device 600.

[0122] In some embodiments, display 603 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an organic light-emitting diode (OLED) touchscreen, etc. Display 603 is used to display information from the annotation device 600 and to display a visual user interface. Components 601-603 of the annotation device 600 communicate with each other via a system bus.

[0123] In one embodiment, when processor 601 executes the weakly supervised dataset annotation program in memory 602, the following steps can be implemented: obtaining the original image sequence containing point annotation information, converting the point annotation information in the original image sequence into a spatial response heatmap based on the Gaussian kernel function, and constructing a spatial weight matrix based on the spatial response heatmap; extracting P3, P4, and P5 layer features from the original image sequence with the C2f module of the YOLOv8 network, and performing feature fusion on the P3 layer features after feature alignment based on the spatial weight matrix to obtain the recalibrated feature map; determining the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, and performing weighted summation on the P3, P4, and P5 layer features based on the dynamic weight coefficients to obtain the enhanced composite feature map; and performing spatially aware pooling on the enhanced composite feature map to obtain a global aware feature map, and using the global aware feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet.

[0124] It should be understood that when the processor 601 executes the weakly supervised dataset annotation program in the memory 602, in addition to the functions mentioned above, it can also perform other functions, as detailed in the description of the corresponding method embodiments above.

[0125] Furthermore, this embodiment of the invention does not specifically limit the type of the annotation device 600 mentioned. The annotation device 600 can be a portable electronic device such as a mobile phone, tablet computer, personal digital assistant (PDA), wearable device, or laptop computer. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The aforementioned portable electronic devices can also be other portable electronic devices, such as laptop computers with touch-sensitive surfaces (e.g., touch panels). It should also be understood that in some other embodiments of the invention, the annotation device 600 may not be a portable electronic device, but rather a desktop computer with a touch-sensitive surface (e.g., a touch panel).

[0126] Accordingly, this application also provides a computer-readable storage medium for storing a computer-readable program or instruction. When the program or instruction is executed by a processor, it can implement the steps or functions of the weakly supervised dataset annotation method provided in the above-described method embodiments.

[0127] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.), and the computer program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.

[0128] The weakly supervised dataset annotation method, apparatus, annotation device, and storage medium provided by this invention have been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this invention. Therefore, the content of this specification should not be construed as a limitation of this invention.

Claims

1. A weakly supervised dataset labeling method, characterized in that it comprises: obtaining the original image sequence containing point annotation information, converting the point annotation information in the original image sequence into a spatial response heatmap based on the Gaussian kernel function, and constructing a spatial weight matrix based on the spatial response heatmap; extracting P3, P4, and P5 layer features from the original image sequence with the C2f module of the YOLOv8 network, and performing feature fusion on the P3 layer features after feature alignment based on the spatial weight matrix to obtain the recalibrated feature map; determining the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, and performing weighted summation on the P3, P4, and P5 layer features based on the dynamic weight coefficients to obtain the enhanced composite feature map; and performing spatially aware pooling on the enhanced composite feature map to obtain a global aware feature map, and using the global aware feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet.

2. The weakly supervised dataset labeling method of claim 1, wherein the conversion of point annotation information in the original image sequence into a spatial response heatmap based on the Gaussian kernel function includes: converting the coordinates of the point annotations in the original image sequence, based on the Gaussian kernel function, into a spatial response heatmap with the same resolution as the original image sequence.

3. The weakly supervised dataset labeling method of claim 1, wherein the construction of the spatial weight matrix based on the spatial response heatmap includes: sequentially convolving and downsampling the spatial response heatmap to obtain a spatial weight matrix with the same feature scale as the P3 layer features.

4. The weakly supervised dataset labeling method of claim 1, wherein the feature fusion performed on the aligned P3 layer features based on the spatial weight matrix includes: performing convolution and normalization on the P3 layer features to obtain the P3 layer features after feature alignment; and multiplying the spatial weight matrix pixel by pixel with the P3 layer features after feature alignment, and performing convolution and nonlinear activation on the result of the pixel-by-pixel multiplication to obtain the recalibrated feature map.

5. The weakly supervised dataset labeling method of claim 1, wherein the determination of dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map includes: performing global average pooling on the recalibrated feature map to determine the statistical energy of each channel dimension in the recalibrated feature map; constructing an energy feature vector based on the statistical energy of each channel dimension in the recalibrated feature map; and subjecting the energy feature vector to fully connected processing and normalization to determine the dynamic weight coefficients.

6. The weakly supervised dataset labeling method of claim 1, wherein the weighted summation of the P3, P4, and P5 layer features based on the dynamic weight coefficients includes: upsampling the P4 layer features and the P5 layer features respectively, so that the upsampled P4 layer features and P5 layer features have the same resolution as the P3 layer features; and performing, based on the dynamic weight coefficients, weighted summation on the P3 layer features, the upsampled P4 layer features, and the upsampled P5 layer features to obtain the enhanced composite feature map.

7. The weakly supervised dataset labeling method of claim 1, wherein using the global aware feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet includes: using the global aware feature map as input to P2Bnet, performing multiple iterations of optimization within P2Bnet, and taking the candidate bounding box with the highest score for each target as the pseudo-boundary annotation result.

8. A weakly supervised dataset labeling apparatus, comprising: a module, used to obtain the original image sequence containing point annotation information, convert the point annotation information in the original image sequence into a spatial response heatmap based on the Gaussian kernel function, and construct a spatial weight matrix based on the spatial response heatmap; a fusion module, used to extract P3, P4, and P5 layer features from the original image sequence with the C2f module of the YOLOv8 network, and to perform feature fusion on the P3 layer features after feature alignment based on the spatial weight matrix to obtain the recalibrated feature map; a summation module, used to determine the dynamic weight coefficients based on the statistical energy of each channel dimension in the recalibrated feature map, and to perform weighted summation on the P3, P4, and P5 layer features based on the dynamic weight coefficients to obtain the enhanced composite feature map; and an annotation module, used to perform spatially aware pooling on the enhanced composite feature map to obtain a global aware feature map, and to use the global aware feature map as input to P2Bnet to obtain the pseudo-boundary annotation result output by P2Bnet.

9. A labeling device, characterized by comprising a memory and a processor, wherein: the memory is used to store a program; and the processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps in the weakly supervised dataset annotation method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store computer-readable programs or instructions which, when executed by a processor, can implement the steps in the weakly supervised dataset annotation method according to any one of claims 1 to 7.