Weakly supervised object detection method based on double-threshold heat map clustering candidate box and corresponding product
High-quality pseudo-labeled boxes are generated by using a dual-threshold heatmap clustering and classification score filtering method, which solves the problems of detection box integrity and discriminability in existing technologies and improves the accuracy and practicality of weakly supervised target detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN INSTITUTE OF TECHNOLOGY (SHENZHEN) (INSTITUTE OF SCIENCE AND TECHNOLOGY INNOVATION HARBIN INSTITUTE OF TECHNOLOGY SHENZHEN)
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing weakly supervised object detection methods struggle to simultaneously guarantee the integrity and discriminativeness of detection boxes during the pseudo-labeling stage, thus limiting the model's detection performance.
A pseudo-labeled box method based on dual-threshold heatmap clustering is used to generate pseudo-labeled boxes. The category-related heatmap is processed by high and low thresholds to construct a candidate cluster of pseudo-labeled boxes. High-quality pseudo-labeled boxes are then selected by combining the classification score matrix and used to train a weakly supervised object detection network.
It improves the accuracy and practicality of weakly supervised target detection, can more accurately locate object boundaries, reduce overfitting of discriminative localities of objects and missed detection or false merging of dense objects, and improves overall detection performance.
Figure CN122244413A_ABST
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and in particular to a weakly supervised object detection method and corresponding product based on dual-threshold heatmap clustering candidate boxes.
Background Technology
[0002] Object detection is one of the core tasks in computer vision, aiming to locate objects of interest in images and identify their categories. It is widely used in fields such as autonomous driving, intelligent surveillance, and medical image analysis. Traditional fully supervised object detection methods rely on large amounts of precisely labeled bounding box data, which is costly and time-consuming to acquire. To reduce labeling costs, weakly supervised object detection (WSOD) has emerged. It trains the model only under the supervision of image-level category labels (i.e., knowing only which objects are in the image, but not their specific locations), giving it both classification and localization capabilities.
[0003] Currently, mainstream weakly supervised object detection methods typically follow a paradigm of "multi-instance detection network generating candidate box scores → screening pseudo-labeled boxes → supervised optimization based on pseudo-labeled boxes." Clearly, the quality of the pseudo-labeled boxes directly determines the upper limit of the final model's performance. Existing techniques for selecting pseudo-labeled boxes mainly fall into two categories: The first type relies primarily on the classification confidence scores predicted by the model, for example, selecting the candidate boxes with the highest scores for each category as pseudo-labeled boxes. However, this type of method has a significant drawback: the model tends to focus on the most discriminative local regions of the object (such as the head), resulting in the selected pseudo-labeled boxes failing to cover the object's complete outline. The second type of method utilizes category-related heatmaps to provide prior location information, obtaining connected components by thresholding the heatmap, and using their bounding rectangles as pseudo-labeled boxes. While this type of method can better cover the entire object, its drawback is that when there are spatially adjacent objects of the same type in the image, their connected components in the heatmap are prone to merging, causing the generated pseudo-labeled boxes to incorrectly merge multiple independent instances into one, failing to achieve accurate instance differentiation.
[0004] In summary, existing weakly supervised object detection methods struggle to simultaneously guarantee the integrity and discriminability of bounding boxes during the pseudo-labeling stage, which becomes a key bottleneck restricting further improvements in model detection performance.
Summary of the Invention
[0005] This application provides a weakly supervised object detection method and corresponding product based on dual-threshold heatmap clustering candidate boxes. By generating pseudo-labeled boxes through dual-threshold heatmap clustering, it solves the problem that existing technologies cannot simultaneously guarantee the integrity of detection boxes and distinguish adjacent objects, thereby improving the overall performance of weakly supervised object detection.
[0006] In a first aspect, this application provides a weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes, the method comprising:
Step S1: Obtain category-related heatmaps of the input image for each target category;
Step S2: For the category-related heatmap of each target category, apply a high threshold and a low threshold respectively, and construct one or more pseudo-label box candidate clusters based on the spatial positional relationship between the thresholding results and the multiple candidate boxes pre-generated for the image;
Step S3: For each pseudo-label box candidate cluster, based on a classification score matrix, select a pseudo-labeled box from the candidate boxes contained in the candidate cluster;
Step S4: Use the selected pseudo-labeled boxes to train the weakly supervised object detection network to obtain the weakly supervised object detection model;
Step S5: Input the image to be detected into the weakly supervised object detection model and output the object detection result.
[0007] In a second aspect, this application provides a weakly supervised object detection device based on dual-threshold heatmap clustering candidate boxes, the device comprising:
The acquisition module, used to acquire category-related heatmaps of the input image for each target category;
The construction module, used to apply a high threshold and a low threshold respectively to the category-related heatmap of each target category, and to construct one or more pseudo-label box candidate clusters based on the spatial positional relationship between the thresholding results and the multiple candidate boxes pre-generated for the image;
The filtering module, used to select a pseudo-labeled box from the candidate boxes contained in each pseudo-label box candidate cluster based on a classification score matrix;
The training module, used to train the weakly supervised object detection network using the selected pseudo-labeled boxes to obtain the weakly supervised object detection model;
The inference module, used to input the image to be detected into the weakly supervised object detection model and output the object detection result.
[0008] In a third aspect, this application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the steps of the weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes as described above.
[0009] In a fourth aspect, this application provides a storage medium storing a computer program which, when executed by a processor, implements the steps of the above-described weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes.
[0010] As can be seen from the technical solution provided in this application, on the one hand, by simultaneously applying high and low thresholds to the heatmap of each category, and utilizing the spatial relationship between the high-threshold boxes and the scaled low-threshold boxes to construct a candidate cluster of pseudo-labeled boxes, the high-threshold boxes help distinguish adjacent instances of the same type of object, while the low-threshold boxes can more completely cover the object region. By filtering from the set of candidate boxes (i.e., the candidate cluster) between the two, the solution of this application cleverly combines the location information and classification score information of the heatmap, and can effectively pre-select those high-quality candidate boxes that can both completely cover a single object and be distinguished from other instances. This provides a higher-quality pseudo-supervision signal that is closer to the real annotation for subsequent training. On the other hand, using the selected high-quality pseudo-labeled boxes, a weakly supervised object detection network that can output a score vector containing all categories for each candidate box is trained. The basic structure provides a more complete semantic representation of categories by explicitly modeling background classes, which is more aligned with the paradigm of fully supervised object detection. By using more accurate pseudo-labeled boxes to supervise this network, noise and error propagation during training can be effectively reduced, enabling the network to learn more robust feature representations. This results in more accurate and reliable object detection results for input images during the inference phase after training. 
Furthermore, by combining dual-threshold heatmap clustering with score selection within candidate clusters, this application can generate pseudo-labeled boxes that both enclose the entire object and distinguish adjacent individuals, even when only image-level weak labels are available. This allows the finally trained model to locate object boundaries more accurately during testing and to reduce overfitting to discriminative local parts, as well as missed detections or false merging of dense objects, thereby improving the overall accuracy and practicality of object detection. In summary, by generating pseudo-labeled boxes through dual-threshold heatmap clustering, the technical solution of this application solves the problem that existing technologies cannot simultaneously guarantee the integrity of detection boxes and distinguish adjacent objects, thus improving the overall performance of weakly supervised object detection.
Attached Figure Description
[0011] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0012] Figure 1 is a flowchart of a weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes provided in an embodiment of this application;
Figure 2 is a diagram illustrating the overall architecture of the weakly supervised object detection method DANCE provided in an embodiment of this application;
Figure 3 is a visualization of the intermediate steps of the heatmap-guided candidate box filtering algorithm HGPS provided in an embodiment of this application;
Figure 4 is a schematic diagram illustrating the subordination relationship between the tightest bounding boxes provided in an embodiment of this application;
Figure 5 is a schematic diagram, provided in an embodiment of this application, illustrating the semantic gap that may exist between the basic score matrix and the weighted score matrix in a traditional WSDDN network and the necessary optimization;
Figure 6 is a visual comparison of the detection performance of the conventional method (OICR) and the method provided in this application (DANCE) on the "human" category;
Figure 7 is a schematic diagram of the structure of the weakly supervised object detection device based on dual-threshold heatmap clustering candidate boxes provided in an embodiment of this application;
Figure 8 is a schematic diagram of the structure of the electronic device provided in an embodiment of this application.
Detailed Implementation
[0013] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0014] In this specification, adjectives such as "first" and "second" are used only to distinguish one element or action from another, without necessarily requiring or implying any actual such relationship or order. Where circumstances permit, reference to an element or component or step (etc.) should not be construed as being limited to only one of the elements, components, or steps, but may be one or more of the elements, components, or steps, etc.
[0015] For ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn to actual scale.
[0016] Currently, mainstream weakly supervised object detection methods typically follow a paradigm of "multi-instance detection network generating candidate box scores → screening pseudo-labeled boxes → supervised optimization based on pseudo-labeled boxes." The quality of the pseudo-labeled boxes directly determines the upper limit of the final model's performance. Existing methods for screening pseudo-labeled boxes mainly fall into two categories: The first type relies primarily on the classification confidence scores predicted by the model, such as selecting the candidate boxes with the highest scores for each category as pseudo-labeled boxes. However, this type of method has a significant drawback: the model tends to focus on the most discriminative local regions of the object (such as the head), resulting in the selected pseudo-labeled boxes failing to cover the complete outline of the object. The second type of method utilizes the location prior information provided by category-related heatmaps, obtaining connected components by thresholding the heatmap and using their bounding rectangles as pseudo-labeled boxes. While this type of method can better cover the entire object, its drawback is that when there are spatially adjacent objects of the same type in the image, their connected components in the heatmap are prone to merging, causing the generated pseudo-labeled boxes to incorrectly merge multiple independent instances into one, failing to achieve accurate instance differentiation.
[0017] To address the aforementioned problems in existing technologies, this application proposes a weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes. As shown in the flowchart of Figure 1, it mainly includes steps S1 to S5, which are detailed below:
Step S1: Obtain category-related heatmaps of the input image for each target category.
[0018] Let the input image be $I$, of size $H \times W$, where $H$ and $W$ represent the height and width of the image, respectively. In one embodiment of this application, the acquisition of the category-related heatmap relies on weakly supervised semantic segmentation technology. Specifically, a pre-trained weakly supervised semantic segmentation network (e.g., an S2C network) is employed, which can learn from data with only image-level labels to generate a preliminary class activation map (CAM) $A \in \mathbb{R}^{C \times h \times w}$. It is a three-dimensional tensor, in which $C$ represents the total number of target categories in the dataset (e.g., $C = 20$ in the Pascal VOC dataset), and $h \times w$ is the size of the feature map output by the network, which is usually smaller than the size of the original image.
[0019] For each target category $c$ present in the image (i.e., each category contained in its image-level label), extract the corresponding channel feature map $A_c$ from the CAM. Then, bilinear interpolation is used to upsample $A_c$ back to the original image size to obtain $\tilde{A}_c$ of size $H \times W$. Next, min-max normalization is applied to $\tilde{A}_c$ to obtain the final category-related heatmap $\hat{A}_c$:
$$\hat{A}_c(x, y) = \frac{\tilde{A}_c(x, y) - \min_{(x', y')} \tilde{A}_c(x', y')}{\max_{(x', y')} \tilde{A}_c(x', y') - \min_{(x', y')} \tilde{A}_c(x', y')}$$
[0020] where $(x, y)$ represents the pixel coordinates on the heatmap. After normalization, the value of each pixel of the category-related heatmap $\hat{A}_c$ is constrained to the range [0, 1], and it directly reflects the activation intensity, or probability, with which that pixel position belongs to the target category $c$. The closer the value is to 1, the more likely the location is a discriminative region of the target category $c$; the closer the value is to 0, the lower the correlation. The category-related heatmap $\hat{A}_c$ provides pixel-level spatial location prior information for each category for subsequent steps, serving as a key bridge connecting image-level semantics (what is present) and instance-level localization (where it is).
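The min-max normalization in step S1 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: it assumes the channel map has already been upsampled to the image size and is stored as a list of rows, and the function name `minmax_normalize` and the small `eps` guard against a constant map are hypothetical additions.

```python
def minmax_normalize(heat, eps=1e-8):
    """Rescale a 2-D activation map (list of rows) to the range [0, 1]."""
    flat = [v for row in heat for v in row]
    lo, hi = min(flat), max(flat)
    scale = hi - lo + eps  # eps (an assumption) avoids division by zero on a constant map
    return [[(v - lo) / scale for v in row] for row in heat]


# Toy upsampled activation map for one category (values are illustrative only).
upsampled = [[0.0, 2.0],
             [4.0, 6.0]]
heatmap = minmax_normalize(upsampled)
```

With the guard term, the maximum lands marginally below 1; dropping `eps` gives exactly 1 whenever the map is not constant.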
[0021] Step S2: For the category-related heatmap of each target category, apply a high threshold and a low threshold respectively, and construct one or more pseudo-label box candidate clusters based on the spatial positional relationship between the thresholding results and the multiple candidate boxes pre-generated for the image $I$.
[0022] Step S2 is the core of this application for generating high-quality pseudo-labeled boxes, called Heatmap-Guided Proposal Screening (HGPS). Its purpose is to pre-select, from a large set $\mathcal{R}$ of pre-generated candidate boxes (e.g., approximately 2000 candidate boxes generated using methods such as MCG and Selective Search), high-quality candidate boxes with the potential to tightly enclose a single complete object, and to group them into different clusters, with each cluster corresponding to a potential object instance. Specifically, applying the high and low thresholds to the category-related heatmap of each target category and constructing one or more pseudo-label box candidate clusters based on the spatial positional relationship between the thresholding results and the candidate boxes pre-generated for the image $I$ can be achieved through the following steps S2.1 to S2.4:
Step S2.1: Apply the high threshold and the low threshold to the category-related heatmap respectively to obtain a high-threshold connected component mask and a low-threshold connected component mask.
[0023] For the category-related heatmap $\hat{A}_c$ of a target category $c$, a high threshold $\tau_h$ and a low threshold $\tau_l$ (with $\tau_h > \tau_l$) are set. Thresholding the category-related heatmap with each of them yields two binary masks:
[0024] $$\mathcal{M}_c^{h}(x, y) = \mathbb{1}\left[\hat{A}_c(x, y) \ge \tau_h\right], \qquad \mathcal{M}_c^{l}(x, y) = \mathbb{1}\left[\hat{A}_c(x, y) \ge \tau_l\right]$$
[0025] where $\mathbb{1}[\cdot]$ is an indicator function whose value is 1 when the condition is true, and 0 otherwise. $\mathcal{M}_c^{h}$ is the high-threshold connected component mask, which only preserves regions with very high activation values in the heatmap. These regions are usually the most discriminative parts of an object, but may not cover the object's edges. $\mathcal{M}_c^{l}$ is the low-threshold connected component mask, which preserves a wider range of activation regions and better covers the overall outline of an object, but may cause the activation regions of spatially adjacent instances of the same type of object to stick together. For the process and effects of the high and low threshold processing described above, please refer to Figure 3, which is a visualization of the intermediate steps of the HGPS algorithm provided in an embodiment of this application. The columns sequentially display the original image (column 1), the category-related heatmap (column 2), the high-threshold mask (column 3), the low-threshold mask (column 4), the high/low-threshold bounding boxes (column 5), and the final cluster of candidate bounding boxes (column 6). The high-threshold mask effectively distinguishes adjacent instances of the same type (e.g., two people), but fails to cover their entire bodies. The low-threshold mask covers the object more completely, but merges adjacent instances. This intuitively illustrates the inherent limitations of the single-threshold method, providing the motivation for adopting a dual-threshold method in this application.
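The dual-threshold step S2.1 can be illustrated with a short sketch. The threshold values 0.7 and 0.3, the use of `>=` rather than `>`, and the helper name `threshold_mask` are assumptions chosen for illustration; the source does not fix them here.

```python
def threshold_mask(heatmap, tau):
    """Binary mask: 1 where the normalized heatmap value reaches tau, else 0."""
    return [[1 if v >= tau else 0 for v in row] for row in heatmap]


# Toy normalized heatmap for one category (illustrative values).
heatmap = [[0.1, 0.4, 0.8],
           [0.2, 0.9, 0.5]]

TAU_HIGH, TAU_LOW = 0.7, 0.3  # hypothetical threshold values

high_mask = threshold_mask(heatmap, TAU_HIGH)  # keeps only the strongest activations
low_mask = threshold_mask(heatmap, TAU_LOW)    # keeps a wider, more complete region
```

Because the high threshold exceeds the low one, every pixel set in `high_mask` is also set in `low_mask`, which is what later allows each high-threshold component to be assigned to a low-threshold component.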
[0026] Step S2.2: Extract the closest bounding rectangle for each connected component mask, and use it as the high-threshold box and low-threshold box respectively.
[0027] For the high-threshold and low-threshold binary masks obtained in step S2.1, find all connected regions (i.e., connected blocks composed of pixels with a value of 1). For each connected region in the high-threshold mask, calculate its minimum bounding rectangle (i.e., the smallest rectangle that completely encloses the connected region with edges parallel to the image coordinate axes). This is called a high-threshold box, and the set of high-threshold boxes is denoted as $B^{h} = \{b^{h}_1, \dots, b^{h}_M\}$, where $M$ is the number of high-threshold boxes. Similarly, for each connected region in the low-threshold mask, its minimum bounding rectangle is extracted and called a low-threshold box. The set of low-threshold boxes is denoted as $B^{l} = \{b^{l}_1, \dots, b^{l}_N\}$, where $N$ is the number of low-threshold boxes. Because the high threshold is greater than the low threshold, each high-threshold connected component must be contained within at least one low-threshold connected component. Although the area of a high-threshold box is smaller, it can effectively distinguish different instances; the area of a low-threshold box is larger, and it can more completely enclose the object.
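Step S2.2 amounts to connected-component labelling followed by an axis-aligned bounding rectangle per component. The sketch below is one possible way to do this with a breadth-first search; the choice of 4-connectivity, the `(x_min, y_min, x_max, y_max)` box layout in inclusive pixel indices, and the function names are assumptions, since the source does not specify them.

```python
from collections import deque


def connected_components(mask):
    """Return one set of (row, col) pixels per 4-connected region of 1s."""
    h, w = len(mask), len(mask[0])
    seen, comps = set(), []
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 1 or (y, x) in seen:
                continue
            comp, queue = set(), deque([(y, x)])
            seen.add((y, x))
            while queue:
                cy, cx = queue.popleft()
                comp.add((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            comps.append(comp)
    return comps


def bounding_box(comp):
    """Smallest axis-aligned rectangle (x_min, y_min, x_max, y_max) enclosing a component."""
    ys = [p[0] for p in comp]
    xs = [p[1] for p in comp]
    return (min(xs), min(ys), max(xs), max(ys))


mask = [[1, 1, 0, 0, 1],
        [1, 1, 0, 0, 1],
        [0, 0, 0, 0, 0]]
components = connected_components(mask)
boxes = [bounding_box(c) for c in components]
```

Keeping the pixel sets alongside the boxes matters for the next step, where subordination is decided on the components rather than on the rectangles.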
[0028] Step S2.3: Based on the hierarchical relationship between high-threshold boxes and low-threshold boxes, determine the combination of box pairs used to construct candidate clusters of pseudo-labeled boxes.
[0029] Here, the subordination relationship between high-threshold boxes and low-threshold boxes is based on the inclusion relationship of the connected component masks, rather than the simple inclusion relationship of the bounding rectangles. Specifically, if the connected component corresponding to a high-threshold box is completely contained, at the pixel level, within the connected component corresponding to a low-threshold box, then the high-threshold box is said to belong to the low-threshold box. This is the logical basis for constructing candidate clusters. Based on this subordination relationship, the box pair combinations used to construct pseudo-label box candidate clusters are determined according to cases 1) to 3) below.
1) If a low-threshold box $b^{l}_j$ has no high-threshold box belonging to it, then the low-threshold box $b^{l}_j$ exists as an independent set of bounding boxes. This situation typically occurs when the heatmap response of the object instance is relatively dispersed and no significant high-activation core is formed.
[0030] 2) If a low-threshold box $b^{l}_j$ has one and only one high-threshold box $b^{h}_i$ belonging to it, then the high-threshold box $b^{h}_i$ and the low-threshold box $b^{l}_j$ form a box pair combination $(b^{h}_i, b^{l}_j)$.
[0031] 3) If a low-threshold box $b^{l}_j$ has multiple (denoted as $K$) high-threshold boxes $b^{h}_{i_1}, \dots, b^{h}_{i_K}$ belonging to it, then each of these high-threshold boxes forms a box pair with the low-threshold box $b^{l}_j$, yielding multiple box pairs, i.e., $(b^{h}_{i_1}, b^{l}_j), (b^{h}_{i_2}, b^{l}_j), \dots, (b^{h}_{i_K}, b^{l}_j)$. This corresponds to a scenario where a low-threshold box contains multiple object instances (such as a flock of sheep), and each high-threshold box indicates the core of one instance. For a clearer understanding of this subordination based on connected component inclusion rather than rectangular inclusion, please refer to Figure 4, which illustrates the subordination relationship between the tightest bounding boxes provided in this application. It demonstrates that a high-threshold box may be spatially contained within the rectangle of a low-threshold box, but if its corresponding pixel connected component actually belongs to the connected component of another low-threshold box, then the subordination should be determined based on the connected component, not the rectangle. This precise definition is fundamental to ensuring the accuracy of the subsequent candidate box clustering.
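The three cases of step S2.3 reduce to a subset test on pixel sets. The following sketch assumes each connected component is available as a set of `(row, col)` pixels; the function name `build_box_pairs` and the returned index-pair format are hypothetical.

```python
def build_box_pairs(high_comps, low_comps):
    """Pair each high-threshold component with the low-threshold component that contains it.

    Returns (pairs, standalone): pairs is a list of (high_index, low_index);
    standalone lists the low-threshold indices that own no high-threshold component.
    """
    pairs, standalone = [], []
    for j, low in enumerate(low_comps):
        # Subordination is decided on pixels (set inclusion), not on bounding rectangles.
        owned = [i for i, high in enumerate(high_comps) if high <= low]
        if not owned:
            standalone.append(j)             # case 1: no high-activation core
        pairs.extend((i, j) for i in owned)  # cases 2 and 3: one pair per core
    return pairs, standalone


# Toy example: low component 0 holds two cores, low component 1 holds none.
low_comps = [{(0, 0), (0, 1), (0, 2), (0, 3)}, {(2, 0)}]
high_comps = [{(0, 0)}, {(0, 3)}]
pairs, standalone = build_box_pairs(high_comps, low_comps)
```

Here the first low-threshold component yields two box pairs, matching the "flock of sheep" case, while the second is kept as an independent box.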
[0032] Step S2.4: For each box pair combination, all candidate boxes whose spatial location lies between the high-threshold box and the scaled low-threshold box in the combination are grouped into the same pseudo-label box candidate cluster.
[0033] For a given box pair combination $(b^{h}_i, b^{l}_j)$, the clustering operation can be accomplished through the following steps: scaling the low-threshold box, filtering candidate boxes, and constructing candidate clusters. 1) Scaling the low-threshold box: the low-threshold box in the box pair is enlarged. Specifically, the width and height of the low-threshold box $b^{l}_j$ are each multiplied by a scaling factor $r$ to obtain the enlarged low-threshold box $\tilde{b}^{l}_j$. The purpose of enlarging is to slightly expand the search range and ensure that high-quality candidate boxes close to the boundary of the low-threshold box are captured.
[0034] 2) Filtering candidate boxes and constructing candidate clusters: candidate boxes whose spatial location simultaneously meets the following two conditions are included in the pseudo-label box candidate cluster corresponding to the box pair combination: the intersection over union (IoU) with the high-threshold box is greater than a first threshold, and the IoU with the enlarged low-threshold box is less than a second threshold. Specifically, from the set $\mathcal{R}$ of all pre-generated candidate boxes, those that are spatially located between the high-threshold box $b^{h}_i$ and the enlarged low-threshold box $\tilde{b}^{l}_j$ are selected. The criterion is applied, for a candidate box $p \in \mathcal{R}$, to its IoU with the high-threshold box, $\mathrm{IoU}(p, b^{h}_i)$, and its IoU with the enlarged low-threshold box, $\mathrm{IoU}(p, \tilde{b}^{l}_j)$, where the two thresholds are preset small positive numbers used to control the tightness. Physically, this means that the selected candidate boxes must have sufficient overlap with the high-threshold box (to ensure they encompass the core of the instance), but cannot completely exceed the enlarged low-threshold box (to ensure they do not deviate too far from the overall instance). Through this "squeezing" strategy, it is possible to accurately anchor candidate boxes that both cover the core discriminative region of the instance and are constrained by the complete outline of the instance. As shown in column 5 of Figure 3, these selected candidate boxes (displayed in different colors) independently surround each object instance, effectively overcoming the shortcomings of the single-threshold method. All candidate boxes satisfying the above spatial relationship, along with the high-threshold box $b^{h}_i$, are grouped into the same set, forming a pseudo-label box candidate cluster. If the same candidate box simultaneously meets the screening criteria for multiple box pair combinations (i.e., it lies in multiple "sandwich layers"), a one-to-one allocation strategy is adopted: it is assigned to the cluster whose high-threshold box has the largest IoU with it. This ensures that each candidate box belongs to only one cluster, and therefore that the pseudo-labeled boxes selected from different clusters are different from each other.
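Step S2.4 can be sketched as follows. Several details here are assumptions rather than facts from the source: boxes are `(x1, y1, x2, y2)` tuples, the low-threshold box is enlarged about its centre, and the values `r=1.2`, `t_high=0.1` and `t_low=0.95` are placeholders. The two conditions follow the prose (IoU with the high-threshold box above the first threshold, IoU with the enlarged low-threshold box below the second); for brevity the sketch does not add the high-threshold box itself to each cluster.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def scale_box(box, r):
    """Multiply width and height by r, keeping the centre fixed (an assumption)."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw, hh = (box[2] - box[0]) * r / 2, (box[3] - box[1]) * r / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)


def cluster_candidates(candidates, pairs, r=1.2, t_high=0.1, t_low=0.95):
    """Assign each candidate to at most one (high_box, low_box) pair.

    When several pairs accept a candidate, the pair whose high-threshold box
    has the largest IoU with it wins (one-to-one allocation).
    """
    clusters = {k: [] for k in range(len(pairs))}
    for idx, cand in enumerate(candidates):
        best, best_iou = None, 0.0
        for k, (high, low) in enumerate(pairs):
            iou_h = iou(cand, high)
            if iou_h > t_high and iou(cand, scale_box(low, r)) < t_low and iou_h > best_iou:
                best, best_iou = k, iou_h
        if best is not None:
            clusters[best].append(idx)
    return clusters


pairs = [((2, 2, 4, 4), (0, 0, 6, 6)), ((12, 2, 14, 4), (10, 0, 16, 6))]
candidates = [(1, 1, 5, 5), (11, 1, 15, 5), (20, 20, 22, 22)]
clusters = cluster_candidates(candidates, pairs)
```

In this toy input the third candidate overlaps neither core and is discarded, while the first two each land in the cluster of the nearby instance.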
[0035] Through step S2, several pseudo-label box candidate clusters are generated for each category $c$. Each cluster corresponds to a potential object instance location implied by the heatmap, and the candidate boxes within the cluster are spatially close to the instance core while being constrained by the overall scope of the instance. Therefore, they have a high probability of being high-quality candidate boxes that can completely and accurately cover a single instance, which lays the foundation for the subsequent selection steps.
[0036] Step S3: For each pseudo-label candidate cluster, based on a classification score matrix, select a pseudo-label from the candidate boxes contained in the pseudo-label candidate cluster.
[0037] After step S2, candidate clusters of pseudo-labeled boxes are obtained for each target category. The goal of step S3 is to select the single most likely accurate candidate box from each candidate cluster as the pseudo-labeled box for subsequent supervised training of the network. Step S3 does not rely solely on spatial location, but introduces the semantic confidence of the model's prediction as the final decision criterion, achieving an organic combination of location pre-selection and score selection.
[0038] Specifically, for a given target category $c$, suppose that step S2 produced $Q$ pseudo-label box candidate clusters, denoted as $\mathcal{C}_1, \dots, \mathcal{C}_Q$. The classification score matrix $S$ is derived from an intermediate output of the weakly supervised object detection network currently being trained. In one embodiment of this application, during the first iteration or initial stage of training, this matrix may be derived from the weighted score matrix output by the basic multi-instance detection network module. In subsequent training, when cascaded instance tuning modules exist, the score matrix output by the previous instance tuning module is used.
[0039] As an embodiment of this application, selecting a pseudo-labeled box from the candidate boxes contained in each pseudo-label box candidate cluster based on a classification score matrix can proceed as follows: for each target category, obtain all corresponding pseudo-label box candidate clusters; for each pseudo-label box candidate cluster under the target category, obtain the scores, in the classification score matrix, of all candidate boxes in the cluster for the target category; select the candidate box with the highest score for the target category in each pseudo-label box candidate cluster as the pseudo-labeled box used for supervising network training. Specifically, for each pseudo-label box candidate cluster $\mathcal{C}_q$ of the target category $c$, suppose the cluster includes $L$ candidate boxes $p_1, \dots, p_L$. The scores of these boxes in the dimension of the target category $c$ are extracted from the classification score matrix to constitute a score set $\{S_{p_1, c}, \dots, S_{p_L, c}\}$. The candidate box in the cluster with the highest score for the target category $c$ is selected as the pseudo-labeled box for supervising network training; that is, the candidate box corresponding to the maximum score is found from the score set:
$$p^{*} = \arg\max_{p_l \in \mathcal{C}_q} S_{p_l, c}$$
[0040] This candidate box $p^{*}$ is taken as the pseudo-labeled box generated from the candidate cluster $\mathcal{C}_q$. Its category label is $c$, and its confidence score is $S_{p^{*}, c}$.
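The selection rule of step S3 is an argmax over the category column of the score matrix, restricted to the members of each cluster. A minimal sketch, assuming clusters are lists of candidate indices and the score matrix is a list of per-candidate score rows (both representations, and the name `select_pseudo_boxes`, are hypothetical):

```python
def select_pseudo_boxes(clusters, scores, category):
    """Pick, per cluster, the candidate with the highest score for the given category.

    Returns a list of (candidate_index, category, confidence) tuples.
    """
    selected = []
    for members in clusters:
        if not members:  # an empty cluster yields no pseudo-labeled box
            continue
        best = max(members, key=lambda i: scores[i][category])
        selected.append((best, category, scores[best][category]))
    return selected


# Three candidate boxes scored over two categories (illustrative values).
scores = [[0.1, 0.9],
          [0.2, 0.4],
          [0.3, 0.8]]
clusters = [[0, 1], [2]]
pseudo_boxes = select_pseudo_boxes(clusters, scores, category=1)
```

Each cluster contributes at most one pseudo-labeled box, so two adjacent instances that ended up in separate clusters each receive their own box.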
[0041] As can be seen from step S3 of the above embodiment, by competitively selecting within each relatively high-quality candidate cluster anchored by heatmap location information, and then based on the model's current classification confidence, inferior candidate boxes that may encompass multiple instances or background regions can be effectively filtered out. This is because candidate boxes that tightly surround a single complete object typically have higher classification confidence than those that encompass irrelevant regions or span multiple instances. The resulting set of pseudo-annotated boxes theoretically more closely approximates manually annotated true bounding boxes, providing high-quality supervision signals for subsequent training.
[0042] Step S4: Use the selected pseudo-labeled boxes to train the weakly supervised object detection network to obtain the weakly supervised object detection model.
[0043] The high-quality pseudo-labeled box candidate clusters generated in step S2 are used to supervise the training of a specially designed multi-instance detection network. This network substantially improves on the traditional WSDDN architecture and is called the Weakly Supervised Basic Detection Network (WSBDN). The overall network architecture and data flow are shown in Figure 2, which illustrates the overall architecture of the weakly supervised object detection method DANCE: the complete workflow and data flow from the input image, through heatmap generation and HGPS filtering and the training of the basic and cascaded detection networks, to the final detection output. As shown in Figure 2, the weakly supervised object detection network includes a basic multi-instance detection network module and at least one cascaded instance tuning module. The figure shows how features are extracted from the input image by the backbone network and candidate boxes are pre-generated; how the features and candidate boxes are fed into the WSBDN and the cascaded instance tuning modules for processing; how a heatmap extractor generates category-related heatmaps from the image and the aforementioned HGPS algorithm filters out pseudo-labeled boxes for supervising network training; and finally how the trained network outputs detection results for the input image. Figure 2 thus depicts the core training and inference modules of the method of this application and their interaction.
[0044] The structure and training of the Weakly Supervised Basic Detection Network (WSBDN) are described below. First, the WSBDN module is configured to: generate a zeroth classification score feature vector and a zeroth classification weight feature vector for each candidate box; stack these vectors over all candidate boxes to obtain a zeroth classification score feature matrix and a zeroth classification weight feature matrix; apply exponential normalization (softmax) to the zeroth classification score feature matrix along the category dimension to obtain the zeroth classification score matrix, and to the zeroth classification weight feature matrix along the candidate box dimension to obtain the zeroth classification weight matrix; multiply the zeroth classification score matrix and the zeroth classification weight matrix element-wise to obtain the zeroth weighted classification score matrix; and sum that matrix along the candidate box dimension to obtain the image-level prediction score. In detail: 1) Feature extraction and mapping: the input image is processed by a CNN backbone network (such as VGG16) to extract features. For each pre-generated candidate box, a fixed-length feature vector is extracted from the corresponding region through an RoI pooling layer and two fully connected layers. This feature vector is then fed into two parallel fully connected branches: the classification score branch and the classification weight branch. 1.1) Classification score branch: outputs a zeroth classification score feature vector. Note that the dimension of this vector is C + 1, where the first C dimensions correspond to the foreground categories and the (C + 1)-th dimension is the background class; adding this background dimension is a significant improvement over WSDDN.
[0045] 1.2) Classification weight branch: outputs a zeroth classification weight feature vector. Note that the dimension of this vector is also C + 1, where the first C dimensions correspond to the foreground categories and the (C + 1)-th dimension is the background class; adding this background dimension is a significant improvement over WSDDN.
[0046] 2) Normalization: stack the vectors of all candidate boxes into matrices, denoted here as the zeroth classification score feature matrix $F^{s} \in \mathbb{R}^{R \times (C+1)}$ and the zeroth classification weight feature matrix $F^{w} \in \mathbb{R}^{R \times (C+1)}$, where $R$ is the number of candidate boxes.
[0047] Exponential normalization of $F^{s}$ along the category dimension (the second dimension) yields the zeroth classification score matrix $S^{0}$, where $S^{0}_{r,c} = \exp(F^{s}_{r,c}) / \sum_{c'=1}^{C+1} \exp(F^{s}_{r,c'})$. This represents the normalized score of each candidate box across all categories (including background).
[0048] Exponential normalization of $F^{w}$ along the candidate box dimension (the first dimension) yields the zeroth classification weight matrix $W^{0}$, where $W^{0}_{r,c} = \exp(F^{w}_{r,c}) / \sum_{r'=1}^{R} \exp(F^{w}_{r',c})$. This represents the weight of the contribution of each candidate box to each category.
[0049] 3) Generate image-level predictions: the Hadamard product (element-wise multiplication) of $S^{0}$ and $W^{0}$ yields the zeroth weighted classification score matrix $\hat{S}^{0} = S^{0} \odot W^{0}$. Summing $\hat{S}^{0}$ along the candidate box dimension yields the image-level prediction score $\phi_{c} = \sum_{r=1}^{R} \hat{S}^{0}_{r,c}$. The score $\phi_{c}$ in each dimension indicates the network's overall confidence that the image contains an object of category $c$.
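The two normalizations and the image-level aggregation described above can be sketched in plain Python. This is only an illustration under assumptions: the function names `softmax` and `wsbdn_forward` are hypothetical, the inputs stand in for the raw outputs of the two fully connected branches, and nested lists replace the tensors a real network would use.

```python
import math


def softmax(values):
    """Numerically stable exponential normalization of a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def wsbdn_forward(score_feats, weight_feats):
    """Both inputs are R x (C+1) nested lists of raw branch outputs."""
    num_boxes = len(score_feats)
    num_classes = len(score_feats[0])
    # Softmax over categories, one row (candidate box) at a time.
    cls_scores = [softmax(row) for row in score_feats]
    # Softmax over candidate boxes, one column (category) at a time.
    weight_cols = [
        softmax([weight_feats[r][c] for r in range(num_boxes)])
        for c in range(num_classes)
    ]
    weights = [[weight_cols[c][r] for c in range(num_classes)] for r in range(num_boxes)]
    # Hadamard product, then sum over candidate boxes for the image-level score.
    weighted = [
        [cls_scores[r][c] * weights[r][c] for c in range(num_classes)]
        for r in range(num_boxes)
    ]
    image_scores = [
        sum(weighted[r][c] for r in range(num_boxes)) for c in range(num_classes)
    ]
    return cls_scores, weights, weighted, image_scores


# Two candidate boxes, two foreground categories plus background (toy values).
score_feats = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weight_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cls_scores, weights, weighted, image_scores = wsbdn_forward(score_feats, weight_feats)
```

Because each column of the weight matrix sums to 1, every image-level score is a weighted average of per-box scores and therefore stays between 0 and 1, which is what allows it to be used directly in a binary cross-entropy loss.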
[0050] Training the WSBDN module includes the following loss calculations: 1. Image-level loss: the image-level loss is calculated from the image-level prediction score and the box-level image labels. The box-level image labels extend the original concept of image-level labels: a label vector $y \in \{0, 1\}^{C+1}$ is defined in which, for each target category $c$ present in the image, $y_{c} = 1$, meaning "there must exist a candidate box of category $c$ in the image". In addition, the background class label is forced to $y_{C+1} = 1$, which expresses the prior knowledge that "there must be background candidate boxes in the image". The image-level loss uses binary cross-entropy loss:
[0051] $L_{\text{img}} = -\sum_{c=1}^{C+1} \left[ y_{c} \log \phi_{c} + (1 - y_{c}) \log (1 - \phi_{c}) \right]$, where:
[0052] $\phi_{c}$ is the image-level prediction score for category $c$, and $y_{c} = 0$ for every foreground category that is absent from the image.
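The box-level image labels and the binary cross-entropy loss described above can be illustrated with a short sketch. The function names are hypothetical, categories are indexed from 0 with the background class last, and summing (rather than averaging) over categories is an assumption of this example, since the original formula is not recoverable from the text.

```python
import math


def box_level_labels(present_categories, num_foreground):
    """Label vector of length C + 1; the last entry is the background class."""
    labels = [1.0 if c in present_categories else 0.0 for c in range(num_foreground)]
    labels.append(1.0)  # background candidate boxes are assumed to always exist
    return labels


def image_level_bce(image_scores, labels, eps=1e-7):
    """Binary cross-entropy between image-level scores and box-level labels."""
    loss = 0.0
    for p, y in zip(image_scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        loss -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return loss


# Three foreground categories; only category 1 appears in the image.
labels = box_level_labels({1}, 3)
uninformed_loss = image_level_bce([0.5, 0.5, 0.5, 0.5], labels)
```

With all four scores at 0.5 the loss equals 4 ln 2, and it approaches zero as the scores for the present category and the background approach 1 while the others approach 0.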
[0053] 2. Zeroth classification loss: using all candidate boxes in the pseudo-labeled box candidate clusters obtained in step S2 as pseudo-labeled boxes, a classification loss is calculated on the zeroth classification score matrix. This is a fine-grained supervision. For each pseudo-labeled box candidate cluster constructed in step S2, all of its candidate boxes are treated as pseudo-labeled boxes whose category label is the category c of that cluster. The intersection-over-union (IoU) of each candidate box with all of these pseudo-labeled boxes is then calculated, and the highest IoU and its corresponding pseudo-labeled box are recorded. If the maximum IoU of a candidate box is greater than or equal to 0.5, its score vector in the zeroth classification score matrix is encouraged to approach 1 for the corresponding category and 0 for the other categories; if the maximum IoU is at least 0.1 but below 0.5, its score vector is encouraged to approach 1 for the background class and 0 for the other categories; if the maximum IoU is less than 0.1, the candidate box is ignored. Since this is the initial training phase and there are no prior weights, an unweighted cross-entropy loss is used: $L^{0}_{\text{cls}} = -\frac{1}{N^{0}} \sum_{r:\, y^{0}_{r} \neq \text{ignore}} \log S^{0}_{r,\, y^{0}_{r}}$
[0054] where $S^{0}_{r,c}$ is the zeroth classification score of candidate box $r$ for category $c$, $y^{0}_{r}$ is the initial pseudo-label assigned to candidate box $r$ based on the pseudo-labeled boxes (the ignore label if it does not belong to any candidate cluster), and $N^{0}$ is the number of candidate boxes that have been assigned valid (non-ignored) labels.
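The IoU-based label assignment (thresholds 0.5 and 0.1) and the unweighted cross-entropy described above can be sketched as follows. The box format `(x1, y1, x2, y2)`, the `IGNORE` sentinel, and the function names are assumptions of this example, and normalizing by the number of validly labeled boxes follows the description rather than a recovered formula.

```python
import math

IGNORE = -1  # hypothetical sentinel for the ignore label


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(candidates, pseudo_boxes, background, hi=0.5, lo=0.1):
    """pseudo_boxes: list of (box, category). Returns one label per candidate."""
    labels = []
    for cand in candidates:
        best_iou, best_cat = 0.0, None
        for box, cat in pseudo_boxes:
            value = iou(cand, box)
            if value > best_iou:
                best_iou, best_cat = value, cat
        if best_iou >= hi:
            labels.append(best_cat)    # foreground: category of the best match
        elif best_iou >= lo:
            labels.append(background)  # moderate overlap: background
        else:
            labels.append(IGNORE)      # too little overlap: no supervision
    return labels


def unweighted_ce(cls_scores, labels, eps=1e-7):
    """Mean negative log-score of the assigned category over non-ignored boxes."""
    valid = [(row, y) for row, y in zip(cls_scores, labels) if y != IGNORE]
    if not valid:
        return 0.0
    return -sum(math.log(max(row[y], eps)) for row, y in valid) / len(valid)


candidates = [(0, 0, 10, 10), (0, 0, 10, 30), (50, 50, 60, 60)]
pseudo = [((0, 0, 10, 10), 0)]
labels = assign_labels(candidates, pseudo, background=2)
loss = unweighted_ce([[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.3, 0.3, 0.4]], labels)
```

In this toy case the first candidate coincides with the pseudo-labeled box (foreground), the second overlaps it with IoU 1/3 (background), and the third does not overlap it at all (ignored).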
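[0055] 3. Zeroth ignore loss: for the candidate boxes ignored in item 2 above (those whose maximum IoU is less than 0.1), the scores for categories that exist in the dataset but not in the image are encouraged to approach 0. The cross-entropy loss is denoted as: $L^{0}_{\text{ign}} = -\frac{1}{N^{0}_{\text{ign}}} \sum_{r:\, y^{0}_{r} = \text{ignore}} \sum_{c \in \bar{\mathcal{C}}} \log \left(1 - S^{0}_{r,c}\right)$
[0056] where $S^{0}_{r,c}$ is the zeroth classification score of candidate box $r$ for category $c$, $y^{0}_{r}$ is the label assigned to candidate box $r$, $N^{0}_{\text{ign}}$ is the number of candidate boxes that were ignored, and $\bar{\mathcal{C}}$ is the set of categories that do not appear in the image-level labels described in step S1.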
[0057] 4. Total module loss: the WSBDN module is trained using the image-level loss, the zeroth classification loss, and the zeroth ignore loss, and the total loss of this module is formed from these three losses. By optimizing this loss through backpropagation, the WSBDN module initially learns the feature representation and localization of objects. A key problem exists in the traditional WSDDN architecture: although the final weighted score matrix can be supervised with the image-level loss, the classification score matrix on which it is built lacks clear instance-level supervision signals. Its learning objective is therefore vague, and a significant semantic gap exists between the two matrices, which limits model performance. Figure 5 intuitively illustrates the semantic gap that may exist between the basic score matrix and the weighted score matrix in a traditional WSDDN network, and the optimization that is needed. Figure 5 (a) depicts the desired scenario, where candidate boxes that highly overlap with real objects score high and those that are far away score low; Figure 5 (b) illustrates a possible suboptimal solution in which both candidate boxes score highly, yet the image-level loss can still be satisfied after weight adjustment. By applying the zeroth classification loss and the zeroth ignore loss directly to the zeroth classification score matrix using the candidate clusters as described above, this application gives that matrix clear instance-level optimization directions, effectively narrows the semantic distance between the two matrices, and lays the foundation for overall performance improvement.
[0058] Secondly, the training of the instance tuning modules is described in detail. K instance tuning modules (for example, K = 3) can be cascaded after the WSBDN module to perform cascaded instance tuning. Here, the k-th (k = 1, 2, 3, …) instance tuning module is configured to generate a k-th classification score vector for each candidate box.
[0059] 1. Module structure: the k-th instance tuning module receives features from the backbone network and the previous modules and processes each candidate box through a fully connected layer to generate the k-th classification score feature vector $f^{k}_{r}$ of candidate box $r$. Exponential normalization along the category dimension then gives the k-th classification score vector $s^{k}_{r}$ of that candidate box: $s^{k}_{r,c} = \exp(f^{k}_{r,c}) / \sum_{c'=1}^{C+1} \exp(f^{k}_{r,c'})$
[0060] The key to the instance tuning module lies in the allocation of pseudo-labels and loss weighting, which includes steps S4.1 to S4.3: Step S4.1: Using the pseudo-label boxes filtered in step S3, assign a category label or ignore label to each candidate box.
[0061] This is a crucial step in generating training objectives for the current instance tuning module. It uses the list of pseudo-labeled boxes generated by the WSBDN module or the previous module (each box with its spatial coordinates, category, and confidence score s) to assign pseudo-labels to all candidate boxes.
[0062] The specific allocation rules are as follows: set a first IoU threshold and a second IoU threshold, denoted here $\tau_{1}$ and $\tau_{2}$, where $\tau_{1} > \tau_{2}$.
[0063] For a candidate box, calculate its intersection-over-union (IoU) with all pseudo-labeled boxes and take the maximum value, denoted here $m$.
[0064] If $m \geq \tau_{1}$, the label of the candidate box is set to the category of the pseudo-labeled box with the largest IoU. This indicates that the candidate box has sufficient overlap with the real object and is a positive sample.
[0065] If $\tau_{2} \leq m < \tau_{1}$, the label is set to the background label. This indicates that the candidate box overlaps the object only slightly and is likely background.
[0066] If $m < \tau_{2}$, the label is set to the ignore label. This is because the pseudo-labeled boxes themselves may be inaccurate or incomplete, so a candidate box that is spatially far from all pseudo-labeled boxes cannot necessarily be treated as background. For example, in an image with five people standing side by side, no other object categories, and two pseudo-labeled boxes that frame the two rightmost people, a candidate box might frame the leftmost person. The IoU of this candidate box with both pseudo-labeled boxes is 0, yet it should not be classified as background; it is foreground, specifically "person". Therefore, such a candidate box is ignored in the conventional classification loss, and no supervision signal is applied to its score vector there.
[0067] Step S4.2: k-th classification loss. Calculate the classification loss from the assigned labels. For candidate boxes assigned a foreground or background category label, the loss weight is determined by the confidence score of the pseudo-labeled box that has the largest IoU with that candidate box.
[0068] To calculate the classification loss of the k-th instance tuning module, the binary cross-entropy loss of the current instance tuning module is weighted using scores output by the previous module as weights.
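[0069] The specific formula is as follows: $L^{k}_{\text{cls}} = -\frac{1}{N^{k}} \sum_{r:\, y^{k}_{r} \neq \text{ignore}} w^{k}_{r} \sum_{c=1}^{C+1} \left[ \mathbb{1}[c = y^{k}_{r}] \log s^{k}_{r,c} + \left(1 - \mathbb{1}[c = y^{k}_{r}]\right) \log \left(1 - s^{k}_{r,c}\right) \right]$
[0070] where $s^{k}_{r,c}$ is the k-th classification score of candidate box $r$ for category $c$, $y^{k}_{r}$ is the label assigned to candidate box $r$, $N^{k}$ is the number of candidate boxes whose label satisfies $y^{k}_{r} \neq \text{ignore}$, and $w^{k}_{r}$ is the loss weight. For a candidate box with $y^{k}_{r} \neq \text{ignore}$, the weight is set to the confidence score of the pseudo-labeled box that has the largest IoU with that candidate box. This assigns greater learning weight to candidate boxes matched by high-confidence pseudo-labeled boxes, strengthening the role of reliable supervisory signals.
[0071] Candidate boxes with $y^{k}_{r} = \text{ignore}$ are not included in this conventional loss.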
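The confidence-weighted loss of step S4.2 can be sketched as below. The exact formula is not recoverable from the text, so this example assumes a per-category binary cross-entropy against a one-hot target, weighted per box and averaged over non-ignored boxes; the function name and the `IGNORE` sentinel are hypothetical.

```python
import math

IGNORE = -1  # hypothetical sentinel for the ignore label


def weighted_classification_loss(scores, labels, weights, eps=1e-7):
    """Weighted binary cross-entropy over non-ignored candidate boxes.

    scores:  R x (C+1) normalized scores from the current tuning module.
    labels:  category index per box, or IGNORE.
    weights: per-box loss weight (confidence of the best-matching pseudo box).
    """
    total, count = 0.0, 0
    for row, label, w in zip(scores, labels, weights):
        if label == IGNORE:
            continue  # ignored boxes contribute nothing to this loss
        count += 1
        for c, p in enumerate(row):
            p = min(max(p, eps), 1.0 - eps)
            target = 1.0 if c == label else 0.0
            total -= w * (target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return total / count if count else 0.0


scores = [[0.5, 0.5], [0.9, 0.1]]
labels = [0, IGNORE]
weights = [0.8, 1.0]
loss = weighted_classification_loss(scores, labels, weights)
```

Only the first box is labeled here, so the result is its two binary cross-entropy terms scaled by the weight 0.8; a box matched to a low-confidence pseudo-labeled box would influence training proportionally less.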
[0072] Step S4.3: k-th classification ignore loss. For candidate boxes that were assigned the ignore label during training, calculate a loss that drives their prediction scores for categories not present in the image-level labels toward zero.
[0073] This is a key training mechanism proposed in this application, designed to fully utilize the "negative sample" information provided by ignored candidate boxes. Although we are unsure of the true category of the ignored box, we can determine that it does not belong to any category not present in the image-level labels.
[0074] Specifically, in the above embodiments, the loss that drives the prediction scores of ignored candidate boxes toward zero for categories not present in the image-level labels can be computed through the following steps S4.3.1 to S4.3.3: Step S4.3.1: For an ignored candidate box (that is, a candidate box whose assigned label is the ignore label), obtain the score vector it outputs in the current instance tuning module.
[0075] Step S4.3.2: From the score vector, extract the score components of the categories that are marked as absent in the image-level labels (that is, the categories c whose image-level label is 0).
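[0076] Step S4.3.3: Based on the extracted score components, calculate a binary cross-entropy loss that drives these components toward zero. The formula is as follows: $L^{k}_{\text{ign}} = -\frac{1}{N^{k}_{\text{ign}}} \sum_{r=1}^{R} \mathbb{1}[y^{k}_{r} = \text{ignore}] \sum_{c:\, y_{c} = 0} \log \left(1 - s^{k}_{r,c}\right)$
[0077] where $R$ is the number of candidate boxes, $s^{k}_{r,c}$ is the k-th classification score of candidate box $r$ for category $c$, $y_{c}$ is the image-level label of category $c$, and $N^{k}_{\text{ign}}$ is the number of candidate boxes satisfying $y^{k}_{r} = \text{ignore}$. This loss forces ignored candidate boxes to output low confidence on all categories that are "definitely not present in the image", providing the model with additional, explicit optimization directions, which helps accelerate convergence and improve feature discrimination.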
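Steps S4.3.1 to S4.3.3 can be illustrated with the following sketch. It assumes the loss is averaged over the ignored candidate boxes, and the function name, the `IGNORE` sentinel, and the toy inputs are hypothetical.

```python
import math

IGNORE = -1  # hypothetical sentinel for the ignore label


def ignore_loss(scores, labels, absent_categories, eps=1e-7):
    """Push scores of ignored boxes toward 0 on categories absent from the image.

    scores: R x (C+1) normalized scores from the current tuning module.
    labels: category index per box, or IGNORE.
    absent_categories: indices of categories not in the image-level labels.
    """
    total, count = 0.0, 0
    for row, label in zip(scores, labels):
        if label != IGNORE:
            continue  # only ignored boxes take part in this loss
        count += 1
        for c in absent_categories:
            p = min(row[c], 1.0 - eps)   # keep log() finite
            total -= math.log(1.0 - p)   # binary cross-entropy with target 0
    return total / count if count else 0.0


scores = [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]
labels = [IGNORE, 0]
loss = ignore_loss(scores, labels, absent_categories=[1])
```

Only the first box is ignored, and only its score for the absent category 1 is penalized; its scores for categories that may be present in the image are left unconstrained, which matches the reasoning that the true category of an ignored box is unknown.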
[0078] Step S4.4: Add the k-th classification loss and the k-th classification ignore loss (the negative-certainty supervision) to obtain the total loss of the current instance tuning module.
[0079] Therefore, the total loss of the k-th instance tuning module based on HGPS filtering is: $L^{k} = L^{k}_{\text{cls}} + L^{k}_{\text{ign}}$, where $L^{k}_{\text{cls}}$ is the k-th classification loss and $L^{k}_{\text{ign}}$ is the k-th classification ignore loss.
[0080] Ultimately, the overall training loss of the entire network is the sum of the loss of the WSBDN module, written here $L^{0}$, and the losses of all instance tuning modules: $L = L^{0} + \sum_{k=1}^{K} L^{k}$
[0081] By optimizing this overall loss end to end, all parameters of the network can be trained together, ultimately resulting in a high-performance weakly supervised object detection model.
[0082] Step S5: Input the image to be detected into the weakly supervised object detection model and output the object detection result.
[0083] After the model completes training, it enters the pure inference or application phase. The goal of this phase is to use the trained network to perform fast and accurate object detection on new, unlabeled images. The specific operations of the inference phase include: using the last instance tuning module in the trained weakly supervised object detection network to generate class prediction scores for each candidate box in the image to be detected; assigning a final class to each candidate box based on the class prediction scores, and generating object detection boxes.
[0084] 1. Network Forward Propagation: An image to be detected is directly input into the already trained "weakly supervised object detection model". The model will sequentially perform a series of forward computation processes, including feature extraction (through the backbone network), preliminary detection and weighting (through the WSBDN module), and multi-level refinement (through cascaded instance tuning modules).
[0085] 2. Obtaining the final prediction score: during the training phase, the cascaded instance tuning modules are used sequentially to refine pseudo-labels and features. During inference, the score matrix output by the last (that is, the K-th) instance tuning module is deliberately chosen as the network's final prediction for the current image. This is because this module has undergone the most rounds of refined training guided by high-quality pseudo-labeled boxes, resulting in the most mature and stable feature discrimination and classification capabilities. The matrix element in row r and column c is the normalized confidence that the r-th candidate box belongs to category c (including background).
[0086] 3. Decoding to generate preliminary detection boxes: for each pre-generated candidate box in the image, determine its category, obtain its confidence, and apply preliminary filtering: 3.1) Determine the category: the index of the largest value in its score vector is taken as its predicted category.
[0087] 3.2) Obtain the confidence: the score of the predicted category is the detection confidence.
[0088] 3.3) Preliminary filtering: apply a low confidence threshold. If the predicted category is the background class (C + 1) or the confidence is below the threshold, the candidate box is filtered out directly and not included in the output. This step quickly removes a large number of obvious negative samples.
[0089] 4. Generate final detection results: all predicted boxes retained after preliminary filtering are grouped by category, and the following is performed for each category: Non-maximum suppression (NMS): sort all predicted boxes of the category in descending order of confidence; select the box with the highest confidence and remove every lower-confidence box whose IoU with it exceeds a set threshold; repeat this on the remaining boxes until all boxes have been processed. This step eliminates redundant detection boxes for the same object, retaining only the optimal one.
[0090] Output list: The remaining predicted bounding boxes after NMS processing constitute the final detection results for this type of object in the image. Each result includes bounding box coordinates, class label, and confidence score.
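The decoding and per-category non-maximum suppression of inference steps 3 and 4 can be sketched as follows. The threshold values 0.05 and 0.5 are placeholders chosen for this example only (the values in the original text are not recoverable), and the function names and the `(x1, y1, x2, y2)` box format are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detect(boxes, scores, background, score_thresh=0.05, nms_thresh=0.5):
    """Decode final-module scores into detections (placeholder thresholds)."""
    per_class = {}
    for box, row in zip(boxes, scores):
        category = max(range(len(row)), key=lambda c: row[c])  # arg-max category
        confidence = row[category]
        if category == background or confidence < score_thresh:
            continue  # preliminary filtering
        per_class.setdefault(category, []).append((confidence, box))

    detections = []
    for category, items in per_class.items():
        items.sort(key=lambda item: item[0], reverse=True)  # highest confidence first
        kept = []
        for confidence, box in items:
            # Keep a box only if it does not overlap a kept box too strongly.
            if all(iou(box, kept_box) <= nms_thresh for _, kept_box in kept):
                kept.append((confidence, box))
        detections.extend(
            {"box": box, "category": category, "score": confidence}
            for confidence, box in kept
        )
    return detections


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (100, 100, 110, 110)]
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
detections = detect(boxes, scores, background=1)
```

In this toy input the second box is suppressed by the first (their IoU is about 0.68), the fourth is discarded as background, and two detections of category 0 remain.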
[0091] 5. Output: The final detection results of all categories are summarized to obtain the complete object detection results of the input image, which can be output for visualization, evaluation or downstream tasks.
[0092] Thus, from training to inference, the "Weakly Supervised Object Detection Method Based on Dual-Threshold Heatmap Clustering Candidate Boxes" described in this application forms a complete, closed-loop, and efficient technical process. The training phase utilizes innovative dual-threshold heatmap clustering and negative deterministic supervision to obtain a high-quality model, while the inference phase directly leverages the model's powerful discriminative ability to achieve accurate object detection.
[0093] The technical effects of this application have been fully verified through qualitative and quantitative experiments.
[0094] For qualitative visualization, refer to Figure 6, which shows a visual comparison of the detection performance of the conventional method (OICR) and the method provided in this application (DANCE) for the "person" category, intuitively demonstrating that the method of this application generates more complete and accurate target bounding boxes. As can be seen from Figure 6, the traditional OICR method is affected by the color and texture segmentation of the human body, and its detection boxes often lock onto only the most discriminative local regions, such as the head and torso. In contrast, the detection boxes generated by the method of this application cover the entire human body contour more completely and accurately. This shows directly that the "dual-threshold heatmap clustering candidate box" strategy (HGPS) proposed in this application effectively encourages the model to attend to the complete spatial extent of the object rather than only its discriminative local areas, thus solving the typical "local focus" problem in weakly supervised object detection.
[0095] For quantitative performance evaluation, this application was fully tested on the publicly available benchmark dataset Pascal VOC, and the results are recorded in Tables 1, 2 and 3 below.
[0096] Table 1 compares the AP (%) of DANCE with other methods on the Pascal VOC 2007 test set.
[0097]
[0098] Table 2 compares DANCE with other methods on the Pascal VOC 2007 test set in terms of CorLoc (%).
[0099]
[0100] Tables 1 and 2 above detail the average accuracy (AP) of our method (DANCE) and many existing mainstream methods on various categories and the correct localization rate (CorLoc) on the training set on the Pascal VOC 2007 test set. Table 3 summarizes the overall performance comparison on the more challenging Pascal VOC 2012 test set.
[0101] Table 3 compares DANCE with other methods on the Pascal VOC 2012 test set in terms of AP (%) and CorLoc (%).
[0102]
[0103] These data consistently demonstrate that, whether only the classification branch is used or the enhanced setting adds a Fast R-CNN detector head for bounding box regression, and whether traditional MCG candidate boxes or candidate boxes generated by the more advanced SAM model are used, the proposed method significantly outperforms all existing techniques listed in the tables on both core evaluation metrics, mean average precision (mAP) and mean correct localization (mCorLoc). In the "person" category in particular, the proposed method achieves an especially large improvement in AP because it generates more complete bounding boxes. These objective and reproducible experimental data demonstrate the effectiveness and advancement of this application in improving the overall performance of weakly supervised object detection.
[0104] As can be seen from the weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes illustrated in Figure 1 above: on the one hand, high and low thresholds are applied simultaneously to the heatmap of each category, and the spatial relationship between the high-threshold boxes and the scaled low-threshold boxes is used to construct pseudo-labeled box candidate clusters. The high-threshold boxes help distinguish adjacent instances of the same type of object, while the low-threshold boxes cover the object region more completely. By filtering within the set of candidate boxes lying between the two (that is, the candidate cluster), the proposed scheme combines the location information of the heatmap with classification score information and effectively pre-selects high-quality candidate boxes that both completely cover a single object and are distinguished from other instances. This provides pseudo-supervisory signals of higher quality, closer to real annotations, for subsequent training. On the other hand, the selected high-quality pseudo-labeled boxes are used to train a weakly supervised object detection network that outputs, for each candidate box, a score vector covering all categories. First, the network's basic structure provides a more complete category semantic representation by explicitly modeling the background class, which is more closely aligned with the paradigm of fully supervised object detection. Supervising the network with more accurate pseudo-labeled boxes effectively reduces noise and error propagation during training and enables the network to learn more robust feature representations. This results in more accurate and reliable object detection results for the input image during the inference phase after training.
Second, by combining dual-threshold heatmap clustering with score filtering within candidate clusters, this application can generate pseudo-labeled boxes that can both define the entire object and distinguish adjacent individuals, even using only image-level weak labels. This allows the finally trained model to more accurately locate object boundaries during testing and effectively reduces overfitting to discriminative localities and missed detections or false merging of dense objects, thus improving the overall accuracy and practicality of object detection. In summary, the technical solution of this application solves the problem of existing technologies being unable to simultaneously guarantee the integrity of detection boxes and distinguish adjacent objects by generating pseudo-labeled boxes through dual-threshold heatmap clustering, thereby improving the overall performance of weakly supervised object detection.
[0105] Referring to Figure 7, this application provides a weakly supervised object detection device based on dual-threshold heatmap clustering candidate boxes. The device may include an acquisition module 701, a construction module 702, a filtering module 703, a training module 704, and an inference module 705, as detailed below: the acquisition module 701 is used to acquire the category-related heatmaps of the input image for each target category; the construction module 702 is used to process the category-related heatmap of each target category by applying a high threshold and a low threshold respectively, and to construct one or more pseudo-labeled box candidate clusters based on the spatial positional relationship between the threshold processing results and multiple candidate boxes pre-generated for the image; the filtering module 703 is used to select a pseudo-labeled box from the candidate boxes contained in each pseudo-labeled box candidate cluster based on a classification score matrix; the training module 704 is used to train the weakly supervised object detection network using the selected pseudo-labeled boxes to obtain a weakly supervised object detection model; and the inference module 705 is used to input the image to be detected into the weakly supervised object detection model and output the object detection result.
[0106] As can be seen from the weakly supervised object detection device based on dual-threshold heatmap clustering candidate boxes illustrated in Figure 7 above: on the one hand, high and low thresholds are applied simultaneously to the heatmap of each category, and the spatial relationship between the high-threshold boxes and the scaled low-threshold boxes is used to construct pseudo-labeled box candidate clusters. The high-threshold boxes help distinguish adjacent instances of the same type of object, while the low-threshold boxes cover the object region more completely. By filtering within the set of candidate boxes lying between the two (that is, the candidate cluster), the scheme of this application combines the location information of the heatmap with classification score information and effectively pre-selects high-quality candidate boxes that both completely cover a single object and are distinguished from other instances. This provides pseudo-supervisory signals of higher quality, closer to real annotations, for subsequent training. On the other hand, the selected high-quality pseudo-labeled boxes are used to train a weakly supervised object detection network that outputs, for each candidate box, a score vector covering all categories. First, the network's basic structure provides a more complete category semantic representation by explicitly modeling the background class, which is more closely aligned with the paradigm of fully supervised object detection. Supervising the network with more accurate pseudo-labeled boxes effectively reduces noise and error propagation during training and enables the network to learn more robust feature representations. This results in more accurate and reliable object detection results for the input image during the inference phase after training.
Second, by combining dual-threshold heatmap clustering with score filtering within candidate clusters, this application can generate pseudo-labeled boxes that can both define the entire object and distinguish adjacent individuals, even using only image-level weak labels. This allows the finally trained model to more accurately locate object boundaries during testing and effectively reduces overfitting to discriminative localities and missed detections or false merging of dense objects, thus improving the overall accuracy and practicality of object detection. In summary, the technical solution of this application solves the problem of existing technologies being unable to simultaneously guarantee the integrity of detection boxes and distinguish adjacent objects by generating pseudo-labeled boxes through dual-threshold heatmap clustering, thereby improving the overall performance of weakly supervised object detection.
[0107] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 8, the electronic device 8 in this embodiment mainly includes: a processor 80, a memory 81, and a computer program 82 stored in the memory 81 and executable on the processor 80, such as a program for the weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes. When the processor 80 executes the computer program 82, it implements the steps in the above-described embodiment of the weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes, for example steps S101 to S105 shown in Figure 1. Alternatively, when the processor 80 executes the computer program 82, it implements the functions of each module / unit in the above-described device embodiments, for example the functions of the acquisition module 701, construction module 702, filtering module 703, training module 704, and inference module 705 shown in Figure 7.
[0108] For example, the computer program 82 of the weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes mainly includes: acquiring the category-related heatmap of the input image for each target category; applying a high threshold and a low threshold respectively to the category-related heatmap of each target category, and constructing one or more pseudo-labeled box candidate clusters based on the spatial relationship between the threshold processing results and multiple candidate boxes pre-generated for the image I; for each pseudo-labeled box candidate cluster, selecting a pseudo-labeled box from the candidate boxes contained in the cluster based on a classification score matrix; using the selected pseudo-labeled boxes to train a weakly supervised object detection network to obtain a weakly supervised object detection model; and inputting the image to be detected into the weakly supervised object detection model and outputting the object detection result. The computer program 82 can be divided into one or more modules / units, which are stored in the memory 81 and executed by the processor 80 to complete this application. The one or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 82 in the electronic device 8. For example, the computer program 82 can be divided into the functions of the acquisition module 701, construction module 702, filtering module 703, training module 704, and inference module 705 (modules in the virtual device).
The specific functions of each module are as follows: the acquisition module 701 is used to acquire the category-related heatmap of the input image for each target category; the construction module 702 is used to process the category-related heatmap of each target category by applying a high threshold and a low threshold respectively, and to construct one or more pseudo-labeled box candidate clusters based on the spatial relationship between the threshold processing results and multiple candidate boxes pre-generated for the image; the filtering module 703 is used to select one pseudo-labeled box from the candidate boxes contained in each pseudo-labeled box candidate cluster based on a classification score matrix; the training module 704 is used to train the weakly supervised object detection network using the selected pseudo-labeled boxes to obtain a weakly supervised object detection model; and the inference module 705 is used to input the image to be detected into the weakly supervised object detection model and output the object detection result.
[0109] The electronic device 8 may include, but is not limited to, the processor 80 and the memory 81. Those skilled in the art will understand that Figure 8 is merely an example of the electronic device 8 and does not constitute a limitation on it. The electronic device 8 may include more or fewer components than shown, combine certain components, or use different components. For example, the electronic device may also include input / output devices, network access devices, buses, etc.
[0110] The processor 80 may be a graphics processing unit (GPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
[0111] The memory 81 can be an internal storage unit of the electronic device 8, such as a hard disk or RAM. The memory 81 can also be an external storage device of the electronic device 8, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card. Furthermore, the memory 81 can include both internal and external storage units of the electronic device 8. The memory 81 is used to store computer programs and other programs and data required by the electronic device. The memory 81 can also be used to temporarily store data that has been output or will be output.
[0112] If integrated modules / units are implemented as software functional units and sold or used as independent products, they can be stored in a storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program for the weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes can be stored in a storage medium. When executed by a processor, this computer program can implement the steps of the various method embodiments described above, namely: acquiring the category-related heatmap of the input image for each target category; applying a high threshold and a low threshold to the category-related heatmap of each target category respectively, and constructing one or more pseudo-labeled box candidate clusters based on the spatial positional relationship between the threshold processing results and multiple candidate boxes pre-generated for the image; for each pseudo-labeled box candidate cluster, selecting a pseudo-labeled box from the candidate boxes contained in that cluster based on a classification score matrix; training the weakly supervised object detection network with the selected pseudo-labeled boxes to obtain the weakly supervised object detection model; and inputting the image to be detected into the weakly supervised object detection model and outputting the object detection result. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or some intermediate form.
The storage medium can include: any entity or device capable of carrying computer program code, recording media, USB flash drives, external hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the storage medium can be appropriately added or removed according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the storage medium does not include electrical carrier signals and telecommunication signals.
[0113] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. These modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application. The specific embodiments described above further illustrate the purpose, technical solutions, and beneficial effects of this application. It should be understood that the above descriptions are merely specific embodiments of this application and are not intended to limit the protection scope of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
Claims
1. A weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes, characterized in that the method includes: Step S1: obtaining a category-related heatmap of the input image for each target category; Step S2: applying a high threshold and a low threshold to the category-related heatmap of each target category respectively, and constructing one or more pseudo-labeled box candidate clusters based on the spatial positional relationship between the threshold processing results and multiple candidate boxes pre-generated for the image; Step S3: for each pseudo-labeled box candidate cluster, selecting a pseudo-labeled box from the candidate boxes contained in that cluster based on a classification score matrix; Step S4: training a weakly supervised object detection network with the selected pseudo-labeled boxes to obtain a weakly supervised object detection model; Step S5: inputting the image to be detected into the weakly supervised object detection model and outputting the object detection result.
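Step S3 of claim 1 can be illustrated with a minimal sketch. The claim only states that selection is "based on a classification score matrix"; the rule used here (take the candidate box with the highest score for the cluster's category) is an assumption, as are the function name `select_pseudo_box`, the data layout, and the example values.

```python
from typing import Sequence


def select_pseudo_box(cluster: Sequence[int],
                      scores: Sequence[Sequence[float]],
                      category: int) -> int:
    """Return the index of the candidate box in `cluster` that has the
    highest classification score for `category` (assumed selection rule)."""
    if not cluster:
        raise ValueError("cluster must contain at least one candidate box")
    return max(cluster, key=lambda box_idx: scores[box_idx][category])


# scores[i][c]: classification score of candidate box i for category c
# (hypothetical values, 4 candidate boxes and 2 categories).
scores = [
    [0.10, 0.70],
    [0.60, 0.20],
    [0.30, 0.90],
    [0.80, 0.05],
]
# One candidate cluster per category, given as candidate-box indices.
clusters = {0: [1, 3], 1: [0, 2]}
selected = {c: select_pseudo_box(idxs, scores, c) for c, idxs in clusters.items()}
print(selected)
```

Each cluster yields exactly one pseudo-labeled box, so an image with several clusters for the same category would yield several pseudo-labeled boxes for that category.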
2. The weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes according to claim 1, characterized in that applying a high threshold and a low threshold to the category-related heatmap of each target category respectively, and constructing one or more pseudo-labeled box candidate clusters based on the spatial positional relationship between the threshold processing results and multiple candidate boxes pre-generated for the image, includes: Step S2.1: applying the high threshold and the low threshold to the category-related heatmap respectively to obtain high-threshold connected component masks and low-threshold connected component masks; Step S2.2: extracting the minimum bounding rectangle of each connected component mask, and using these rectangles as the high-threshold boxes and low-threshold boxes respectively; Step S2.3: determining, based on the subordination relationship between the high-threshold boxes and the low-threshold boxes, the box pair combinations used to construct the pseudo-labeled box candidate clusters; Step S2.4: for each box pair combination, grouping all candidate boxes whose spatial location lies between the high-threshold box and the scaled low-threshold box of that combination into the same pseudo-labeled box candidate cluster.
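Steps S2.1 and S2.2 can be sketched as follows. This is a plain-Python illustration, not the applicant's implementation: the use of 4-connectivity, the `>=` comparison, the function name `threshold_boxes`, and the threshold values 0.7 and 0.3 are all assumptions made for the example.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixels


def threshold_boxes(heatmap: List[List[float]], threshold: float) -> List[Box]:
    """Binarize `heatmap` at `threshold` and return the tight bounding box
    of every 4-connected component of the resulting mask."""
    height, width = len(heatmap), len(heatmap[0])
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or heatmap[y0][x0] < threshold:
                continue
            # Flood-fill one connected component and track its extent.
            x_min = x_max = x0
            y_min = y_max = y0
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and heatmap[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy heatmap: one broad activation region containing two separate peaks.
heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.4, 0.4, 0.4, 0.0],
    [0.0, 0.4, 0.9, 0.4, 0.8, 0.0],
    [0.0, 0.4, 0.4, 0.4, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
high_boxes = threshold_boxes(heatmap, threshold=0.7)  # hypothetical high threshold
low_boxes = threshold_boxes(heatmap, threshold=0.3)   # hypothetical low threshold
print(high_boxes, low_boxes)
```

In this toy input the low threshold yields a single box covering the whole activation region, while the high threshold splits it into two small boxes around the peaks, which is the situation the box pair combinations of step S2.3 are meant to handle.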
3. The weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes according to claim 2, characterized in that determining, based on the subordination relationship between the high-threshold boxes and the low-threshold boxes, the box pair combinations used to construct the pseudo-labeled box candidate clusters includes: if a low-threshold box has no high-threshold box subordinate to it, treating the low-threshold box itself as an independent box pair combination; if a low-threshold box has exactly one high-threshold box subordinate to it, forming one box pair combination from that high-threshold box and the low-threshold box; if a low-threshold box has multiple high-threshold boxes subordinate to it, forming a separate box pair combination from each of those high-threshold boxes and the low-threshold box, so that multiple box pair combinations are obtained.
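The three cases of claim 3 can be sketched as below. The claim does not define how subordination is tested; this example assumes a high-threshold box is subordinate to a low-threshold box when it lies entirely inside it. The names `contains` and `build_box_pairs` and the use of `None` to mark an independent low-threshold box are illustrative choices.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
BoxPair = Tuple[Optional[Box], Box]      # (high-threshold box or None, low-threshold box)


def contains(outer: Box, inner: Box) -> bool:
    """True if `inner` lies entirely inside `outer` (assumed subordination test)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def build_box_pairs(high_boxes: List[Box], low_boxes: List[Box]) -> List[BoxPair]:
    """Form box pair combinations from the subordination relationship."""
    pairs: List[BoxPair] = []
    for low in low_boxes:
        children = [high for high in high_boxes if contains(low, high)]
        if not children:
            # No subordinate high-threshold box: the low box stands alone.
            pairs.append((None, low))
        else:
            # One or several subordinate boxes: one pair per high-threshold box.
            pairs.extend((high, low) for high in children)
    return pairs


low_boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (40, 0, 50, 10)]
high_boxes = [(2, 2, 4, 4), (6, 6, 8, 8), (22, 2, 26, 6)]
pairs = build_box_pairs(high_boxes, low_boxes)
for pair in pairs:
    print(pair)
```

The first low-threshold box contributes two pairs, the second contributes one, and the third is kept as an independent entry with no high-threshold box.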
4. The weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes according to claim 3, characterized in that, for each box pair combination, grouping all candidate boxes whose spatial location lies between the high-threshold box and the scaled low-threshold box of that combination into the same pseudo-labeled box candidate cluster includes: enlarging the low-threshold box of the box pair combination; and including in the pseudo-labeled box candidate cluster corresponding to the box pair combination every candidate box whose spatial location simultaneously satisfies the following two conditions: its intersection over union (IoU) with the high-threshold box is greater than a first threshold, and its IoU with the enlarged low-threshold box is less than a second threshold.
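The two IoU conditions of claim 4 can be written out directly. The sketch below follows the claim wording literally; the enlargement about the box center, the scale factor 1.5, and the threshold values 0.3 and 0.5 are assumptions chosen only to make the example concrete, and the helper names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def enlarge(box: Box, scale: float) -> Box:
    """Scale a box about its center (assumed enlargement scheme)."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    half_w = (box[2] - box[0]) * scale / 2
    half_h = (box[3] - box[1]) * scale / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def cluster_candidates(candidates: List[Box], high: Box, low: Box,
                       scale: float, first_thr: float,
                       second_thr: float) -> List[int]:
    """Indices of candidates with IoU(high) > first_thr and
    IoU(enlarged low) < second_thr, as worded in the claim."""
    big_low = enlarge(low, scale)
    return [i for i, box in enumerate(candidates)
            if iou(box, high) > first_thr and iou(box, big_low) < second_thr]


high, low = (4.0, 4.0, 6.0, 6.0), (2.0, 2.0, 8.0, 8.0)
candidates = [(4.0, 4.0, 7.0, 7.0), (0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 22.0, 22.0)]
cluster = cluster_candidates(candidates, high, low,
                             scale=1.5, first_thr=0.3, second_thr=0.5)
print(cluster)
```

With these values only the first candidate is kept: the second overlaps the high-threshold box too weakly, and the third does not overlap either box.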
5. The weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes according to claim 1, characterized in that the weakly supervised object detection network includes a basic multi-instance detection network module and at least one cascaded instance tuning module; the basic multi-instance detection network module is configured to: generate a zero-class score feature vector and a zero-class weight feature vector for each candidate box; stack the zero-class score feature vectors and zero-class weight feature vectors of all candidate boxes to obtain a zero-class score feature matrix and a zero-class weight feature matrix; normalize the zero-class score feature matrix along the category dimension to obtain a zero-class score matrix, and normalize the zero-class weight feature matrix along the candidate box dimension to obtain a zero-class weight matrix; multiply the zero-class score matrix and the zero-class weight matrix element-wise to obtain a zero-weighted classification score matrix; and sum the zero-weighted classification score matrix along the candidate box dimension to obtain an image-level prediction score.
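The computation in the basic multi-instance detection network module of claim 5 can be sketched in plain Python. The claim says only "normalize"; this example assumes softmax normalization, which is common in multiple-instance detection heads, and the function names `softmax` and `mil_head` and the input values are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # rows: candidate boxes, columns: categories


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax over a list of values."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_head(score_features: Matrix,
             weight_features: Matrix) -> Tuple[Matrix, List[float]]:
    """Return (weighted classification score matrix, image-level scores)."""
    num_boxes = len(score_features)
    num_classes = len(score_features[0])
    # Score matrix: normalize each row, i.e. along the category dimension.
    score = [softmax(row) for row in score_features]
    # Weight matrix: normalize each column, i.e. along the candidate box dimension.
    columns = [softmax([weight_features[i][c] for i in range(num_boxes)])
               for c in range(num_classes)]
    weight = [[columns[c][i] for c in range(num_classes)] for i in range(num_boxes)]
    # Element-wise product, then sum over candidate boxes for each category.
    weighted = [[s * w for s, w in zip(s_row, w_row)]
                for s_row, w_row in zip(score, weight)]
    image_scores = [sum(weighted[i][c] for i in range(num_boxes))
                    for c in range(num_classes)]
    return weighted, image_scores


score_features = [[2.0, 0.5], [0.1, 1.5], [0.3, 0.2]]   # 3 boxes, 2 categories
weight_features = [[1.0, 0.0], [0.2, 2.0], [0.1, 0.1]]
weighted, image_scores = mil_head(score_features, weight_features)
print(image_scores)
```

Because each weight column sums to one and every score lies in (0, 1), each image-level score also lies in (0, 1), so it can be compared against the image-level label with a binary classification loss.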
6. The weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes according to claim 1, characterized in that Step S4 further includes: Step S4.3: for candidate boxes that are assigned an ignore label during training, calculating a loss that drives their prediction scores for categories not present in the image-level labels toward zero.
7. The weakly supervised object detection method based on dual-threshold heatmap clustering candidate boxes according to claim 6, characterized in that calculating, for candidate boxes that are assigned an ignore label during training, a loss that drives their prediction scores for categories not present in the image-level labels toward zero includes: for each ignored candidate box, obtaining its score vector output by the weakly supervised object detection network; extracting from the score vector the score components corresponding to the categories marked as absent in the image-level labels; and calculating, based on the extracted score components, a binary cross-entropy loss that drives these score components toward zero.
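The loss of claim 7 can be sketched as follows. With a target of zero, the binary cross-entropy of a score p reduces to -log(1 - p). Averaging over all extracted components, the clamping constant `eps`, and the function name `ignored_box_loss` are assumptions of this sketch; the claim does not specify a reduction.

```python
import math
from typing import List, Sequence, Set


def ignored_box_loss(scores: Sequence[Sequence[float]],
                     ignored: Sequence[int],
                     present: Set[int],
                     eps: float = 1e-7) -> float:
    """Mean binary cross-entropy (target 0) over the scores that ignored
    candidate boxes assign to categories absent from the image-level label."""
    num_classes = len(scores[0])
    absent = [c for c in range(num_classes) if c not in present]
    terms: List[float] = []
    for box_idx in ignored:
        for c in absent:
            # Clamp to keep log(1 - p) finite when p is exactly 1.
            p = min(max(scores[box_idx][c], 0.0), 1.0 - eps)
            terms.append(-math.log(1.0 - p))
    return sum(terms) / len(terms) if terms else 0.0


# scores[i][c]: predicted score of candidate box i for category c (hypothetical).
scores = [
    [0.9, 0.5, 0.0],
    [0.2, 0.0, 0.0],
]
# Box 0 carries the ignore label; only category 0 is present in the image.
loss = ignored_box_loss(scores, ignored=[0], present={0})
print(round(loss, 4))
```

Scores for categories that are present in the image-level label do not contribute, so the term only penalizes ignored boxes for predicting categories known to be absent from the image.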
8. A weakly supervised object detection device based on dual-threshold heatmap clustering candidate boxes, characterized in that the device includes: an acquisition module, used to acquire a category-related heatmap of the input image for each target category; a construction module, used to apply a high threshold and a low threshold to the category-related heatmap of each target category respectively, and to construct one or more pseudo-labeled box candidate clusters based on the spatial positional relationship between the threshold processing results and multiple candidate boxes pre-generated for the image; a filtering module, used to select a pseudo-labeled box from the candidate boxes contained in each pseudo-labeled box candidate cluster based on a classification score matrix; a training module, used to train the weakly supervised object detection network with the selected pseudo-labeled boxes to obtain the weakly supervised object detection model; and an inference module, used to input the image to be detected into the weakly supervised object detection model and output the object detection result.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 7.
10. A storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 7.