Weakly supervised object detection method based on adversarial co-learning
By designing an adversarial collaborative learning network structure and loss function, the problems of labeled data dependence and low computational efficiency in weakly supervised object detection are solved, achieving high-efficiency object detection results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2024-06-25
- Publication Date
- 2026-06-26
AI Technical Summary
Fully supervised object detection relies on large amounts of accurately labeled instance-level data, while existing weakly supervised target detection methods are computationally inefficient and have difficulty accurately representing general information about classes, leading to insufficient detection accuracy.
We employ an adversarial collaborative learning approach, which consists of two subnetworks with identical structures but different initializations. We design various loss functions for interactive adversarial training and utilize image-level coarse annotation information to improve detection accuracy.
It improves the accuracy of weakly supervised target detection, reduces annotation costs, enhances the network's ability to perceive complete objects, and reduces computational resource consumption.
Figure CN118644729B_ABST
Description
Technical Field
[0001] This invention relates to the field of weakly supervised target detection technology, and in particular to a weakly supervised target detection method based on adversarial collaborative learning.
Background Technology
[0002] The primary task of object detection is to spatially locate objects in an image and identify their categories. With the rapid development of deep learning technology, object detection has made significant progress in multiple fields. Existing advanced object detection algorithms rely on large amounts of accurate instance-level labeled data for fully supervised training. This labeled data not only consumes considerable time and effort but also incurs high costs. Therefore, how to perform effective object detection in the absence of fine annotations has become a pressing problem.
[0003] Most existing weakly supervised object detection methods typically follow the Multiple Instance Learning (MIL) paradigm, using a two-stream structure for multi-class classification of region proposals. However, due to the intra-class diversity of objects within the same class and the lack of location information, the classification network tends to focus on local discriminative regions, easily leading to lower scores for proposal regions containing more complete objects, resulting in misclassification or missed detection. Furthermore, existing methods rely on hand-designed enhancement strategies or class feature libraries to eliminate bias in discriminative regions, but these methods are computationally inefficient and cannot accurately represent general class information. The paper "Proposal cluster learning for weakly supervised object detection" follows the MIL paradigm with its Multiple Instance Detection Network (MIDN), modeling WSOD (Weakly Supervised Object Detection) as a multi-class classification problem on a set of region proposals. Some existing improvements, such as C-MIDN and P-MIDN, attempt to remove the highest-scoring discriminative regions to find the remaining discriminative parts by designing complementary MIDN modules. However, these methods have limited effectiveness when dealing with problems with multiple instances of the same class. Furthermore, IM-CFB and NDI introduce class feature libraries to collect intra-class diversity information and absolute error classification information, using these libraries to guide model training and improve the network's ability to learn diverse features. However, these methods rely on statistical information from the global dataset and are limited by the design of the class feature libraries, making it difficult to accurately represent the general information of classes. 
ODCL and IENET methods are based on the idea of consistency regularization, using image enhancement and feature enhancement strategies to improve the discriminative ability of classification networks. However, due to the specific nature of object detection tasks, some regions in the image are not regions of interest (ROIs), and the method of enhancing the image and then extracting features from the ROI region is computationally inefficient. In addition, both image enhancement and feature enhancement rely on manual design methods, making it difficult to guarantee that the network can learn useful data diversity.
[0004] In summary, the shortcomings of existing technologies are:
[0005] 1. Reliance on large amounts of accurate labeled data: Advanced object detection algorithms require a large amount of accurate instance-level labeled data for fully supervised training. This labeling process is both time-consuming and expensive, limiting its application scope.
[0006] 2. Limitations of the Multiple Instance Learning (MIL) method: The MIL method performs multi-class classification of region proposals through a two-stream structure. However, due to the intra-class diversity of objects of the same class and the lack of location information, the classification task can only distinguish proposal boxes containing discriminative regions. Proposal regions containing more complete objects have lower scores and are easily misclassified or missed.
[0007] 3. Reliance on manually designed augmentation strategies: Methods such as ODCL and IENET rely on image augmentation and feature augmentation to improve the discriminative ability of classification networks. However, these augmentation methods are hand-designed, making it difficult to ensure that the network learns useful data diversity. Furthermore, performing augmentation on the image and then extracting ROI region features is computationally inefficient.
[0008] 4. Design limitations of class feature libraries: IM-CFB and NDI methods improve the network's ability to learn diverse features by collecting intra-class diversity information and absolute error discrimination information through class feature libraries. However, these methods are limited by the design of class feature libraries and have difficulty accurately representing the general information of classes.
Summary of the Invention
[0009] The purpose of this invention is to provide a weakly supervised target detection method based on adversarial collaborative learning to improve the accuracy of weakly supervised target detection.
[0010] The objective of this invention can be achieved through the following technical solutions:
[0011] A weakly supervised object detection method based on adversarial collaborative learning includes the following steps:
[0012] The image data to be detected is acquired, input into a pre-trained adversarial cooperative network, and the target detection result is output. The adversarial cooperative network includes two peer network models with the same structure. Each peer network model is a standard WSOD model. The WSOD model includes a candidate region feature extractor and a task head. The task head includes a multi-instance detection module, an online instance classification module, and a detection head.
[0013] The steps for training the adversarial cooperative network include:
[0014] Obtain image data and its corresponding candidate bounding boxes;
[0015] The image data and its corresponding candidate boxes are input into the candidate region feature extractors of their respective WSOD models to extract the candidate region features.
[0016] The candidate region features are respectively input into the multi-instance detection module, online instance classification module and detection head of their respective task heads for processing, to obtain image-level classification scores and pseudo-labels, candidate box category prediction probabilities and pseudo-labels, and target instance location and category predictions, respectively.
[0017] Based on the image-level classification score, the category prediction probability of the candidate box, and the location and category prediction of the target instance, multiple loss functions are constructed to conduct interactive adversarial training on the two WSOD models until the training ends.
[0018] Furthermore, the candidate bounding boxes corresponding to the image data are obtained using the Selective Search algorithm.
[0019] Furthermore, the candidate region feature extractor includes a backbone network layer, an RoI pooling layer, and two fully connected layers. The step of extracting candidate region features includes:
[0020] Based on the image data, the backbone network layer is used to extract image features;
[0021] The image features and candidate bounding boxes are input into the RoI pooling layer and two fully connected layers for processing to obtain candidate region features.
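The extraction steps above can be sketched as follows. This is a minimal illustration only, assuming the backbone output is already available as a small feature map and that candidate boxes are given in feature-map cells; the function names `roi_max_pool` and `region_features`, the 2x2 pooled size, and the ReLU activations are hypothetical choices and are not specified in the source.

```python
import numpy as np

def roi_max_pool(feature_map, box, out_size=2):
    """Max-pool one box into an out_size x out_size grid.

    feature_map: (H, W, C) array standing in for the backbone output.
    box: (x1, y1, x2, y2) in feature-map cells, x2/y2 exclusive, x1 < x2, y1 < y2.
    """
    x1, y1, x2, y2 = box
    xs = np.linspace(x1, x2, out_size + 1).astype(int)
    ys = np.linspace(y1, y2, out_size + 1).astype(int)
    out = np.zeros((out_size, out_size, feature_map.shape[2]))
    for i in range(out_size):
        for j in range(out_size):
            # Guarantee that each bin covers at least one cell.
            y_hi = max(ys[i + 1], ys[i] + 1)
            x_hi = max(xs[j + 1], xs[j] + 1)
            out[i, j] = feature_map[ys[i]:y_hi, xs[j]:x_hi].max(axis=(0, 1))
    return out

def region_features(feature_map, boxes, w1, w2):
    """RoI pooling followed by two fully connected layers (ReLU is an assumption)."""
    pooled = np.stack([roi_max_pool(feature_map, b).ravel() for b in boxes])
    hidden = np.maximum(pooled @ w1, 0.0)
    return np.maximum(hidden @ w2, 0.0)

feature_map = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled_full = roi_max_pool(feature_map, (0, 0, 4, 4))
feats = region_features(feature_map, [(0, 0, 4, 4), (1, 1, 3, 3)],
                        np.ones((4, 3)), np.ones((3, 2)))
```

In a real pipeline the box coordinates would first be scaled from image pixels to feature-map cells by the backbone stride, which this sketch omits.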
[0022] Furthermore, the multi-instance detection module includes two parallel branch streams: a classification stream and a detection stream. Both the classification stream and the detection stream include fully connected layers and softmax layers. The steps for the multi-instance detection module to obtain the image-level classification score include:
[0023] For the classification stream, the candidate region features generate a classification score matrix $P_{cls}$ through a fully connected layer, and the softmax layer is then applied to $P_{cls}$ along the category dimension to generate classification scores;
[0024] For the detection stream, the candidate region features generate a detection score matrix $P_{det}$ through a fully connected layer, and the softmax layer is then applied to $P_{det}$ along the candidate box dimension to generate detection scores;
[0025] Based on the classification score and detection score, an image-level classification score is generated by summing their product over the candidate boxes. The expression for the image-level classification score is as follows:
[0026] $P_{img} = \sum (P_{cls} * P_{det})$
[0027] In the formula, $P_{img}$ is the image-level classification score.
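The two-stream scoring described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the fully connected layers are reduced to plain weight matrices without bias, and the names `midn_image_scores`, `w_cls`, and `w_det` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def midn_image_scores(feats, w_cls, w_det):
    """feats: (R, D) candidate region features; w_cls, w_det: (D, C) weights."""
    p_cls = softmax(feats @ w_cls, axis=1)  # softmax over categories, per box
    p_det = softmax(feats @ w_det, axis=0)  # softmax over boxes, per category
    proposal_scores = p_cls * p_det         # element-wise product, shape (R, C)
    p_img = proposal_scores.sum(axis=0)     # sum over boxes, shape (C,)
    return p_img, proposal_scores

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))
w_cls = rng.normal(size=(8, 3))
w_det = rng.normal(size=(8, 3))
p_img, proposal_scores = midn_image_scores(feats, w_cls, w_det)
```

Because each column of the detection scores sums to 1 and every classification score is below 1, each entry of `p_img` lies strictly between 0 and 1, so it can be read as a per-category probability for the image.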
[0028] Furthermore, the loss function includes an image-level label loss function, which is obtained by directly supervising the image-level classification score through image-level labeling. This image-level label loss function is used for supervised training of the multi-instance detection module. The expression for the image-level label loss function is:
[0029]
[0030] In the formula, $L_{mil}$ is the image-level label loss; the image label indicates whether the image contains category c; and the image classification prediction score represents the predicted probability that the image contains category c.
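The formula itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the multi-label binary cross-entropy that is commonly used for image-level supervision in MIL-based detectors; the function name `image_level_label_loss` and the clipping constant `eps` are hypothetical, and the exact form used by the invention may differ.

```python
import numpy as np

def image_level_label_loss(p_img, y_img, eps=1e-7):
    """Assumed form: binary cross-entropy summed over categories.

    p_img: (C,) predicted probability that the image contains each category.
    y_img: (C,) image-level labels, 1 if the category is present, else 0.
    """
    p = np.clip(p_img, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.sum(y_img * np.log(p) + (1.0 - y_img) * np.log(1.0 - p)))

y = np.array([1.0, 0.0, 0.0])
loss_good = image_level_label_loss(np.array([0.9, 0.1, 0.1]), y)
loss_bad = image_level_label_loss(np.array([0.1, 0.9, 0.9]), y)
```

Predictions that agree with the image-level labels give a smaller loss than predictions that contradict them.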
[0031] Furthermore, the online instance classification module includes multiple cascaded online classifiers, which progressively refine the predicted categories of candidate boxes to obtain the final category prediction probabilities of the candidate boxes.
[0032] Furthermore, the detection head includes two parallel branches, each of which includes a classifier and a regressor. The classifier and regressor generate category prediction and location prediction of the target instance based on candidate region features, respectively. In the process of generating category prediction and location prediction of the target instance, the detection head is supervised by pseudo-labels generated at the last level of the online instance classification module.
[0033] Furthermore, during the interactive adversarial training process, the pseudo-labels generated by the multi-instance detection module and the online instance classification module in one WSOD model serve as supervision signals for the multi-instance detection module and the online instance classification module in another WSOD model, generating classification loss and detection loss respectively. The expressions for the classification loss and detection loss functions are as follows:
[0034]
[0035] In the formula, $L_{cls}$ is the classification loss, $L_{det}$ is the detection loss, R is the number of candidate region features, CE represents the cross-entropy loss, and $R_p$ is the number of positive-sample candidate region features. The remaining terms denote the classification and detection pseudo-labels and the predictions of the classification branch and the detection branch for the r-th candidate region feature, respectively, and SmoothL1(·) is the smooth absolute value (smooth L1) loss.
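Since the expressions are not reproduced in this text, the following is only a sketch consistent with the symbol definitions: a cross-entropy term averaged over the R candidate regions and a smooth L1 term averaged over the $R_p$ positive samples, both computed against pseudo-labels supplied by the peer network. The function names, the `beta` parameter, and the exact normalization are assumptions.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-7):
    # probs: (R, C) predicted probabilities; labels: (R,) integer pseudo-labels.
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(picked, eps, 1.0))

def smooth_l1(pred, target, beta=1.0):
    # Quadratic near zero, linear elsewhere; summed over box coordinates.
    diff = np.abs(pred - target)
    per_elem = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return per_elem.sum(axis=-1)

def peer_supervised_losses(cls_probs, box_preds, peer_cls_labels,
                           peer_box_targets, positive_mask):
    """Losses of one network, supervised by pseudo-labels from its peer."""
    l_cls = cross_entropy(cls_probs, peer_cls_labels).mean()  # average over R
    r_p = max(int(positive_mask.sum()), 1)                    # number of positives
    l_det = (smooth_l1(box_preds, peer_box_targets) * positive_mask).sum() / r_p
    return float(l_cls), float(l_det)

cls_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
peer_labels = np.array([0, 1])
box_preds = np.zeros((2, 4))
peer_boxes = np.array([[0.5, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]])
positive_mask = np.array([1.0, 0.0])
l_cls, l_det = peer_supervised_losses(cls_probs, box_preds, peer_labels,
                                      peer_boxes, positive_mask)
```

Calling the same function a second time with the roles of the two networks swapped gives the symmetric half of the mutual supervision.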
[0036] Furthermore, the loss function includes a feature similarity loss function, which constrains the distance between the candidate region features and the anchor features of their corresponding categories. The anchor features of the categories are obtained during training based on the probability prediction of the high-scoring candidate box categories for each category. The expression for the feature similarity loss function is:
[0037]
[0038] In the formula, K is the number of features for each class; $f_r$ and the class-feature term represent the r-th candidate region feature and the k-th class feature of the c-th category, respectively; $L_{simi}$ is the feature similarity loss; R is the number of candidate region features; $M_r$ is an indicator function denoting whether the r-th candidate region feature is a positive sample, with 1 indicating a positive sample; d is the maximum value of the reliability of all class features; and the remaining r-indexed term is the average similarity between the r-th candidate region and all class features.
[0039] Furthermore, the loss function includes a one-to-one differential loss function, which is formed by using candidate region features obtained from one WSOD model as labels to supervise the prediction of candidate region features obtained from another WSOD model. The expression of the one-to-one differential loss function is as follows:
[0040]
[0041] In the formula, the distance term is the feature distance calculation function, and the 1 ensures that the calculated distance takes values in the range [0, 1]; the two feature terms are the r-th candidate region feature of the first network and the r-th candidate region feature of the second network, respectively; R is the number of candidate region features; Δ indicates no gradient propagation; and $L_{dis}$ is the one-to-one differential loss.
[0042] Compared with the prior art, the present invention has the following beneficial effects:
[0043] (1) This invention designs a network based on adversarial co-learning, which consists of two subnetworks with the same structure but different initializations. By designing a loss function, the network is prompted to learn different representations of the same instance. By constraining the consistency of prediction results of different representations, feature co-learning is carried out to mine more valuable feature representations, thereby improving the perception of complete objects in the image and improving the accuracy of weakly supervised target detection.
[0044] (2) This invention collects general information for each category iteratively by combining two subnets, and limits the difference in feature representation to a certain range of statistical information through feature similarity loss constraints. This enables the network to perceive different feature semantic information while limiting the range of semantic information representation to reduce interference from irrelevant information. The class feature library only serves to limit the range of feature representation, thus avoiding the need for more accurate general features.
[0045] (3) The one-to-one differential loss of this invention selectively widens the features of the same instance region from two networks, forcing the two networks to extract semantic information of the same spatial location from different perspectives, thereby improving the network's ability to perceive complete objects. The range of feature widening is limited at both the feature level and the task level.
[0046] (4) This invention uses a weakly supervised learning method to train the network using only image-level coarse annotation information, thereby reducing the annotation cost.
[0047] (5) This invention enables the network to learn different representations of the same instance through feature adversarial learning and consistency constraints, and to perform collaborative learning through consistency constraints on the prediction results of different representations.
Attached Figure Description
[0048] Figure 1 is a schematic diagram of the method flow of the present invention;
[0049] Figure 2 is a structural diagram of the WSOD model of the present invention;
[0050] Figure 3 is a diagram of the adversarial cooperative network structure of the present invention.
Detailed Implementation
[0051] The present invention will now be described in detail with reference to the accompanying drawings and specific embodiments. These embodiments are based on the technical solution of the present invention and provide detailed implementation methods and specific operating procedures. However, the scope of protection of the present invention is not limited to the following embodiments.
[0052] This embodiment provides a weakly supervised target detection method based on adversarial collaborative learning. As shown in Figure 1, the method includes the following steps:
[0053] S1, Interactive Adversarial Training.
[0054] In this step, the adversarial collaborative network shown in Figure 2 is trained interactively to address the challenge of effectively utilizing image-level coarse annotation information for network training in weakly supervised object detection, thereby improving the network's ability to perceive complete objects.
[0055] As shown in Figures 2 and 3, the adversarial cooperative network consists of two peer networks with identical structures. Each peer network is a basic WSOD model, which includes a candidate region feature extractor and a task head. The candidate region feature extractor includes a backbone network layer (VGG16), an RoI pooling layer, and two fully connected layers to extract candidate region features. The task head includes a multi-instance detection module, an online instance classification module, and a detection head. The extracted candidate region features are input into each module, and the modules respectively output the image-level classification scores and pseudo-labels, the predicted class probabilities and pseudo-labels of the candidate boxes (proposals), and the location and class predictions of the target instances.
[0056] The execution steps of the candidate region feature extractor are as follows:
[0057] Given image data I, its image-level label $Y_{img}$, and the set of candidate region proposal boxes R, candidate region features are obtained through a convolutional backbone, an RoI pooling layer, and two fully connected layers, where the candidate region proposal boxes are generated by the Selective Search algorithm.
[0058] As shown in Figure 3, the two WSOD models have the same input, and the candidate region features are extracted by the candidate region feature extractor.
[0059] As shown in Figure 2, the execution process of the task head is as follows:
[0060] In the multi-instance detection module, the basic WSOD module consists of a two-stream network structure based on a multi-instance learning scheme, namely a classification stream and a detection stream. The classification score matrix $P_{cls}$ is generated through a fully connected (FC) layer, and the softmax operation is then applied to $P_{cls}$ along the category dimension to generate classification scores. Similarly, the detection score matrix $P_{det}$ is generated through another parallel fully connected layer, and the softmax operation is then applied to $P_{det}$ along the candidate box dimension to generate detection scores. Finally, an image-level classification score is generated by summing over all proposals:
[0061] $P_{img} = \sum (P_{cls} * P_{det})$
[0062] The image-level classification score is trained under the supervision of image-level labels, and the expression for the image-level classification loss function is:
[0063]
[0064] In the formula, $L_{mil}$ is the image-level label loss; the image label indicates whether the image contains category c; and the image classification prediction score represents the predicted probability that the image contains category c.
[0065] To improve the accuracy of proposal classification, this embodiment employs an online instance classification module, combining multiple cascaded online classifiers to progressively refine the classification results. The online instance classifier consists of fully connected (FC) layers. Specifically, for each existing category, the online instance classification module selects the highest-scoring candidate box from the i-th classifier and its surrounding candidate boxes as positive samples to generate a hard pseudo-label $Y_{oicr-i}$. These labels are then used to train the subsequent (i+1)-th classifier using weighted cross-entropy loss.
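The pseudo-label step of one refinement stage can be sketched as follows. This is an illustration under stated assumptions: "surrounding candidate boxes" is interpreted as boxes whose IoU with the top-scoring box reaches a threshold, the value 0.5 is an assumed threshold, background is encoded as class index C, and the function names are hypothetical.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    # Boxes are (x1, y1, x2, y2).
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def oicr_pseudo_labels(scores, boxes, image_labels, iou_thresh=0.5):
    """scores: (R, C) from the i-th classifier; returns (R,) hard labels."""
    num_classes = scores.shape[1]
    labels = np.full(len(boxes), num_classes, dtype=int)  # default: background
    best_iou = np.zeros(len(boxes))
    for c in np.flatnonzero(image_labels):        # only categories in the image
        top = int(np.argmax(scores[:, c]))        # highest-scoring box for class c
        ious = iou_one_to_many(boxes[top], boxes)
        take = (ious >= iou_thresh) & (ious > best_iou)
        labels[take] = c                          # top box and its neighbours
        best_iou[take] = ious[take]
    return labels

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([[0.9, 0.1], [0.3, 0.2], [0.1, 0.8]])
pseudo = oicr_pseudo_labels(scores, boxes, image_labels=np.array([1, 1]))
```

In this toy input the second box overlaps the top-scoring box of class 0 with IoU 0.81 and therefore inherits its label, while the third box is the top-scoring box of class 1.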
[0066] Furthermore, the detection head (R-CNN) consists of two parallel branches: a classifier and a regressor, used for instance category prediction and location prediction tasks, respectively. The R-CNN head is supervised by pseudo-labels generated by the last stage of the online instance classification module; these pseudo-labels consist of both classification and regression labels. Weighted cross-entropy loss and smoothed L1 loss are applied to these two tasks, respectively.
[0067] This example uses mutual supervision between task heads to constrain the network's learning. That is, the pseudo-labels generated by the task head of Network 1 (a peer network) supervise the task head of Network 2 (another peer network), and vice versa. Specifically, the pseudo-labels generated by the multi-instance detection module and the online instance classification module of Network 1 are used as supervision signals for the online instance classification module and the detection head of Network 2, respectively, to generate the classification loss and detection loss. The classification loss and detection loss functions are:
[0068]
[0069] Where R represents the number of candidate region features, $R_p$ represents the number of positive-sample candidate region features, and CE represents the cross-entropy loss. The remaining terms denote the classification and detection pseudo-labels and the predictions of the classification branch and the detection branch for the r-th candidate box, respectively.
[0070] This embodiment proposes a one-to-one differential loss method for candidate region features among peer networks. Its purpose is to force the network to mine other valuable semantic information by increasing the feature distance within the same region. Specifically, it uses the candidate region features of one branch network as labels to supervise the prediction of candidate region features in another branch network. The one-to-one differential loss function is:
[0071]
[0072] Here, the two feature terms are the r-th candidate box feature of the first network and the r-th candidate box feature of the second network, respectively; R is the number of candidate box features; and Δ indicates that there is no gradient propagation.
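The formula is not reproduced in this text, so the sketch below shows only one possible form that matches the description: a feature distance bounded in [0, 1], a label branch that receives no gradient, and a loss that decreases as the features of the same region in the two networks move apart. The rescaled cosine distance and the names `cosine_distance01` and `one_to_one_differential_loss` are assumptions. NumPy has no automatic differentiation, so the no-gradient operator Δ is represented by treating the peer features as a constant copy.

```python
import numpy as np

def cosine_distance01(a, b, eps=1e-8):
    # Cosine distance rescaled to [0, 1]: 0 for equal directions, 1 for opposite.
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    cos = np.clip((a_n * b_n).sum(axis=1), -1.0, 1.0)
    return 0.5 * (1.0 - cos)

def one_to_one_differential_loss(feats_a, feats_b_label):
    """Assumed form: mean over R regions of (1 - distance).

    feats_b_label stands for the peer features under the no-gradient operator;
    an autograd framework would use its stop-gradient operation here.
    """
    target = np.array(feats_b_label, copy=True)  # constant label, no gradient
    return float(np.mean(1.0 - cosine_distance01(feats_a, target)))

f = np.array([[1.0, 0.0], [0.0, 2.0]])
loss_same = one_to_one_differential_loss(f, f)        # identical features
loss_opposite = one_to_one_differential_loss(f, -f)   # maximally different
```

Under this assumed form, identical features give the largest loss and opposite features the smallest, so minimizing it pushes the two networks toward different representations of the same region.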
[0073] To avoid invalid information such as learning background, this embodiment maintains a general class feature library (anchor feature library). This class feature library, in conjunction with two subnets, iteratively collects general information for each category. Based on the confidence level of the multi-instance detection module, it selects the candidate features that best represent each category, and updates the class feature library using their feature vectors. By utilizing feature similarity loss, which constrains the distance between candidate region features and their corresponding class features, the candidate region features are limited to the feature representation range of their corresponding categories, thus limiting the network's learning range of instance features to a certain extent. Specifically, candidate region features are categorized based on classification confidence, and the corresponding class features are used as feature labels to supervise the candidate region features. The feature similarity loss function is:
[0074]
[0075] Where $f_r$ and the class-feature term represent the feature of the r-th candidate box and the k-th class feature of the c-th category, respectively; K is the number of features stored for each class; R is the number of candidate region features; and $M_r$ is an indicator function denoting whether the r-th candidate feature is a positive sample, with 1 indicating a positive sample.
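The class feature library and the similarity constraint can be sketched as follows. The exact loss expression is not reproduced in this text, so this is a simplified stand-in: the library is a first-in-first-out array of K features per class, the update takes the highest-confidence candidate of each present category, and the loss is one minus the average cosine similarity to the K class features, averaged over positive samples. All function names and the FIFO policy are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def update_bank(bank, feats, scores, image_labels):
    """bank: (C, K, D). Store the top-scoring candidate feature of each present class."""
    for c in np.flatnonzero(image_labels):
        top = int(np.argmax(scores[:, c]))
        bank[c] = np.roll(bank[c], shift=1, axis=0)  # oldest entry wraps to slot 0
        bank[c, 0] = feats[top]                      # and is overwritten by the newest
    return bank

def feature_similarity_loss(feats, labels, positive_mask, bank):
    """labels: (R,) class index per candidate (any valid index for negatives)."""
    f = l2_normalize(feats)                                  # (R, D)
    anchors = l2_normalize(bank[labels])                     # (R, K, D)
    sim = np.einsum("rd,rkd->rk", f, anchors).mean(axis=1)   # average over K
    n_pos = max(int(positive_mask.sum()), 1)
    return float(((1.0 - sim) * positive_mask).sum() / n_pos)

feats = np.eye(3)
scores = np.array([[0.9, 0.1], [0.2, 0.7], [0.1, 0.1]])
bank = update_bank(np.zeros((2, 1, 3)), feats, scores, np.array([1, 1]))
mask = np.array([1.0, 1.0, 0.0])
loss_match = feature_similarity_loss(feats, np.array([0, 1, 0]), mask, bank)
loss_mismatch = feature_similarity_loss(feats, np.array([1, 0, 0]), mask, bank)
```

Positive candidates that lie close to the stored features of their own category give a loss near 0, while candidates assigned to the wrong category give a loss near 1.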
[0076] By using the loss function described above, interactive adversarial training of the adversarial cooperative network is achieved, enabling the adversarial cooperative network to achieve good detection performance.
[0077] S2, Target Detection.
[0078] This step involves inputting the image data to be detected into the adversarial cooperative network trained in step S1, which then outputs the target detection result.
[0079] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0080] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. The solutions in the embodiments of the present invention can be implemented using various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
[0081] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0082] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0083] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.
[0084] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the invention.
[0085] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.
Claims
1. A weakly supervised target detection method based on adversarial collaborative learning, characterized in that it includes the following steps: the image data to be detected is acquired, input into a pre-trained adversarial cooperative network, and the target detection result is output. The adversarial cooperative network includes two peer network models with the same structure. Each peer network model is a standard WSOD model. The WSOD model includes a candidate region feature extractor and a task head. The task head includes a multi-instance detection module, an online instance classification module, and a detection head. The steps for training the adversarial cooperative network include: obtaining image data and its corresponding candidate bounding boxes; inputting the image data and its corresponding candidate boxes into the candidate region feature extractors of their respective WSOD models to extract the candidate region features; inputting the candidate region features respectively into the multi-instance detection module, online instance classification module and detection head of their respective task heads for processing, to obtain image-level classification scores and pseudo-labels, candidate box category prediction probabilities and pseudo-labels, and target instance location and category predictions, respectively; and, based on the image-level classification score, the predicted category probability of the candidate box, and the location and category predictions of the target instance, constructing multiple loss functions to perform interactive adversarial training on the two WSOD models until training ends. The loss functions include a feature similarity loss function, which constrains the distance between the candidate region features and the anchor features of their corresponding categories. The anchor features of the categories are obtained during training based on the predicted category probability of the high-scoring candidate boxes for each category.
The expression for the feature similarity loss function is as follows: In the formula, K is the number of features for each class; the two feature terms are the r-th candidate region feature and the k-th feature of the c-th category, respectively; the loss term is the feature similarity loss; R is the number of candidate region features; the indicator function denotes whether the r-th candidate region is a positive sample, with 1 indicating a positive sample; and the remaining two terms are the maximum value of the reliability of all class features and the average similarity between the candidate region and all class features, respectively. In the interactive adversarial training process, the pseudo-labels generated by the multi-instance detection module and the online instance classification module in one WSOD model serve as supervision signals for the multi-instance detection module and the online instance classification module in another WSOD model, generating classification loss and detection loss respectively. The expressions for the classification loss and detection loss functions are as follows: In the formula, the two loss terms are the classification loss and the detection loss; R is the number of candidate region features; CE represents the cross-entropy loss; and the remaining terms denote, respectively, the number of positive-sample candidate region features, the classification and detection pseudo-labels, the predictions of the classification branch and the detection branch for the r-th candidate region feature, and the smooth absolute value loss.
2. The weakly supervised target detection method based on adversarial collaborative learning according to claim 1, characterized in that the candidate bounding boxes corresponding to the image data are obtained using the Selective Search algorithm.
3. The weakly supervised target detection method based on adversarial collaborative learning according to claim 1, characterized in that the candidate region feature extractor comprises a backbone network layer, an RoI pooling layer, and two fully connected layers. The step of extracting the candidate region features comprises: extracting image features from the image data using the backbone network layer; and inputting the image features and the candidate bounding boxes into the RoI pooling layer and the two fully connected layers for processing to obtain the candidate region features.
4. The weakly supervised target detection method based on adversarial collaborative learning according to claim 1, characterized in that the multi-instance detection module comprises two parallel branch streams: a classification stream and a detection stream. Both the classification stream and the detection stream comprise a fully connected layer and a softmax layer. The steps by which the multi-instance detection module obtains the image-level classification score comprise: for the classification stream, passing the candidate region features through a fully connected layer to generate a classification score matrix P_cls, and applying the softmax layer to P_cls along the category dimension to generate classification scores; for the detection stream, passing the candidate region features through a fully connected layer to generate a detection score matrix P_det, and applying the softmax layer to P_det along the candidate box dimension to generate detection scores; and generating the image-level classification score from the classification scores and the detection scores by summation. The expression for the image-level classification score is as follows: [formula not reproduced in the source text]. In the formula, the left-hand side is the image-level classification score.
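A minimal plain-Python sketch of the two-stream scoring in claim 4 follows. Because the expression itself is not reproduced in this text, the sketch assumes the formulation commonly used in two-stream multi-instance detection networks: the classification and detection scores are multiplied element-wise and then summed over the candidate boxes. The matrix layout (rows are candidate boxes, columns are categories) and the function names are also assumptions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def image_level_scores(p_cls, p_det):
    """Combine two R x C score matrices into one score per category.

    p_cls, p_det: lists of R rows (candidate boxes), each with C entries (categories).
    """
    R, C = len(p_cls), len(p_cls[0])
    # Classification stream: softmax along the category dimension, per box.
    cls = [softmax(row) for row in p_cls]
    # Detection stream: softmax along the candidate box dimension, per category.
    det = [softmax([p_det[r][c] for r in range(R)]) for c in range(C)]
    # Assumption: element-wise product, then sum over candidate boxes.
    return [sum(cls[r][c] * det[c][r] for r in range(R)) for c in range(C)]

# Toy example: 3 candidate boxes, 2 categories.
p_cls = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
p_det = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]
scores = image_level_scores(p_cls, p_det)
```

Under this assumption each image-level score lies in [0, 1], because the detection weights of a category sum to 1 over the candidate boxes and each classification score is at most 1. That is what allows the scores to be compared directly with binary image-level labels.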
5. The weakly supervised target detection method based on adversarial collaborative learning according to claim 4, characterized in that the loss functions include an image-level label loss function. The image-level classification score is directly supervised by the image-level labels, and the image-level label loss function is used for supervised training of the multi-instance detection module. The expression for the image-level label loss function is as follows: [formula not reproduced in the source text]. In the formula, the terms are the image-level label loss; the image label, which indicates whether the image contains category c; and the image classification prediction score, which is the predicted probability that category c is present in the image.
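The image-level label loss can be sketched as follows. The expression is not reproduced in this text, so the sketch assumes the multi-label binary cross-entropy that matches the symbol definitions given (a binary label and a predicted probability per category c), summed over categories. The function name and the clamping constant are hypothetical.

```python
import math

def image_label_loss(labels, scores, eps=1e-7):
    """Multi-label binary cross-entropy summed over categories.

    labels: 0/1 per category (does the image contain category c?).
    scores: predicted probability per category, expected in [0, 1].
    """
    loss = 0.0
    for y, s in zip(labels, scores):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        loss -= y * math.log(s) + (1 - y) * math.log(1.0 - s)
    return loss

# An image that contains category 0 but not category 1.
confident = image_label_loss([1, 0], [0.9, 0.1])
uncertain = image_label_loss([1, 0], [0.5, 0.5])
```

A prediction that agrees with the labels (`confident`) yields a smaller loss than an uninformative one (`uncertain`).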
6. The weakly supervised target detection method based on adversarial collaborative learning according to claim 1, characterized in that the online instance classification module comprises multiple cascaded online classifiers. The predicted categories of the candidate boxes are progressively refined by the multiple cascaded online classifiers to obtain the final category prediction probabilities of the candidate boxes.
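The claim does not state how one classifier in the cascade supervises the next. One common choice in the weakly supervised detection literature, assumed here purely for illustration, is to take the top-scoring candidate box for each category present in the image and assign that category to every box that overlaps it sufficiently. The function names, the IoU threshold of 0.5, and the background index of -1 are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def stage_pseudo_labels(boxes, scores, present_classes, iou_thr=0.5, background=-1):
    """Pseudo-labels for the next classifier, derived from the previous stage's scores.

    scores[r][c] is the previous stage's score of box r for category c.
    """
    labels = [background] * len(boxes)
    best_overlap = [0.0] * len(boxes)
    for c in present_classes:
        # Seed box: the highest-scoring box for this category.
        top = max(range(len(boxes)), key=lambda r: scores[r][c])
        for r, box in enumerate(boxes):
            ov = iou(box, boxes[top])
            # Keep the label of the seed box that overlaps this box the most.
            if ov >= iou_thr and ov > best_overlap[r]:
                labels[r] = c
                best_overlap[r] = ov
    return labels

# Toy example: one category present, three candidate boxes.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [[0.9], [0.3], [0.1]]
labels = stage_pseudo_labels(boxes, scores, present_classes=[0])
```

In a cascade, the labels produced from stage k's scores would supervise stage k+1, so that each stage sees progressively cleaner targets.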
7. The weakly supervised target detection method based on adversarial collaborative learning according to claim 1, characterized in that the detection head comprises two parallel branches, each branch comprising a classifier and a regressor. The classifier and the regressor generate, respectively, the category prediction and the location prediction of the target instance based on the candidate region features. When generating the category prediction and the location prediction of the target instance, the detection head is supervised by the pseudo-labels generated by the last stage of the online instance classification module.
8. The weakly supervised target detection method based on adversarial collaborative learning according to claim 1, characterized in that the loss functions include a one-to-one differential loss function, which is formed by using the candidate region features obtained from one WSOD model as labels to supervise the candidate region features predicted by the other WSOD model. The expression for the one-to-one differential loss function is as follows: [formula not reproduced in the source text]. In the formula, the terms are a feature distance calculation function, in which the constant 1 ensures that the calculated distance lies in the range [0, 1]; the r-th candidate region feature of the first network and the r-th candidate region feature of the second network; the number R of candidate region features; an operator indicating that no gradient is propagated; and the one-to-one differential loss.
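The one-to-one differential loss can be illustrated with a short sketch. The distance function is not reproduced in this text. The sketch assumes it is one minus the cosine similarity, which lies in [0, 1] when the features are non-negative (for example, after a ReLU), and that the loss is the mean distance over the R candidate regions. Plain Python has no automatic differentiation, so the "no gradient propagation" operator appears only as a comment. All names are hypothetical.

```python
import math

def cosine_distance(a, b, eps=1e-12):
    """1 - cosine similarity; lies in [0, 1] for non-negative feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / max(norm_a * norm_b, eps)

def one_to_one_loss(features, peer_features):
    """Mean distance between the r-th feature of one model and the r-th of its peer.

    peer_features act as fixed labels: in an autograd framework they would be
    detached so that no gradient flows back into the peer model.
    """
    R = len(features)
    return sum(cosine_distance(f, t) for f, t in zip(features, peer_features)) / R

# Toy example: R = 2 regions with 2-dimensional non-negative features.
feats_a = [[1.0, 0.0], [1.0, 1.0]]
feats_b = [[1.0, 0.0], [0.0, 1.0]]
loss_a = one_to_one_loss(feats_a, feats_b)  # model A supervised by model B
```

The loss is zero only when every pair of corresponding features points in the same direction, so minimizing it pulls the two peers' region features toward agreement region by region.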