Weakly supervised object detection method and system
By employing a saliency prior module, multi-target search, and area-guided weighting strategies, combined with a self-refinement module, an end-to-end weakly supervised target detection system is constructed. This system solves the local detection problem in non-rigid target detection and achieves efficient and accurate target detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2023-04-19
- Publication Date
- 2026-06-30
AI Technical Summary
Existing weakly supervised object detection techniques tend to detect only local parts of non-rigid targets; they are also computationally intensive and slow, so image-level labels cannot be used effectively for efficient training.
We employ a saliency prior module, a multi-object search method, and an area-guided weighted strategy, combined with a self-refinement module, to construct an end-to-end weakly supervised object detection system through feature extraction, region feature extraction, and bounding box regression, and train it using image-level labels.
It improves the detection accuracy of non-rigid targets, reduces local detection, lowers computational complexity, and achieves efficient target detection.
Figure CN116452877B_ABST
Description
Technical Field
[0001] This invention relates to the field of computer vision, and more specifically, to a weakly supervised target detection method, system, terminal, and storage medium based on saliency prior information and area guidance.
Background Technology
[0002] With the advent of the information age, computer vision technologies have played a vital role in various fields. Thanks to the development of Convolutional Neural Networks (CNNs) and the utilization of large-scale, meticulously labeled datasets, fully supervised object detection techniques have achieved good performance. However, the high cost of pixel-level annotation hinders object detection from becoming a widely adopted solution in practice. Therefore, weakly supervised object detection techniques using only image-level labels have become an effective approach and have been extensively researched, requiring only a small cost to add large amounts of training data. However, precisely because of the reduction in pixel-level positional information, this mismatch between input and output information results in a significant performance gap between weakly supervised and fully supervised object detection.
[0003] Multiple Instance Learning (MIL) is an effective and mainstream solution. It treats an image as a bag and each region as an instance, learning by continuously fitting similar regions (instances) across multiple images (bags). However, during training, the network constantly selects the highest-scoring proposal as a positive sample, which can easily lead to local detection, i.e., only detecting the most discriminative parts.
[0004] Deep neural networks extract image features better than traditional algorithms. Building on this, techniques such as contextual information, self-learning, bounding box regression, multi-task learning, and reinforcement learning are used to find more complete bounding boxes, so that the complete category and location of an object can be learned from image-level supervision alone. Bilen et al. (H. Bilen, A. Vedaldi. Weakly supervised deep detection networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2846-2854) combined CNN feature extraction with an MIL detector and scored features on a two-stream network for classification and detection, which became the mainstream framework for subsequent MIL schemes. Tang et al. (P. Tang, X. Wang, X. Bai, et al. Multiple instance detection network with online instance classifier refinement[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2843-2851) proposed online instance classifier refinement: each stream of the refinement module supervises the next stream, so a more complete bounding box is gradually obtained over the iterations. Yang et al. (K. Yang, D. Li, Y. Dou. Towards precise end-to-end weakly supervised object detection network[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 8372-8381) added the bounding box regression module of a fully supervised network and fused it with the MIL detector into an end-to-end network; the two share the same backbone network, which yields better localization. An attention guidance module was also introduced to extract target features more effectively. Although these weakly supervised target detection algorithms improve significantly on traditional detection methods, the networks remain weak at detecting non-rigid targets, mainly because they tend toward local detection.
[0005] A search of the prior art revealed:
[0006] Chinese invention patent application CN113378829A, published on September 10, 2021, entitled "A Weakly Supervised Object Detection Method Based on Positive and Negative Sample Balance," collects scene images for object detection and creates labels. The training set is input into a target candidate box filtering module, which uses a selective search method to obtain all target candidate boxes for the scene image. The module calculates the environment coefficient of all target candidate boxes based on the weakly supervised semantic segmentation result M corresponding to the scene image. The environment coefficients of all target candidate boxes are sorted, and the top-ranked target candidate boxes are selected as the initial target candidate boxes for the scene image. A weakly supervised object detection network is established, and the obtained training set and initial target candidate boxes are simultaneously input into the weakly supervised object detection network for training. During training, an optimal target box update method is used to obtain the trained weakly supervised object detection network. However, this method requires additional training of a weakly supervised semantic segmentation network to calculate the environment coefficients and does not optimize for non-rigid objects, resulting in significant local detection issues.
[0007] Chinese invention patent application CN114972711A, published on August 30, 2022, entitled "An Improved Weakly Supervised Object Detection Method Based on Semantic Information Candidate Boxes," preprocesses the training set, including random horizontal flipping. It designs a combined backbone network that fuses features from masked and non-masked network branches. The non-masked branches coarsely locate locally salient target regions, while the masked branches mask salient features and retain responses to less obvious features. A multi-branch detection head network based on a multiple instance selection algorithm is designed to generate high-confidence pseudo ground-truth boxes for training. The semantic information of the targets generated by the multi-branch detection head network is iteratively masked to generate more reasonable target candidate boxes. This method requires iterative mask generation within the network, resulting in high computational cost in the feature extraction part and low detection speed. Furthermore, while the network focuses on solving multi-target problems in images, the mask may converge to localized areas during iteration, so that only a portion of the target region is finally localized.
[0008] Currently, no descriptions or reports of technologies similar to this invention have been found, and no similar information has been collected domestically or internationally.
Summary of the Invention
[0009] To address the shortcomings of existing technologies, the purpose of this invention is to provide a weakly supervised target detection method and system.
[0010] According to one aspect of the present invention, a weakly supervised target detection method is provided, comprising:
[0011] Feature extraction is performed on the input image;
[0012] The features are refined to obtain enhanced features;
[0013] Proposals are extracted from the input image;
[0014] Region features of the proposals are extracted from the enhanced features through an ROI pooling layer;
[0015] A fully connected layer is applied to the region features to obtain the feature matrices x^cls and x^det;
[0016] Softmax is applied along two different dimensions, category and proposal, to obtain σ(x^cls) and σ(x^det), and their element-wise product gives the scores of all proposals;
[0017] The image score for category c is obtained as the sum of the scores of all proposals for that category;
[0018] The proposal scores and image scores are associated with the saliency prior module, the multi-object search method, the refinement branches, the bounding box regression branch, and the area-guided weighting strategy.
[0019] Preferably, during training, the proposal scores and image scores are obtained through the MIL detection head;
[0020] The output of the MIL detection head is refined through K refinement branches; each of the K branches consists of an independent FC layer and a softmax layer, and uses the saliency prior module combined with the multi-object search method to find rich and as complete as possible target proposals as pseudo-bounding boxes;
[0021] The bounding box regression branch is added after the last refinement branch;
[0022] For the proposal scores in the refinement branches and the bounding box regression branch, the area-guided sample weighting strategy encourages the network to search for targets over a larger area.
[0023] Preferably, the step of refining the features to obtain enhanced features includes:
[0024] For the input features, a 1×1 convolutional block is used to obtain the mask W and bias b for multiplication and addition;
[0025] The masked and biased input features are passed through a softmax function to obtain the enhancement layer A = σ(W ⊙ f_in + b),
[0026] where A is the weight, f_in is the input feature map, and σ is the softmax function;
[0027] After feature enhancement, the output feature f_out is obtained:
[0028] f_out = (1 + A) f_in.
[0029] Preferably, the process in the saliency prior module includes:
[0030] Multiple traditional features are extracted from the input image I to obtain the saliency prediction map I_s;
[0031] Erosion and dilation operations are performed on the prediction map I_s to correct the saliency, resulting in a corrected saliency map:
[0032]
[0033] where F_e and F_d are the erosion and dilation functions, respectively;
[0034] Threshold segmentation is performed on the corrected saliency map to obtain the salient bounding boxes of all connected regions and the number of connected regions;
[0035] The bounding box of the connected region with the largest area is selected as the trustworthy pseudo-bounding box b_s for the current training image and is given the same score as the highest-scoring sample, expressed as:
[0036]
[0037] A first filter screens the images, retaining only the salient bounding boxes of images belonging to a single category;
[0038] A second filter removes saliency pseudo-bounding boxes that occupy more than half of the image and whose predicted values are discretely distributed.
[0039] Preferably, the results of the multi-object search method are used to filter the bounding box obtained by the saliency prior module, including:
[0040] In the multi-object search method, all candidate proposals for the input image are obtained;
[0041] All candidate proposals are sorted in descending order of score for each category, and a set proportion of the highest-scoring proposals is selected to form a candidate pool;
[0042] Iteratively, the highest-scoring proposal in the candidate pool is moved into the pseudo-bounding box set, and proposals whose intersection over union (IoU) with the selected pseudo-bounding box is greater than τ are removed from the candidate pool, until the candidate pool is empty or the pseudo-bounding box set has reached its proposal limit;
[0043] The trustworthy pseudo-bounding box b_s obtained by the saliency prior module is considered a valid saliency prior only if it matches at least one of the multi-object search results.
[0044] Preferably, the area-guided sample weighting strategy includes:
[0045] For each class c present in the image, a set of pseudo-bounding boxes is obtained in the k-th branch, where the set may include a saliency pseudo-bounding box and n_c denotes the number of pseudo-bounding boxes obtained from the multi-object search;
[0046] Each pseudo-bounding box generates a corresponding positive sample cluster, whose samples share the same pseudo-class label y_k and initial training weight λ_k;
[0047] The areas of all samples in each cluster are calculated and sorted in descending order. In the resulting ranking, each s is the rank of one sample, and N is the number of positive samples;
[0048] An allocation coefficient μ_i in the range [0, 1] is calculated from the ranking:
[0049]
[0050] A linear function maps the allocation coefficient μ_i to the weight ω_i of every sample in the cluster, expressed as:
[0051] ω_i = (1 + α) − 2α·μ_i
[0052] where α is half the difference between the maximum weight and the minimum weight;
[0053] The area-guided weights decrease as the number of iterations during training increases, eventually approaching zero. A linear function is used to control these weights, as shown below:
[0054]
[0055]
[0056] Where β is the total number of iterations, ε3 is the proportion of the iterations when weight allocation disappears to the total number of iterations, and υ is the current training iteration during the weight allocation process.
[0057] Preferably, the network constrains the MIL detection head, refinement branch, bounding box regression branch, and self-refinement module using a hybrid loss function, as shown below:
[0058]
[0059] where the terms of the hybrid loss are, in order, the loss function of the MIL detection head, the loss function of the refinement branches, the loss function of the bounding box regression branch, and the loss function of the self-refinement module, and a_SR is the loss weight of the self-refinement module.
[0060] According to a second aspect of the present invention, a weakly supervised target detection system is provided, comprising:
[0061] Feature extraction networks are used to extract features from input images;
[0062] The self-refinement module enhances the extracted features;
[0063] A proposal selection module extracts proposals from the input image;
[0064] An ROI pooling layer extracts the region features of the proposals from the enhanced features;
[0065] A fully connected module obtains the feature matrices x^cls and x^det through two fully connected layers;
[0066] A calculation module applies softmax along two different dimensions, category and proposal, to obtain σ(x^cls) and σ(x^det); their element-wise product gives the scores of all proposals, and the image score for category c is the sum of all proposal scores in that category;
[0067] The proposal scores and image scores are associated with the saliency prior module, the multi-object search method, the refinement branches, the bounding box regression branch, and the area-guided weighting strategy.
[0068] According to a third aspect of the present invention, a terminal is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can be used to execute any of the weakly supervised target detection methods described herein, or to run the weakly supervised target detection system described herein.
[0069] According to a fourth aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, can be used to perform any of the weakly supervised target detection methods described in the present invention, or to run the weakly supervised target detection system described in the present invention.
[0070] Compared with the prior art, the present invention has at least one of the following beneficial effects:
[0071] The image weakly supervised target detection method and system in this embodiment of the invention utilizes saliency priors to guide network training, which is more conducive to network initialization and ensures the stability of network training.
[0072] The image weakly supervised target detection method and system in this embodiment of the invention employs a multi-target search module, which can use as many targets in the image as possible during training, while avoiding ambiguity caused by classifying targets as background.
[0073] The image weakly supervised target detection method and system in this embodiment of the invention employs a multi-target search module, which can further filter out salient features during training and remove erroneous salient targets.
[0074] The image weakly supervised target detection method and system in this embodiment of the invention designs an area-guided weighted algorithm to encourage the network to search for targets over a larger area, thereby avoiding local detection of non-rigid targets.
[0075] The image weakly supervised target detection method and system in this embodiment of the invention employs a feature self-refinement module, which can better highlight target features and avoid noise introduced by multi-scale transformation.
[0076] The image weakly supervised target detection method and system in this embodiment of the invention directly uses the original image for training. Without introducing additional high-complexity models on the main basic network, it can effectively train an end-to-end target detection model.
[0077] The image weakly supervised target detection method and system in this embodiment of the invention are significantly better than existing algorithms in terms of the detection effect of non-rigid targets.
Attached Figure Description
[0078] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
[0079] Figure 1 is a flowchart of a weakly supervised target detection method according to an embodiment of the present invention;
[0080] Figure 2 is a schematic diagram of the structure of a weakly supervised target detection model based on saliency prior and area guidance in a preferred embodiment of the present invention;
[0081] Figure 3 is a flowchart of obtaining a pseudo-bounding box using saliency priors in a preferred embodiment of the present invention;
[0082] Figure 4 is a subjective comparison of the method of this application and OICR in a preferred embodiment of the present invention.
Detailed Implementation
[0083] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention. These all fall within the scope of protection of the present invention.
[0084] Referring to Figure 1, this embodiment provides a weakly supervised target detection method, the process of which is as follows:
[0085] S1: perform feature extraction on the input image;
[0086] S2: refine the features extracted in S1 to obtain enhanced features;
[0087] S3: extract proposals from the input image;
[0088] S4: extract the region features of the proposals from S3 by passing the enhanced features from S2 through an ROI pooling layer;
[0089] S5: apply a fully connected layer to the region features obtained in S4 to obtain the feature matrices x^cls and x^det;
[0090] S6: apply softmax along two different dimensions, category and proposal, to obtain σ(x^cls) and σ(x^det); their element-wise product gives the scores of all proposals, and the image score for category c is the sum of all proposal scores in that category;
[0091] The proposal scores and image scores are associated with the saliency prior module, the multi-object search method, the refinement branches, the bounding box regression branch, and the area-guided weighting strategy.
[0092] This embodiment incorporates self-refinement, a saliency prior module, a multi-object search method, and an area-guided weighting strategy, effectively improving the weakly supervised object detection system's ability to detect objects, especially non-rigid objects.
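The scoring in S5 and S6 can be illustrated with a minimal sketch. It assumes that x^cls is normalized over categories and x^det over proposals, as in two-stream MIL detectors; the function names `softmax` and `mil_scores` and the nested-list layout (C categories by R proposals) are illustrative choices, not taken from the patent.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_scores(x_cls, x_det):
    """x_cls, x_det: C x R nested lists (C categories, R proposals).

    Assumption: softmax over categories is applied to x_cls and softmax
    over proposals is applied to x_det.
    """
    num_classes, num_props = len(x_cls), len(x_cls[0])
    # Softmax over categories: normalize each proposal column of x_cls.
    cls_cols = [softmax([x_cls[c][r] for c in range(num_classes)])
                for r in range(num_props)]
    # Softmax over proposals: normalize each category row of x_det.
    det_rows = [softmax(x_det[c]) for c in range(num_classes)]
    # Element-wise product gives one score per (category, proposal).
    proposal_scores = [[cls_cols[r][c] * det_rows[c][r]
                        for r in range(num_props)]
                       for c in range(num_classes)]
    # Image score of a category: sum of its proposal scores.
    image_scores = [sum(row) for row in proposal_scores]
    return proposal_scores, image_scores
```

Because the detection stream sums to one over proposals, each image score stays strictly between 0 and 1, so it can be compared directly with a 0/1 image-level label.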
[0093] In a preferred embodiment of the present invention, during network training, the proposal scores and image scores are obtained through the MIL detection head. The output of the MIL detection head is refined through K refinement branches; each of the K branches consists of an independent FC layer and a softmax layer, and uses the saliency prior module combined with the multi-object search method to find rich and as complete as possible target proposals as pseudo-bounding boxes. A bounding box regression branch is added after the last refinement branch. For the proposal scores in the refinement branches and the bounding box regression branch, an area-guided sample weighting strategy encourages the network to search for targets over a larger area.
[0094] Further, referring to Figure 2, the steps of the weakly supervised target detection method in this embodiment concerning self-refinement, the saliency prior module, the multi-object search method, and the area-guided weighting strategy are as follows:
[0095] S100, Self-refinement: Removes noise caused by multi-scale transformation and highlights effective feature regions.
[0096] S200, Saliency Prior Feature Module: Extracts saliency prior features from the input image and performs preliminary screening based on the image's category information and the features of the saliency map;
[0097] S300, a multi-target search method: obtains as many target instances of the same category as possible in the image, while effectively utilizing more effective features and providing bidirectional guidance from the saliency prior module;
[0098] S400, filtering the saliency prior with the multi-object search results: if the region obtained by the saliency prior module does not match any pseudo-bounding box obtained by the multi-object search, the saliency prior information is considered incorrect;
[0099] S500, an area-guided sample weighting strategy: encourages the network to search for targets over a wider area, avoiding the network focusing too much on the most discriminative parts of the target.
[0100] The above embodiments construct a weakly supervised object detection model based on saliency prior and area guidance: saliency is extracted from the training set; the generated saliency information is filtered; the filtered saliency information assists the training of the weakly supervised detection model, yielding an object detection model; and the image dataset to be detected is input into that model to obtain the category and localization of the objects in each image. Traditional features and sample weights guide the network to focus on the entire object rather than a specific region. Coarse pseudo-bounding boxes provided by the traditional saliency prior help the network initialize better; the multi-object search module is combined with them to jointly discover more, and more accurate, target features in the image; and an area-related weighting strategy encourages the network to search for targets over a larger region, further avoiding local detection. Together these effectively improve the detection capability of the weakly supervised object detection system, especially for non-rigid objects.
[0101] In a preferred embodiment of the present invention, S100, which performs feature enhancement through self-refinement, may include the following steps:
[0102] S101: a 1×1 convolutional block is used to obtain the mask W and bias b for multiplication and addition. The processed features are then passed through a softmax function to obtain the enhancement layer, represented as follows:
[0103] A = σ(W ⊙ f_in + b)
[0104] where A is the weight, f_in is the input feature map, and σ is the softmax function;
[0105] S102: after feature enhancement, the output feature f_out is obtained:
[0106] f_out = (1 + A) f_in
[0107] S103: the weight A is passed through a 3×3 and a 1×1 convolutional layer and is then supervised with the standard multi-label classification loss, where C is the number of categories; y_c is either 0 or 1 (1 if category c exists in the image, 0 otherwise); and φ_c is the score for category c.
[0108] This embodiment better highlights the target features and avoids noise introduced by multi-scale transformation.
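As a rough illustration of S101 and S102, the sketch below applies A = σ(W ⊙ f_in + b) and f_out = (1 + A) f_in to a single-channel feature map. Two things are assumptions rather than statements of the patent: the softmax is taken over all spatial positions of the channel, and the mask W and bias b are passed in directly instead of being produced by a 1×1 convolution. The name `self_refine` is hypothetical.

```python
import math


def self_refine(f_in, mask_w, bias_b):
    """f_in, mask_w, bias_b: H x W nested lists for one channel.

    Returns (f_out, attention) where attention is the weight map A.
    Assumption: softmax is taken over all spatial positions.
    """
    width = len(f_in[0])
    flat_in = [f for row in f_in for f in row]
    flat_w = [w for row in mask_w for w in row]
    flat_b = [b for row in bias_b for b in row]
    # W ⊙ f_in + b, element by element.
    logits = [w * f + b for f, w, b in zip(flat_in, flat_w, flat_b)]
    # Numerically stable softmax gives the weight map A.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    attn = [e / total for e in exps]
    # f_out = (1 + A) f_in keeps the input and adds the attended part.
    flat_out = [(1.0 + a) * f for a, f in zip(attn, flat_in)]

    def reshape(flat):
        return [flat[i:i + width] for i in range(0, len(flat), width)]

    return reshape(flat_out), reshape(attn)
```

The residual form (1 + A) means positions with a weight near zero pass through almost unchanged, while strongly weighted positions are amplified.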
[0109] In a preferred embodiment of the present invention, the saliency prior feature module of S200 is constructed as shown in Figure 3 and may include the following steps:
[0110] S201: a saliency prediction map I_s of the input image I is first obtained through a saliency module that uses traditional features such as superpixels, contrast, color, and texture;
[0111] S202: erosion and dilation operations are performed on the prediction map I_s to correct the saliency, resulting in a corrected saliency map. This strengthens the internal connectivity of the target while separating the target region from the background, as shown below:
[0112]
[0113] where F_e and F_d are the erosion and dilation functions, respectively;
[0114] S203: thresholding is performed on the corrected saliency map to obtain the bounding boxes of all connected regions and the number of connected regions. The bounding box of the connected region with the largest area is selected as the trusted pseudo-bounding box b_s for the current training image, which can be represented as follows:
[0115]
[0116] The MIL detection head calculates a numerical score for each proposal in a specific category. The score the network gives to the saliency bounding box b_s may not be the highest, but because the quality of the saliency bounding box is relatively high, it is directly assigned the highest score.
[0117] S204, the first filter is used for screening, and only the saliency prior information of single-class images is used;
[0118] S205 uses a second filter to remove salient pseudo-bounding boxes that account for more than half of the image and whose predicted values are discretely distributed, thereby removing obviously erroneous samples with backgrounds such as gravel or forests.
[0119] This embodiment utilizes saliency priors to guide network training, which is more conducive to network initialization and ensures the stability of network training.
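The thresholding and region selection of S203 can be sketched as follows. The sketch assumes the saliency map has already been corrected by erosion and dilation, uses 4-connectivity, and measures region area as a pixel count; the threshold value and the name `largest_salient_box` are illustrative assumptions.

```python
from collections import deque


def largest_salient_box(saliency, threshold=0.5):
    """saliency: H x W nested list of values in [0, 1].

    Returns the inclusive box (x1, y1, x2, y2) of the largest connected
    region at or above the threshold, or None if no pixel qualifies.
    """
    height, width = len(saliency), len(saliency[0])
    seen = [[False] * width for _ in range(height)]
    best_box, best_area = None, 0
    for y in range(height):
        for x in range(width):
            if seen[y][x] or saliency[y][x] < threshold:
                continue
            # Breadth-first flood fill over one 4-connected region.
            queue = deque([(y, x)])
            seen[y][x] = True
            x1 = x2 = x
            y1 = y2 = y
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                x1, x2 = min(x1, cx), max(x2, cx)
                y1, y2 = min(y1, cy), max(y2, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and saliency[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area > best_area:
                best_box, best_area = (x1, y1, x2, y2), area
    return best_box
```

Returning None when nothing exceeds the threshold lets a caller skip the saliency prior for that image, which matches the spirit of the filters in S204 and S205.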
[0120] In a preferred embodiment of the present invention, the multi-target search method in S300 may include the following steps:
[0121] S301, For the training set images, obtain all candidate proposals for the images using a selective search method.
[0122] S302: for all generated proposals, sort them in descending order of score for each category, and select the highest-scoring proportion p = 0.15 of the proposals to form a candidate pool;
[0123] S303: iteratively move the highest-scoring proposal in the candidate pool into the pseudo-bounding box set, and remove from the candidate pool the proposals whose IoU with the selected pseudo-bounding box is greater than τ, until the candidate pool is empty or the pseudo-bounding box set has reached the proposal limit k_MAX = 3.
[0124] This embodiment can use as many targets in the image as possible during training, while avoiding ambiguity caused by classifying targets as background.
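Steps S302 and S303 amount to a greedy selection with overlap suppression, sketched below for a single category. The values p = 0.15 and k_MAX = 3 come from the text; the suppression threshold `tau` is left as a required argument because S303 does not state its value. Treating p as a fraction of the ranked proposals, the box format (x1, y1, x2, y2), and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def multi_object_search(boxes, scores, tau, p=0.15, k_max=3):
    """Return indices of proposals chosen as pseudo-bounding boxes."""
    if not boxes:
        return []
    # S302: rank by score and keep the top fraction p as the pool.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    pool = order[:max(1, int(p * len(order)))]
    # S303: repeatedly take the best proposal, then drop its overlaps.
    pseudo = []
    while pool and len(pseudo) < k_max:
        best = pool.pop(0)
        pseudo.append(best)
        pool = [i for i in pool if iou(boxes[i], boxes[best]) <= tau]
    return pseudo
```

Unlike keeping only the single top-scoring proposal, this loop can return several non-overlapping instances of the same category, which is what allows more than one target per image to serve as supervision.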
[0125] In a preferred embodiment of the invention, step S400 filters the bounding boxes obtained from the saliency prior using the results of the multi-object search. If none of the bounding boxes obtained from the saliency prior for the current image reaches an IoU of τ = 0.2 with any of the bounding boxes obtained from the multi-object search for that image, the saliency prior information is not used during training.
[0126] This embodiment can further filter out salient features during training, eliminating erroneous salient targets.
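The check in S400 reduces to an overlap test between the saliency pseudo-bounding box and the multi-object search results. The sketch below uses τ = 0.2 from the text and assumes that "matching" means an IoU greater than or equal to τ; the function names and the (x1, y1, x2, y2) box format are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def saliency_prior_is_valid(saliency_box, search_boxes, tau=0.2):
    """True if the saliency box overlaps at least one searched box.

    Assumption: a match is an IoU greater than or equal to tau.
    """
    return any(iou(saliency_box, box) >= tau for box in search_boxes)
```

When the function returns False, the saliency box is simply left out of the pseudo-bounding box set for that image.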
[0127] In a preferred embodiment of the present invention, the area-guided sample weighting strategy of S500 may include the following steps:
[0128] S501: for each category c present in the image, a set of pseudo-bounding boxes is obtained in the k-th branch, where the set may include a saliency pseudo-bounding box and |n_c| is the number of pseudo-bounding boxes obtained from the multi-object search. Each pseudo-bounding box generates a corresponding positive sample cluster, whose samples share the same pseudo-class label y_k and initial training weight λ_k;
[0129] S502: the areas of all samples in each cluster are calculated and sorted in descending order. In the resulting ranking, each s is the rank of one sample, and N is the number of positive samples;
[0130] S503: the ranking is scaled to [0, 1] to obtain an allocation coefficient μ_i, as follows:
[0131]
[0132] S504: a linear function maps the allocation coefficient μ_i to the weight ω_i of every sample in the cluster, as follows:
[0133] ω_i = (1 + α) − 2α·μ_i
[0134] Preferably, α is half the difference between the maximum and minimum weights;
[0135] In S505, the area-guided weights decrease with the number of iterations during training, eventually approaching zero. A linear function is used to control the weights, as shown below:
[0136]
[0137]
[0138] Where β is the total number of iterations, ε3 is the proportion of the iterations when weight allocation disappears to the total number of iterations, and υ is the current training iteration in the weight allocation process;
[0139] S506: the loss function of the k-th branch is designed as follows:
[0140]
[0141]
[0142] where the first expression is the loss of the k-th refinement branch and the second is the total loss function of the refinement branches.
[0143] In this embodiment, a positive sample cluster is a set of positive samples, which are bounding boxes that are considered to contain valid target categories after being filtered by the network. Conversely, samples that are considered by the network to not contain targets, i.e., containing background, are negative samples. A positive sample cluster is generated from positive samples. After obtaining a positive sample, a bounding box with a high degree of overlap with that positive sample is considered to also contain the same valid target. In other words, a positive sample and its highly overlapping bounding box together form a positive sample cluster.
[0144] The k-branch in this embodiment is a form of knowledge distillation, used to better learn the features of the target. The principle of knowledge distillation is that it's easier to train from the output of a pre-trained network than directly from the source data. Using the K-branch refinement effectively improves detection capabilities. This embodiment encourages the network to search for targets over a wider range, thereby avoiding localized detection of non-rigid targets.
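The area-guided weights of S502 to S504 can be sketched for one positive sample cluster as follows. The mapping ω_i = (1 + α) − 2α·μ_i is taken from the text. The scaling of the rank to [0, 1] is not reproduced in the text, so the sketch assumes μ_i = (s_i − 1) / (N − 1), with rank 1 given to the largest box; the value of α and the name `area_guided_weights` are also assumptions.

```python
def area_guided_weights(boxes, alpha=0.5):
    """boxes: list of (x1, y1, x2, y2) in one positive sample cluster.

    Returns one weight per box, in the input order.
    Assumption: mu = (rank - 1) / (N - 1), rank 1 = largest area.
    """
    n = len(boxes)
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    # Descending area: position 0 in `order` is the largest box.
    order = sorted(range(n), key=lambda i: areas[i], reverse=True)
    weights = [0.0] * n
    for position, idx in enumerate(order):
        mu = position / (n - 1) if n > 1 else 0.0
        # Linear map from the text: mu = 0 gives 1 + alpha, mu = 1 gives 1 - alpha.
        weights[idx] = (1.0 + alpha) - 2.0 * alpha * mu
    return weights
```

With this mapping the largest sample receives the maximum weight 1 + α and the smallest receives 1 − α, so their difference is 2α, consistent with α being half the difference between the maximum and minimum weights.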
[0145] Based on the same inventive concept, in other embodiments of the present invention, a weakly supervised object detection system is provided. The system uses the basic framework of multiple instance learning, building on the OICR base network (see P. Tang, X. Wang, X. Bai, et al. Multiple instance detection network with online instance classifier refinement[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2843-2851) and a bounding box regression branch. It constructs an end-to-end weakly supervised object detection system through a saliency prior bounding box acquisition module, a multi-object search filtering module, an area-guided weighting algorithm, and a self-refinement module. The model is then trained on a general dataset; the saliency prior information can be generated offline, so it does not affect normal training time or complexity. Finally, a model for object detection is obtained.
[0146] Furthermore, a weakly supervised object detection system includes a feature extraction network for extracting features from the input image; a self-thinning module for enhancing the extracted features; and a suggestion selection module for extracting suggestions from the input image.
[0147] The ROI Pooling layer extracts the proposed enhanced region features; the fully connected layer, through two fully connected layers, yields the feature matrix x. cls , The calculation module performs softmax on two different dimensions, category and suggestion, to obtain σ(x). cls ), σ(x det ), through element product Get all suggestion scores; get the image score in category c, which is the sum of all suggestion scores in that category: Among them, the suggestion score, image score and saliency prior module, multi-objective search method, refinement branch, bounding box regression branch and area-guided weighted strategy are associated.
[0148] Based on the same inventive concept, in other embodiments of the present invention, a terminal is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it can be used to execute any of the weakly supervised target detection methods described above, or to run the weakly supervised target detection system described above.
[0149] Based on the same inventive concept, in other embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, can be used to perform any of the weakly supervised target detection methods described above, or to run the weakly supervised target detection system described above.
[0150] Based on the above concept, a specific embodiment adopts the following technical solution:
[0151] A weakly supervised target detection method includes the following steps:
[0152] Step 1: On the basic framework of multi-instance learning and bounding box regression network, add a bounding box acquisition module with saliency prior, a screening module for multi-object search, an area-guided weighting algorithm, and a self-refining module to build an end-to-end weakly supervised object detection system;
[0153] Step 2: Obtain saliency pseudo-bounding boxes: obtain saliency prior information offline from the training set images and derive the saliency pseudo-bounding box of each image;
[0154] Step 3: Train the weakly supervised object detection using the training set images: During training, saliency information is initially filtered to remove obviously erroneous samples from multi-object image samples. This is because saliency information cannot distinguish image categories and is affected by factors such as lighting, sometimes resulting in saliency information with obvious localization errors (e.g., localizing to forest or sky). Multi-object search also filters saliency information to remove saliency prior information that fails to locate objects.
[0155] Step 4: Classification and localization of targets in the image: By inputting images of different sizes to be detected into the model trained in Step 3 above (i.e., the model shown in Figure 1), the category and bounding box of each target in the image can be obtained.
[0156] In this preferred embodiment, the structure of the weakly supervised target detection model based on saliency prior and area guidance is shown in Figure 1. The network can be programmed and simulated on a single NVIDIA GTX 1080Ti GPU under Ubuntu 16.04 with the PyTorch deep learning framework. First, a weakly supervised object detection model is designed using the main structure of the MIL method, including a feature extraction network, a MIL detection head, a refinement branch, and a bounding box regression branch. At the same time, a newly constructed saliency prior module, a multi-object search module, an area-guided proposal weighting module, and a self-refinement module are added to construct the final training model. Then, image augmentation is applied to the original training dataset, and the constructed model is trained end-to-end. Finally, a hybrid loss function is used to constrain the MIL detection head, refinement branch, bounding box regression branch, and self-refinement module, resulting in a final model that can derive object categories and locations from images, that is, a weakly supervised object detection model based on saliency priors and area guidance.
[0157] As a preferred embodiment, in step 1, a weakly supervised target detection model based on saliency prior and area guidance is proposed, with the network structure shown in Figure 1.
[0158] The model uses a VGG-16 deep neural network to extract features from the image, and then enhances the features through a self-refinement module.
[0159] Specifically, let I be an image from the dataset, for which the input contains only the image-level label $Y = \{y_1, y_2, \ldots, y_C\}$, where C is the number of image categories in the dataset and $y_c = 1$ or $0$ indicates that at least one target of class c is present in or absent from the image. For the training set, the object proposal set, together with the number of proposals generated, is obtained through selective search. The features output by the network pass through a ROI Pooling layer to extract the region features of the proposals, and then through two fully connected layers to obtain the feature matrices $x^{cls}$ and $x^{det}$. Softmax is performed over two different dimensions, category and proposal, to obtain $\sigma(x^{cls})$ and $\sigma(x^{det})$, and their element-wise product gives the scores of all proposals. Finally, the score of the image in category c is obtained by summing the scores of all proposals in that category. With image-level supervision (image category supervision), a multi-class cross-entropy loss function can train an instance classifier end-to-end, as shown in the following formula.
[0160]
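The score computation of paragraph [0159] can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, the inputs are small C x R nested lists (categories by proposals), and the loss is written in the per-category binary cross-entropy form used by OICR-style MIL heads, since the formula itself is not reproduced in this text.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_head(x_cls, x_det):
    """x_cls, x_det: C x R nested lists. Returns (proposal_scores, image_scores)."""
    num_classes, num_props = len(x_cls), len(x_cls[0])
    # Softmax over the category dimension, one proposal (column) at a time.
    cls_cols = [softmax([x_cls[c][r] for c in range(num_classes)])
                for r in range(num_props)]
    # Softmax over the proposal dimension, one category (row) at a time.
    det_rows = [softmax(x_det[c]) for c in range(num_classes)]
    # Element-wise product gives the score of every proposal in every category.
    scores = [[cls_cols[r][c] * det_rows[c][r] for r in range(num_props)]
              for c in range(num_classes)]
    # The image score of a category is the sum of its proposal scores.
    image_scores = [sum(row) for row in scores]
    return scores, image_scores


def mil_loss(image_scores, labels, eps=1e-6):
    """Assumed loss form: cross-entropy between image scores and image-level labels."""
    loss = 0.0
    for phi, y in zip(image_scores, labels):
        phi = min(max(phi, eps), 1.0 - eps)
        loss -= y * math.log(phi) + (1 - y) * math.log(1.0 - phi)
    return loss


x_cls = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
x_det = [[3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
proposal_scores, image_scores = mil_head(x_cls, x_det)
```

Because the proposal-dimension softmax sums to one within each category, every image score stays between 0 and 1 and can be compared directly with the 0/1 image-level label.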
[0161] The output of the MIL detection head in the model is refined through the refinement branches. Each of the K branches consists of an independent FC layer and a softmax layer, and uses a combination of the saliency prior and multi-object search to find object proposals that are as rich and complete as possible to serve as pseudo-bounding boxes.
[0162] A bounding box regression branch was added after the last refinement branch, and the loss of the branch was obtained using Smooth-L1 loss.
[0163] For the proposal scores in the refinement branches and the bounding box regression branch of the model, an area-guided sample weighting strategy is used: the samples of the positive sample cluster obtained from each pseudo-bounding box are assigned different weights according to their area, which encourages the network to search for targets over a larger area and makes it more likely that the network will learn complete feature information.
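The exact weighting formulas are not reproduced in this text, so the following is only one possible reading of the area-guided strategy. It ranks the samples of a cluster by area, turns the rank into a coefficient in [0, 1], maps it linearly to a weight around the initial weight, and lets the area-guided part fade linearly to zero. The function name, the parameter names, and the specific linear forms are assumptions.

```python
def area_guided_weights(boxes, base_weight, delta, iteration, total_iters,
                        vanish_ratio=0.7):
    """Weights for the samples of one positive cluster.

    boxes: list of (x1, y1, x2, y2); delta: half the difference between the
    maximum and minimum weight; vanish_ratio: fraction of total iterations at
    which the area-guided part reaches zero (0.7 in the embodiment).
    """
    n = len(boxes)
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes]
    # Rank 0 is the largest sample.
    order = sorted(range(n), key=lambda i: areas[i], reverse=True)
    # Linear decay of the area-guided part over training (assumed form).
    decay = max(0.0, 1.0 - iteration / (vanish_ratio * total_iters))
    weights = [base_weight] * n
    for rank, i in enumerate(order):
        # Allocation coefficient in [0, 1]: 1 for the largest, 0 for the smallest.
        coeff = 1.0 - rank / (n - 1) if n > 1 else 0.5
        weights[i] = base_weight + delta * (2.0 * coeff - 1.0) * decay
    return weights


boxes = [(0, 0, 10, 10), (0, 0, 20, 20), (0, 0, 5, 5)]
early = area_guided_weights(boxes, 1.0, 0.5, iteration=0, total_iters=100000)
late = area_guided_weights(boxes, 1.0, 0.5, iteration=80000, total_iters=100000)
```

Early in training the largest box receives the largest weight; once the decay has run out, every sample falls back to the initial weight.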
[0164] As a preferred embodiment, in step 2, the saliency prior information of all training set images is first obtained offline. By performing operations such as erosion, dilation, and binarization on the saliency information, a reliable saliency pseudo-bounding box is obtained.
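A rough sketch of this offline step is given below, assuming the saliency map is a small 2D list of values in [0, 1]. The 3x3 structuring element, the threshold of 0.5, the 4-connectivity, and all function names are illustrative assumptions; the text names only the operations (erosion, dilation, binarization), and the largest-region choice follows claim 3.

```python
from collections import deque


def morph(grid, op):
    """3x3 min filter (erosion, op=min) or max filter (dilation, op=max)."""
    h, w = len(grid), len(grid[0])
    return [[op(grid[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)] for y in range(h)]


def largest_region_box(mask):
    """Inclusive (x1, y1, x2, y2) box of the largest 4-connected region, or None."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best_size, best_box = 0, None
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            queue, size = deque([(sy, sx)]), 0
            seen[sy][sx] = True
            x1 = x2 = sx
            y1 = y2 = sy
            while queue:
                y, x = queue.popleft()
                size += 1
                x1, x2, y1, y2 = min(x1, x), max(x2, x), min(y1, y), max(y2, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size > best_size:
                best_size, best_box = size, (x1, y1, x2, y2)
    return best_box


def saliency_pseudo_box(saliency, thresh=0.5):
    corrected = morph(morph(saliency, min), max)  # erosion, then dilation
    mask = [[1 if v >= thresh else 0 for v in row] for row in corrected]
    return largest_region_box(mask)


saliency = [[0.0] * 7 for _ in range(7)]
for y in range(1, 5):
    for x in range(1, 5):
        saliency[y][x] = 0.9
saliency[6][6] = 0.9  # isolated noise pixel, removed by the erosion
box = saliency_pseudo_box(saliency)
```

In this toy map the erosion removes the isolated pixel, the dilation restores the main block, and the box of the remaining region becomes the pseudo-bounding box.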
[0165] In a preferred embodiment, in step 3, the saliency prior information obtained in step 2 is processed using a two-layer filter to retain images that have a single class and a simple background. However, the saliency prior may still incorrectly treat a salient background region as a foreground object. Therefore, the constructed multi-object search module is used to discover as many instances of the same class as possible in the image and to validate the saliency prior. Because the saliency box and the highest-scoring proposal may lie on different objects, multi-object search is necessary to avoid discarding correct saliency pseudo-bounding boxes.
[0166] After the above three-layer screening (first layer, saliency prior screening; second layer, two-layer filter screening; third layer, multi-objective search screening), saliency information that can be correctly used for training is obtained.
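The third screening layer can be sketched as a greedy search followed by a match test, following the procedure outlined in claim 4. The pool size, both IoU thresholds, and the use of IoU as the matching criterion are assumptions, as are the function names; the limit of three objects per category comes from the embodiment.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def multi_object_search(proposals, scores, max_objects=3, pool_size=10,
                        iou_thresh=0.3):
    """Greedily pick up to max_objects high-scoring, mutually distinct boxes."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    pool = order[:pool_size]  # candidate pool of the top-scoring proposals
    selected = []
    while pool and len(selected) < max_objects:
        best = pool.pop(0)
        selected.append(best)
        # Drop candidates that overlap the selected box too strongly.
        pool = [i for i in pool
                if iou(proposals[best], proposals[i]) <= iou_thresh]
    return [proposals[i] for i in selected]


def saliency_is_valid(saliency_box, found_boxes, match_thresh=0.5):
    """A saliency box is kept if it matches at least one searched object."""
    return any(iou(saliency_box, b) >= match_thresh for b in found_boxes)


proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (100, 100, 110, 110)]
scores = [0.9, 0.8, 0.7, 0.1]
found = multi_object_search(proposals, scores)
```

A saliency box that lands on forest or sky would overlap none of the searched objects and would therefore be discarded.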
[0167] In a preferred embodiment, in step 3, the training set images used are 5011 images from PASCAL VOC 2007 (see M. Everingham, L. Van Gool, C. K. I. Williams, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303-338.) and 11540 images from PASCAL VOC 2012 (see M. Everingham, S. M. A. Eslami, L. Van Gool, et al. The PASCAL visual object classes challenge: A retrospective[J]. International Journal of Computer Vision, 2015, 111(1): 98-136.). For each image participating in training, only image-level annotations are used, i.e., the target categories contained in the image. The input image is processed using a multi-scale setting: during training, a scale is randomly selected from the five scales {480, 576, 688, 864, 1200} and the image is mirror flipped. During testing, the average score across the 10 resulting inputs (five scales, each with and without mirror flipping) is used as the network's final output.
[0168] In a preferred embodiment, the number of refinement branches in the network is set to K=3 in step 1. The extra layers in the self-refinement module are initialized with a Gaussian distribution with a mean of 0 and a standard deviation of 0.02, and with an initial bias of 0.
[0169] During the training phase, the mini-batch size was set to 4, and SGD was used to optimize the network. For PASCAL VOC 2007 and PASCAL VOC 2012, the maximum number of iterations was set to 100K and 120K, respectively. The learning rate was 0.001 for the first 55K and 65K iterations, respectively, and was then reduced by a factor of 10 until training ended. Momentum and weight decay were set to 0.9 and 5e-4, respectively. The loss weight of the self-refinement module was $\alpha_{SR} = 0.3$.
[0170] In the multi-object search, the upper limit for the number of objects of the same category in a single image is set to 3. During training, the first 5K iterations use the saliency prior and the highest-scoring proposal. Afterward, the multi-object search strategy replaces the highest-scoring proposal, and saliency boxes are filtered once 0.3 of the total iterations have elapsed. The area-guided sample weighting decays to 0 at 0.7 of the total iterations.
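The schedule in the two preceding paragraphs can be summarized in a small helper. This is a sketch with hypothetical function and key names; whether each boundary iteration is inclusive or exclusive is not stated in the text and is assumed here.

```python
def learning_rate(iteration, step_iter, base_lr=0.001, gamma=0.1):
    """Step schedule: base_lr before step_iter, then reduced by a factor of 10."""
    return base_lr if iteration < step_iter else base_lr * gamma


def training_phase(iteration, total_iters, warmup_iters=5000,
                   filter_ratio=0.3, weight_vanish_ratio=0.7):
    """Which training-only strategies are active at a given iteration."""
    return {
        # After the first 5K iterations, multi-object search replaces
        # the single highest-scoring proposal.
        "use_multi_object_search": iteration >= warmup_iters,
        # Saliency boxes are filtered after 0.3 of the total iterations.
        "filter_saliency_boxes": iteration >= filter_ratio * total_iters,
        # Area-guided weighting has decayed to zero by 0.7 of the total iterations.
        "area_weighting_active": iteration < weight_vanish_ratio * total_iters,
    }


# PASCAL VOC 2007 setting: 100K iterations, learning-rate step at 55K.
phase_early = training_phase(1000, 100000)
phase_mid = training_phase(40000, 100000)
lr_late = learning_rate(60000, 55000)
```

For PASCAL VOC 2012 the same helpers would be called with 120000 total iterations and a step at 65000.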
[0171] The network uses a hybrid loss function to constrain the MIL detection head, refinement branch, bounding box regression branch, and self-refinement module, as shown below:
[0172]
[0173] The formula combines four terms: the loss function of the MIL detection head, the loss function of the refinement branch, the loss function of the bounding box regression branch, and the loss function of the self-refinement module. Preferably, the loss weight of the self-refinement module is $\alpha_{SR} = 0.3$.
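The hybrid loss formula itself is not reproduced in this text. One form consistent with the four loss terms and the single loss weight named here would be the following; the symbols, the sum over the K refinement branches, and the absence of weights on the first three terms are assumptions rather than the formula of the patent.

```latex
L = L_{MIL} + \sum_{k=1}^{K} L_{ref}^{k} + L_{reg} + \alpha_{SR}\, L_{SR},
\qquad \alpha_{SR} = 0.3
```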
[0174] After training, a weakly supervised object detection model can be obtained for detection.
[0175] In step 4, the target category and location can be obtained by inputting an image of any size to be detected into the model trained in the above steps.
[0176] It should be noted that during detection in step 4, the input image is processed by the VGG-16 feature extraction network, the self-refinement module performs feature enhancement, and selective search obtains the proposal set. After ROI Pooling and two fully connected layers, the features no longer pass through the MIL detection head, but only through the refinement branches and the bounding box regression branch to obtain the results. The coordinates of the bounding boxes are obtained by the bounding box regression module, and NMS is then used to determine the final detection boxes. The category of each detection box is jointly determined by the refinement and regression modules: the scores of the refinement branches are first averaged, and the resulting value is then averaged with the score obtained by the bounding box regression module. The saliency prior module, multi-object search, and area-guided weighting strategy are not used during detection; they are used only for training.
[0177] The model is then evaluated on the 4952 test images of the public dataset PASCAL VOC 2007 and the 10991 test images of the public dataset PASCAL VOC 2012 to detect which target categories (e.g., animals) are present in the images. The specific process is as follows:
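The score fusion used at detection time can be sketched as below, with per-proposal scores for a single category held in plain lists. The function names are hypothetical, and NMS is left out to keep the example short.

```python
def fuse_scores(refine_scores, reg_scores):
    """Average the K refinement scores, then average with the regression score.

    refine_scores: list of K lists, one score per proposal in each branch.
    reg_scores: one score per proposal from the bounding box regression branch.
    """
    k = len(refine_scores)
    n = len(reg_scores)
    refine_mean = [sum(branch[i] for branch in refine_scores) / k for i in range(n)]
    return [(refine_mean[i] + reg_scores[i]) / 2.0 for i in range(n)]


def average_views(view_scores):
    """Average fused scores over augmented inputs (e.g. 5 scales x 2 flips)."""
    n_views = len(view_scores)
    return [sum(view[i] for view in view_scores) / n_views
            for i in range(len(view_scores[0]))]


fused = fuse_scores([[0.6, 0.2], [0.9, 0.1], [0.3, 0.3]], [0.8, 0.4])
averaged = average_views([[0.2, 0.4], [0.4, 0.8]])
```

With K=3 branches, the first proposal above has a refinement mean of 0.6 and a regression score of 0.8, which fuse to about 0.7.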
[0178] The first step is to extract features from the images in the dataset, such as color, shape, and texture.
[0179] The second step is to refine the features to obtain enhanced features;
[0180] The third step is to extract proposals from the images of the above dataset;
[0181] The fourth step is to extract the region features of the proposals from the enhanced features through a ROI Pooling layer;
[0182] The fifth step is to apply fully connected layers to the region features and then feed the results into the refinement and regression branches to obtain the scores of all proposals in each branch. (During training, the detection head is used to obtain proposal scores; during application, the refinement and regression branches are used to obtain proposal scores.)
[0183] The sixth step is to obtain the location and category scores of objects in the image. The location is determined by applying NMS to all proposal boxes, and the score of a box is its average score across all branches. The image is also rescaled to multiple scales and flipped, and the average score is calculated across the 10 outputs.
[0184] The seventh step is to determine, based on the final score, whether the images of PASCAL VOC 2007 and PASCAL VOC 2012 contain birds, cats, dogs, sheep, or other animals.
[0185] The proposal scores and image scores are associated with the saliency prior module, the multi-object search method, the refinement branch, the bounding box regression branch, and the area-guided weighting strategy.
[0186] Of course, other images or image sets that require target (animal, plant or still life) identification can also be identified and detected using the methods and systems of this invention.
[0187] The recognition results on the aforementioned public datasets follow the PASCAL VOC protocol, under which a prediction counts as correct when the Intersection over Union (IoU) between the ground-truth bounding box and the predicted bounding box is greater than 0.5. Two evaluation metrics are used, mAP and CorLoc. Average Precision (AP) and mean AP (mAP) are used to evaluate the model's detection capability on the test set. Correct Localization (CorLoc) is used to evaluate the network's learned localization capability on the training set.
[0188] Table 1 shows the detection performance (AP, %) of different methods on the PASCAL VOC 2007 test set. The highest score is indicated in bold, and the second highest score is underlined.
[0189]
[0190] Table 2 shows the localization performance (CorLoc, %) of different methods on the PASCAL VOC 2007 training set. The highest score is indicated in bold, and the second highest score is underlined.
[0191]
[0192] Table 3 shows the detection performance (AP, %) of different methods on the PASCAL VOC 2012 test set. The highest score is indicated in bold, and the second highest score is underlined.
[0193]
[0194] Table 4 shows the localization performance (CorLoc, %) of different methods on the PASCAL VOC 2012 training set. The highest score is indicated in bold, and the second highest score is underlined.
[0195]
[0196] As shown in Tables 1, 2, 3, and 4, the algorithm achieves leading detection and localization performance on both general datasets. The tables also show that it achieves the best performance on non-rigid objects such as birds, cats, dogs, and sheep, indicating that the algorithm of this invention better solves the local detection problem of non-rigid targets. Furthermore, the method of this invention does not introduce additional complexity or inference time into the original framework and exhibits good portability. Figure 4 presents a subjective comparison between the method of this invention and OICR (in each set of images, the left is the ground-truth bounding box, the middle is the result of OICR, and the right is the result of this application). It can be seen that the method of this invention has better detection results for non-rigid targets.
[0197] The weakly supervised object detection method and system provided in the above embodiments of the present invention are based on a saliency prior and an area-guided sample weighting strategy. First, a coarse pseudo-bounding box provided by a traditional saliency prior is used to assist the network in better initialization. Furthermore, considering that the MIL method tends to detect the most discriminative local regions of the target, an area-guided sample weighting strategy is used in the detection network to encourage the network to search for targets from a larger region. To better utilize saliency information, the results of multi-object detection are used to filter saliency pseudo-bounding boxes, thereby removing erroneous feature information. In addition, a self-refinement module is used to enhance the features, and together with the detection, refinement, and regression modules, a hybrid loss function is formed to train the proposed network. Experiments were conducted on two commonly used public datasets, VOC2007 and VOC2012. Experimental results show that the present invention can effectively improve the performance of weakly supervised object detection networks, especially for non-rigid objects.
[0198] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the essence of the present invention. The above preferred features can be used in any combination without conflict.
Claims
1. A weakly supervised object detection method, characterized in that it includes: performing feature extraction on the input image; refining the features to obtain enhanced features; extracting proposals from the input image; extracting the region features of the proposals from the enhanced features through a ROI Pooling layer; applying a fully connected layer to the region features to obtain the feature matrices $x^{cls}$ and $x^{det}$; performing softmax over two different dimensions, category and proposal, to obtain $\sigma(x^{cls})$ and $\sigma(x^{det})$, and obtaining all proposal scores through their element-wise product; obtaining the image score of the image in category c by summing the scores of all proposals in that category; wherein the proposal scores and image scores are associated with the saliency prior module, the multi-object search method, the refinement branch, the bounding box regression branch, and the area-guided weighting strategy as follows: during training, the proposal scores and image scores are obtained through the MIL detection head; the output of the MIL detection head is refined through the refinement branches, where each of the K branches consists of an independent FC layer and a softmax layer and uses a combination of the saliency prior module and the multi-object search method to find object proposals that are as rich and complete as possible to serve as pseudo-bounding boxes; the bounding box regression branch is added after the last refinement branch; and, for the proposal scores in the refinement branches and the bounding box regression branch, the area-guided sample weighting strategy is used to encourage the network to search for targets over a larger area.
2. The weakly supervised target detection method according to claim 1, characterized in that the process of refining the features to obtain enhanced features includes: for the input features, using convolutional blocks to obtain a mask and a bias, which are applied by multiplication and addition, respectively; passing the masked and biased input features through a function to obtain the enhancement layer, where A is the weight; and, after feature enhancement, obtaining the output features.
3. The weakly supervised target detection method according to claim 1, characterized in that the process in the saliency prior module includes: extracting multiple traditional features of the input image to obtain a saliency prediction map; performing erosion and dilation operations on the prediction map to correct the saliency, resulting in a corrected saliency map; performing threshold segmentation on the saliency map to obtain the salient bounding boxes of all connected regions, together with the number of connected regions; selecting the bounding box of the connected region with the largest area as the trustworthy pseudo-bounding box for the current training image, and assigning it the same score as the highest-scoring sample; using a first filter to filter images, retaining only the salient bounding boxes of images belonging to a single category; and using a second filter to remove salient pseudo-bounding boxes that occupy more than half of the image and whose predicted values are discretely distributed.
4. The weakly supervised target detection method according to claim 3, characterized in that the search results obtained using the multi-object search method are used to filter the bounding boxes obtained by the saliency module, including: using the multi-object search method to obtain all candidate proposals for the input image; for all candidate proposals, sorting them in descending order of score for each category and selecting a set number of the highest-scoring proposals to form a candidate pool; iteratively moving the highest-scoring proposal in the candidate pool into the pseudo-bounding box set and removing from the candidate pool the proposals that intersect the selected bounding box with an IoU greater than the set threshold, until there are no more proposals in the candidate pool or the pseudo-bounding box set has reached its proposal limit; and considering a trustworthy pseudo-bounding box obtained by the saliency module to be a valid saliency prior if it matches at least one of the obtained multi-object search results.
5. The weakly supervised target detection method according to claim 1, characterized in that the area-guided sample weighting strategy includes: for each class c present in the image, obtaining a set of pseudo-bounding boxes in the k-th branch, consisting of a possible saliency pseudo-bounding box and the pseudo-bounding boxes obtained from multi-object search; generating, for each pseudo-bounding box, a corresponding positive sample cluster, in which the positive samples share the same pseudo-class label and initial training weight; calculating the area of all samples in each cluster and ranking them in descending order of area, each sample having a corresponding rank among the positive samples of the cluster; calculating from the ranking an allocation coefficient in [0, 1]; mapping the allocation coefficients of all samples in the cluster to weights using a linear function whose parameter is half the difference between the maximum and minimum weights; and decreasing the area-guided weights as the number of training iterations increases, eventually approaching zero, under the control of a linear function of the total number of iterations, the proportion of the total number of iterations at which the weight allocation disappears, and the current training iteration.
6. The weakly supervised target detection method according to claim 1, characterized in that the network uses a hybrid loss function to constrain the MIL detection head, refinement branch, bounding box regression branch, and self-refinement module, the hybrid loss function combining the loss function of the MIL detection head, the loss function of the refinement branch, the loss function of the bounding box regression branch, and the loss function of the self-refinement module, the last of which is weighted by the loss weight of the self-refinement module.
7. A weakly supervised target detection system, used to implement the method of claim 1, characterized in that it includes: a feature extraction network for extracting features from the input image; a self-refinement module for enhancing the extracted features; a proposal selection module for extracting proposals from the input image; a ROI Pooling layer for extracting the region features of the proposals from the enhanced features; two fully connected layers for obtaining the feature matrices $x^{cls}$ and $x^{det}$; and a calculation module that performs softmax over two different dimensions, category and proposal, to obtain $\sigma(x^{cls})$ and $\sigma(x^{det})$, obtains all proposal scores through their element-wise product, and obtains the image score of the image in category c by summing the scores of all proposals in that category; wherein the proposal scores and image scores are associated with the saliency prior module, the multi-object search method, the refinement branch, the bounding box regression branch, and the area-guided weighting strategy.
8. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it can be used to perform the method of any one of claims 1-6, or to run the system of claim 7.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the program can be used to perform the method of any one of claims 1-6, or to run the system of claim 7.