Graffiti guided weakly supervised multi-modal pixel-level pseudo label generation method and system
By using a large SAM segmentation model and image transformation extension strategy, combined with the complementarity and consistency fusion of bimodal images, high-quality pixel-level pseudo-labels are generated. This solves the problems of high annotation cost and insufficient pseudo-label quality in multimodal image saliency detection, and achieves efficient model training and performance improvement.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI UNIV
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
Existing multimodal image saliency detection models rely on a large amount of pixel-level labeled data, resulting in high labeling costs and insufficient quality of pseudo-label generation, which affects model performance.
We employ a large SAM segmentation model, combined with image transformation and cue expansion strategies, to generate high-quality pixel-level pseudo-labels through graffiti tags. We also utilize the complementarity and consistency fusion technique of bimodal images to generate high-quality pixel-level pseudo-labels.
It reduces annotation costs, generates high-quality pseudo-labels, and can effectively train RGB-D image saliency detection models. Its performance is close to that of fully supervised training, making it suitable for multimodal image saliency detection and other vision tasks.
Figure CN122244064A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of saliency detection, and specifically to a graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation method and system.
Background Technology
[0002] Saliency detection aims to identify and segment the most visually appealing objects in an image. As a fundamental task in computer vision, saliency detection research has driven the development of other visual tasks and is frequently applied in research on other tasks within the field of computer vision, such as image quality assessment, image captioning, object tracking, and object segmentation. Furthermore, saliency detection has been widely adopted in practical applications, such as autonomous driving, intelligent surveillance, medical image analysis, and smartphone photography.
[0003] With the development of deep learning technology, significant progress has been made in saliency detection tasks based on multimodal RGB-D images. However, the superior performance of current models relies on a large amount of pixel-level labeled data, which is cumbersome and time-consuming to acquire. To achieve a balance between model performance and labeling costs and reduce reliance on dense labels, weakly supervised methods have received increasing attention. Weakly supervised methods train models based on sparse labels, such as image-level labels, bounding boxes, point labels, and graffiti labels. Among these, graffiti labels roughly mark salient objects and insignificant backgrounds with simple lines. Compared to image-level labels, they provide direct spatial location information; compared to bounding boxes, they do not introduce negative samples; and compared to point labels, they provide more supervisory information. Therefore, research based on graffiti labels has attracted considerable attention.
[0004] However, as a type of sparse labeling, graffiti labels still provide insufficient supervision information. To obtain dense labels, some methods generate pixel-level pseudo-labels based on graffiti labels. Existing methods for generating pixel-level pseudo-labels include generating pseudo-labels through visual information (such as consistency) and generating supervision information and iteratively updating it through self-training mechanisms. However, the pseudo-label generation of the above methods depends on models that have not been sufficiently trained and lack supervision information, making it difficult to guarantee the quality of the pseudo-labels. This limitation further affects the performance of models trained with pseudo-labels generated by the above methods.
[0005] A search revealed a Chinese patent application (application number 202310939807.0) that discloses a weakly supervised method for multimodal image and point cloud instance segmentation for autonomous driving. That method processes point cloud data using 2D bounding box labels from the image data to obtain coarse point cloud pseudo-label data, and then uses a pseudo-label generator to process the coarse pseudo-label data to obtain trained point cloud pseudo-label data. However, that method does not involve SAM (Segment Anything Model), image transformation, cue extension, or consistency fusion, so the supervision provided by the resulting pseudo-label data remains insufficient.
Summary of the Invention
[0006] To address the shortcomings of existing technologies, the purpose of this application is to provide a graffiti-guided pixel-level pseudo-label generation method and system for weakly supervised multimodal image saliency detection.
[0007] The first aspect of this application provides a graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation method, comprising: obtaining an RGB image, a depth image, and the corresponding graffiti label; encoding the depth image into an HHA depth image; using the SAM segmentation model and an image transformation strategy, segmenting the RGB image and the HHA depth image respectively with the graffiti label as a cue, to obtain a set of bimodal segmentation mask pairs and their confidence scores; using the SAM segmentation model and a cue extension strategy, segmenting the RGB image and the HHA depth image respectively with the graffiti label and the extended label generated from it as cues, to obtain another set of bimodal segmentation mask pairs and their confidence scores; fusing the bimodal segmentation mask pairs based on the two sets of confidence scores, and calculating the consistency score of each bimodal segmentation mask pair, to obtain the initial pseudo-labels; and sorting the initial pseudo-labels by consistency score and fusing them with weighting to obtain the final pixel-level pseudo-labels.
[0008] Optionally, adopting the SAM segmentation model and, through an image transformation strategy, using the graffiti label as a cue to segment the RGB image and the HHA depth image respectively to obtain a set of bimodal segmentation mask pairs and their confidence scores includes: sampling a groups of points from the graffiti label, with N points in each group, where a and N are both integers greater than or equal to 1; performing multiple image transformations on the RGB image, the HHA depth image, and the sampling points; inputting the transformed images together with the original RGB image and HHA depth image into SAM, using the corresponding sampling points as prompts, and outputting the segmentation mask pairs and the corresponding prediction confidences; and applying the inverse transformation to the segmentation masks, fusing them by averaging, and averaging the confidences, to obtain the bimodal segmentation mask pairs and mask confidences corresponding to the a groups of sampling points.
[0009] Optionally, the step of using a cue extension strategy, with the graffiti label and the extended label generated from it as cues, to perform segmentation respectively and obtain another set of bimodal segmentation mask pairs and their confidence scores includes: performing superpixel segmentation on the RGB image and the HHA depth image respectively to obtain their respective superpixel segmentation maps; for the superpixel segmentation maps of the RGB image and the HHA depth image, marking the superpixels that intersect the graffiti label with the corresponding graffiti label, to obtain the foreground mask and background mask of the RGB image and of the HHA depth image respectively; performing conflict optimization on the foreground mask and the background mask to obtain optimized masks for the RGB image and the HHA depth image; performing consistency optimization on the optimized mask of the RGB image and the optimized mask of the HHA depth image to obtain the extended label mask; performing a groups of point sampling on the graffiti label and the extended label mask, where for each group N/2 points are sampled on the graffiti label and N/2 points on the extended label mask, and these are merged into N sampling points; and inputting the RGB image and the HHA depth image into SAM with the N sampling points as prompts, to obtain the bimodal segmentation mask pairs and mask confidence scores corresponding to the a groups of sampling points.
[0010] Optionally, the conflict optimization of the foreground mask and the background mask to obtain optimized masks for the RGB image and the HHA depth image includes: In the foreground mask of an RGB image, pixels that are simultaneously marked as foreground and background are set to gray to prevent their use in subsequent cue sampling; Similarly, in the foreground mask of the HHA depth image, pixels that are simultaneously marked as foreground and background will also be set to gray. This yields optimized foreground masks that exclude conflict regions in the RGB and HHA depth images, respectively.
[0011] Optionally, fusing the bimodal segmentation mask pairs based on the two sets of confidence scores, calculating the consistency score of each bimodal segmentation mask pair, and obtaining the initial pseudo-labels includes: the image transformation strategy yields a bimodal segmentation mask pairs and their confidence scores, and the cue extension strategy yields another a bimodal segmentation mask pairs and their confidence scores; for each of these 2a bimodal segmentation mask pairs, the two masks are fused with confidence weighting, and the fused result is used as an initial pseudo-label; for each of these 2a bimodal segmentation mask pairs, the masks are binarized and the intersection-over-union (IoU) between the RGB and HHA masks is calculated as the consistency score; this yields 2a initial pseudo-labels and their corresponding consistency scores.
[0012] Optionally, the step of sorting and fusing the initial pseudo-labels based on the consistency score to obtain the final pixel-level pseudo-labels includes: The 2a initial pseudo-labels are arranged in descending order of consistency score; Select the k initial pseudo-labels with the highest consistency scores, where k is an integer, 1≤k≤2a; The k initial pseudo-labels are weighted and fused based on their consistency scores to obtain the final pixel-level pseudo-labels.
[0013] A second aspect of this application provides a training method for an RGB-D image saliency detection model, comprising: determining the training sample set, including RGB image and depth image samples and their corresponding graffiti label sets; obtaining the pixel-level pseudo-label set corresponding to the RGB images by using any of the graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation methods described above; inputting the training samples into the RGB-D image saliency detection model to be trained to obtain the saliency map; calculating the binary cross-entropy loss and the intersection-over-union (IoU) loss based on the saliency map and the pixel-level pseudo-labels; calculating the partial cross-entropy loss based on the saliency map and the graffiti labels; calculating the total loss based on the binary cross-entropy loss, the IoU loss, and the partial cross-entropy loss; and, based on the total loss, using an optimizer to iteratively update the parameters of the model, resulting in a trained RGB-D image saliency detection model.
[0014] A third aspect of this application provides a graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation system, comprising: an image acquisition module, which acquires RGB images, depth images, and the corresponding graffiti labels; a depth image encoding module, which encodes the depth image into an HHA depth image; an image transformation-based segmentation mask generation module, which adopts the SAM segmentation model and, through an image transformation strategy with the graffiti label as a prompt, segments the RGB image and the HHA depth image respectively to obtain a set of bimodal segmentation mask pairs and their confidence scores; a cue extension-based segmentation mask generation module, which adopts the SAM segmentation model and, through a cue extension strategy with the graffiti label and the extended label generated from it as cues, segments the RGB image and the HHA depth image respectively to obtain another set of bimodal segmentation mask pairs and their confidence scores; an initial pseudo-label generation module, which fuses the bimodal segmentation mask pairs according to the two sets of confidence scores and calculates the consistency score of each bimodal segmentation mask pair to obtain the initial pseudo-labels; and a pixel-level pseudo-label generation module, which sorts the initial pseudo-labels by consistency score and fuses them with weighting to obtain the final pixel-level pseudo-labels.
[0015] A fourth aspect of this application provides a terminal including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can be used to perform the method described above or to run the system described above.
[0016] A fifth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, can be used to perform the method described above or to run the system described above.
[0017] The graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation method provided in this application utilizes the segmentation capability of the large SAM segmentation model, and adopts strategies based on image transformation and cue extension, together with a consistency-based fusion method, to generate high-quality pixel-level saliency pseudo-labels from graffiti-level annotations.
[0018] Other technical effects resulting from the additional features will be further illustrated in the corresponding embodiments.
Attached Figure Description
[0019] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Figure 1 is a flowchart of a graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation method according to an exemplary embodiment;
Figure 2 is a flowchart of a SAM-driven image transformation-based method according to an exemplary embodiment;
Figure 3 is a flowchart of a SAM-driven cue extension-based method according to an exemplary embodiment;
Figure 4 is a schematic diagram of the process of generating initial and final pixel-level pseudo-labels according to an exemplary embodiment;
Figure 5 is a schematic structural diagram of a graffiti-guided pixel-level pseudo-label generation system for weakly supervised multimodal image saliency detection according to an exemplary embodiment.
Detailed Implementation
[0020] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Parts not described in detail in the following embodiments can be implemented using existing technology.
[0021] In existing technologies, insufficient model training and inadequate supervision information limit the segmentation ability of the models, which in turn restricts the quality of the pixel-level RGB-D saliency pseudo-labels generated from graffiti labels. To address these issues, this application provides a graffiti-guided pixel-level pseudo-label generation method for weakly supervised multimodal image saliency detection. Furthermore, the generated pixel-level pseudo-labels can be used to train existing fully supervised RGB-D image saliency detection models.
[0022] Referring to Figure 1, in one embodiment of this application, a graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation method includes the following steps: S100, acquire an RGB image, a depth image, and the corresponding graffiti label. Specifically, graffiti (scribble-level) labels are a low-cost, coarse form of manual annotation in computer vision.
[0023] The RGB and depth images are paired, synchronously acquired data representing different modalities of the same scene. Their salient objects are consistent in semantics and spatial location. The paired RGB and depth images share the same set of graffiti annotations.
[0024] S200 encodes the depth image into an HHA depth image; Specifically, an HHA depth image is a three-channel image representation obtained by encoding the original depth image. HHA stands for Horizontal Disparity, Height above Ground, and Angle with Gravity.
[0025] S300, adopt the large SAM segmentation model and, through an image transformation strategy with the graffiti labels as cues, segment the RGB image and the HHA depth image respectively to obtain a set of bimodal segmentation mask pairs and their confidence scores; S400, adopt the SAM segmentation model and, through a cue extension strategy with the graffiti labels and the extended labels generated from them as cues, segment the RGB image and the HHA depth image respectively to obtain another set of bimodal segmentation mask pairs and their confidence scores; S500, based on the two sets of confidence scores, fuse the bimodal segmentation mask pairs and calculate the consistency score of each pair to obtain the initial pseudo-labels; S600, sort the initial pseudo-labels by consistency score and fuse them with weighting to obtain the final pixel-level pseudo-labels.
[0026] The above embodiments of this application have high-quality pixel-level pseudo-label generation capability: the visual segmentation large model SAM (Segment Anything Model) has powerful segmentation capabilities after being trained under the supervision of a large amount of data. Combined with the application of its prompting engineering, prompts are generated by graffiti labels. Image transformation and prompt expansion strategies are adopted to generate complementary and more complete segmentation masks. Consistent label fusion further ensures the quality of generated labels.
[0027] Specifically, SAM is a large-scale, general-purpose image segmentation model proposed by Meta AI. It performs zero-shot, high-precision pixel-level segmentation of any target in an image based on any prompt provided by the user. Its inputs are the image and the prompt (such as a point, bounding box, or mask), and its outputs are the corresponding segmentation mask and its confidence score.
[0028] The embodiments described above in this application exhibit strong environmental robustness: the complementary design of multimodal data ensures stable performance of the entire segmentation method even in extreme environments. Confidence-based weighted fusion of the segmentation masks for bimodal images improves label quality.
[0029] The embodiments described above in this application have performance competitive with fully supervised training models: the model trained using a combination of generated high-quality pixel-level pseudo-labels and graffiti labels has performance competitive with models trained directly using pixel-level labels.
[0030] To obtain RGB images, depth images, and annotations, in some specific embodiments of this application, step S100 involves simultaneously acquiring RGB and depth images using a binocular camera or an RGB-D sensor, and then manually annotating the RGB images. As a type of sparse annotation, graffiti annotation significantly reduces the manpower and time costs compared to pixel-level annotation.
[0031] For example, an RGB-D sensor is a depth camera that uses structured light or ToF principles, such as the Intel RealSense D455, which supports the simultaneous output of high-resolution RGB images and depth maps.
[0032] To make the acquired image data more suitable for applications, in some embodiments, the acquired images are preprocessed to generate data pairs suitable for input. Preprocessing steps include methods such as noise reduction and alignment.
[0033] Example: Spatial alignment: The depth map is mapped to the RGB coordinate system based on calibration parameters, ensuring an alignment error of less than 0.5 pixels. Dynamic denoising: Adaptive bilateral filtering is applied to the RGB image, while temporal median filtering, i.e., a weighted average of three consecutive frames, is applied to the depth map.
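As a rough illustration of the depth denoising step, the sketch below takes the per-pixel temporal median of three consecutive, already aligned depth frames. It follows the "median" reading of the paragraph above; the same paragraph also describes the step as a weighted average, which would be a different filter. The function name `temporal_median` and the toy frames are assumptions made for this example only.

```python
import numpy as np

def temporal_median(depth_frames):
    """Per-pixel median over a short stack of aligned depth frames."""
    stack = np.stack(depth_frames, axis=0).astype(np.float32)
    return np.median(stack, axis=0)

# Three toy 2x2 depth frames; the last one carries an outlier reading.
frames = [np.full((2, 2), 1.0), np.full((2, 2), 1.2), np.full((2, 2), 9.0)]
filtered = temporal_median(frames)  # the outlier frame does not shift the result
```

A median is used here because a single corrupted frame leaves the output unchanged, whereas an average would be pulled toward the outlier.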
[0034] The data obtained in the above embodiments serves as the input basis for subsequently obtaining pixel-level pseudo-labels.
[0035] To facilitate the segmentation of the depth map and enhance the expression of depth information, the depth image (single-channel depth map) obtained in S100 is encoded in step S200 to obtain a three-channel HHA depth map, namely a horizontal disparity map, a ground height map, and an angle map of the local surface normal and the inferred gravity direction.
[0036] The above embodiments convert the original single-channel depth information into a three-channel image with rich geometric semantics, making it directly compatible with visual models designed for RGB images (such as the Segment Anything Model) without modifying the model's input layer. At the same time, the HHA depth image enhances the model's understanding of scene geometry (such as surface orientation and object height), which helps improve the consistency between RGB modalities and depth modalities in segmentation tasks.
[0037] To overcome the sensitivity of SAM to image transformations, in some specific embodiments of this application, step S300 above employs the large SAM segmentation model and, through an image transformation strategy with the graffiti labels as cues, segments the RGB image and the HHA depth image respectively to obtain a set of bimodal segmentation mask pairs and their confidence scores. As shown in Figure 2, steps S31-S35 can be used, specifically: S31, sample cue points from the graffiti label, sampling 5 sets of points with 20 points in each set.
[0038] S32 performs various image transformations on RGB images and HHA depth images.
[0039] Specifically, image transformations include shrinking to half the original size, rotating 90 degrees counterclockwise, and flipping horizontally.
[0040] At the same time, the sampling points are also transformed accordingly.
[0041] After this step, combining the original image and 3 image transformations, there are a total of 4 sets of RGB images, HHA depth image pairs and their corresponding sampling points.
[0042] S33, input the above 4 sets of image pairs and their correspondingly transformed cue points into the SAM model; for each set of cue points, run SAM on the 4 sets of image pairs respectively, for a total of 5×4=20 segmentations, obtaining 20 RGB masks, 20 HHA masks, and a total of 40 confidence scores. S34, perform the corresponding inverse image transformation on the segmentation masks obtained in S33.
[0043] S35, for each set of cue points, the four segmentation masks of each image are fused by averaging, and the four corresponding confidence scores are averaged to give the confidence score of the fused mask. Since there are 5 sets of cue points, 5 pairs of bimodal segmentation masks and their confidence scores are obtained. For the l-th set of cue points (l = 1, ..., 5), these comprise the RGB-modality segmentation mask and its confidence score, and the HHA-depth-modality segmentation mask and its confidence score.
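Steps S32 to S35 can be sketched for a single image and a single set of cue points as follows. This is a minimal NumPy illustration, not the implementation of the application: `toy_segment` is a hypothetical stand-in for a promptable segmenter such as SAM, the half-size transform is approximated by taking every second pixel, and all function names are assumptions. Points are stored as (x, y) columns.

```python
import numpy as np

def make_transforms(h, w):
    """Return (name, image_fn, point_fn, inverse_mask_fn) for an h x w image."""
    def up2(m):
        # Undo the half-size transform by pixel repetition, cropped to h x w.
        return np.repeat(np.repeat(m, 2, axis=0), 2, axis=1)[:h, :w]
    return [
        ("identity", lambda im: im, lambda p: p, lambda m: m),
        ("half", lambda im: im[::2, ::2], lambda p: p // 2, up2),
        ("rot90ccw", lambda im: np.rot90(im, 1),
         lambda p: np.stack([p[:, 1], w - 1 - p[:, 0]], axis=1),
         lambda m: np.rot90(m, -1)),
        ("hflip", lambda im: im[:, ::-1],
         lambda p: np.stack([w - 1 - p[:, 0], p[:, 1]], axis=1),
         lambda m: m[:, ::-1]),
    ]

def fuse_over_transforms(image, points, segment):
    """Segment each transformed view, invert the masks, then average (S33-S35)."""
    h, w = image.shape[:2]
    masks, scores = [], []
    for _, t_img, t_pts, inv in make_transforms(h, w):
        mask, score = segment(t_img(image), t_pts(points))
        masks.append(inv(mask).astype(np.float32))
        scores.append(score)
    return np.mean(masks, axis=0), float(np.mean(scores))

def toy_segment(image, points):
    """Hypothetical stand-in for SAM: thresholds the image, ignores the prompts."""
    return image > 0.5, 0.9

image = np.zeros((8, 10), dtype=np.float32)
image[2:6, 4:8] = 1.0                      # a bright square as the "object"
points = np.array([[6, 2], [4, 4]])        # (x, y) cue points inside the square
fused_mask, fused_conf = fuse_over_transforms(image, points, toy_segment)
```

In the procedure described above, the same loop would run once per modality (RGB and HHA) and once per set of cue points, giving the 5×4 segmentations per modality counted in S33.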
[0044] The embodiments described above in this application compensate for the low quality of segmentation masks obtained by SAM processing under certain transformations by performing various image transformations on the images, thereby achieving higher quality segmentation masks.
[0045] Graffiti annotations occupy only a small number of pixels. To expand the potential sampling area of the cue points and obtain pixel-level pseudo-labels with more complete salient object structures, in some specific embodiments of this application, step S400 above uses the SAM segmentation model and, through a cue extension strategy with the graffiti labels and the extended labels generated from them as cues, segments the RGB image and the HHA depth image respectively to obtain another set of bimodal segmentation mask pairs and their confidence scores. As shown in Figure 3, steps S41-S46 can be used, specifically: S41, perform superpixel segmentation on the RGB image and the HHA depth image.
[0046] Specifically, the classic superpixel segmentation algorithm SLIC is adopted. This algorithm is based on K-means clustering to segment the image into superpixels with semantic information, and the RGB image and HHA depth image are segmented into 70 superpixels.
[0047] S42 uses graffiti labels to annotate superpixels.
[0048] Superpixels intersecting the foreground graffiti are labeled as foreground, and superpixels intersecting the background graffiti are labeled as background, thus obtaining the foreground mask and background mask corresponding to the superpixels of the RGB image, and the foreground mask and background mask corresponding to the superpixels of the HHA depth image.
[0049] S43, use the background mask to resolve conflicts in the foreground mask.
[0050] In the RGB foreground mask, pixels that are marked as foreground in the foreground mask and simultaneously marked as background in the background mask are set to gray, to ensure that these conflict regions are not sampled in subsequent operations.
[0051] Similarly, the same operation is applied to optimize the HHA foreground mask.
[0052] This yields the optimized foreground masks of the RGB image and the HHA depth image, in which the conflict regions are marked.
[0053] S44 performs consistency optimization on the optimized foreground mask for the two modes.
[0054] Only pixels that have the same label in the optimized RGB foreground mask and the optimized HHA foreground mask are kept, while pixels with inconsistent labels are marked as gray. These gray areas, along with the gray areas marked in S43, are treated as conflict areas and are not sampled in subsequent operations. This results in the extended label.
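Steps S42 to S44 can be illustrated with a small NumPy sketch. It assumes the superpixel maps have already been computed (for example with SLIC) and are given as integer label arrays, and that the graffiti label is encoded as 1 for foreground strokes, 2 for background strokes, and 0 elsewhere. The encoding, the `GRAY = -1` marker, and the function names are assumptions chosen for this example.

```python
import numpy as np

FG, NOT_FG, GRAY = 1, 0, -1  # GRAY marks conflict pixels excluded from sampling

def superpixel_masks(segments, scribble):
    """S42: mark every superpixel touched by a foreground or background stroke."""
    fg_ids = np.unique(segments[scribble == 1])
    bg_ids = np.unique(segments[scribble == 2])
    return np.isin(segments, fg_ids), np.isin(segments, bg_ids)

def resolve_conflicts(fg_mask, bg_mask):
    """S43: gray out pixels labeled both foreground and background."""
    out = np.where(fg_mask, FG, NOT_FG)
    out[fg_mask & bg_mask] = GRAY
    return out

def extended_label(opt_rgb, opt_hha):
    """S44: keep labels on which both modalities agree, gray out the rest."""
    return np.where(opt_rgb == opt_hha, opt_rgb, GRAY)

# Toy 4x4 example: RGB superpixels are quadrants, HHA superpixels are halves.
seg_rgb = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]])
seg_hha = np.array([[0, 0, 1, 1]] * 4)
scribble = np.zeros((4, 4), dtype=int)
scribble[0, 0] = 1; scribble[0, 2] = 1   # foreground strokes
scribble[1, 3] = 2; scribble[3, 3] = 2   # background strokes

opt_rgb = resolve_conflicts(*superpixel_masks(seg_rgb, scribble))
opt_hha = resolve_conflicts(*superpixel_masks(seg_hha, scribble))
ext = extended_label(opt_rgb, opt_hha)
```

In this toy case only the top-left quadrant survives as foreground: the top-right superpixel is touched by both stroke types, and the bottom half is labeled differently by the two modalities.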
[0055] By employing the optimization strategies described above, the accuracy of the extended labels can be improved, thereby ensuring the accuracy of the sampled prompts.
[0056] S45, perform cue point sampling, sampling 5 sets of points, each set of cue points includes 10 points sampled from the graffiti label and 10 points sampled from the extended label mask.
[0057] S46, the RGB image and the HHA depth image are input into SAM, and the sampling points from S45 are used as prompts to obtain 5 sets of bimodal segmentation mask pairs and their confidence scores. The superscript "-" distinguishes these bimodal segmentation mask pairs and confidence scores from those obtained in S35.
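The point sampling in S45 can be sketched as below, assuming the graffiti label and the extended label are available as boolean masks of the pixels eligible for sampling and that neither mask is empty. The helper names, the fixed random seed, and the (x, y) point layout are assumptions for illustration.

```python
import numpy as np

def sample_points(mask, n, rng):
    """Sample n (x, y) points from the True pixels of a non-empty boolean mask."""
    ys, xs = np.nonzero(mask)
    # Sample with replacement only when the mask has fewer than n pixels.
    idx = rng.choice(len(ys), size=n, replace=len(ys) < n)
    return np.stack([xs[idx], ys[idx]], axis=1)

def sample_prompt_groups(scribble_mask, extended_mask, groups=5, n=20, seed=0):
    """Build `groups` sets of n points: half from the scribble, half from the extension."""
    rng = np.random.default_rng(seed)
    half = n // 2
    return np.stack([
        np.concatenate([sample_points(scribble_mask, half, rng),
                        sample_points(extended_mask, n - half, rng)])
        for _ in range(groups)
    ])

scribble_mask = np.zeros((6, 6), dtype=bool)
scribble_mask[2, 1:5] = True                 # a short stroke
extended_mask = np.zeros((6, 6), dtype=bool)
extended_mask[1:4, 0:6] = True               # superpixels grown around the stroke
prompts = sample_prompt_groups(scribble_mask, extended_mask)  # shape (5, 20, 2)
```

Each of the 5 sets would then be passed to the segmenter together with the RGB image and, separately, the HHA depth image.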
[0058] In the above embodiments of this application, a reasonable extended cue is obtained through a cue expansion method based on superpixel algorithm and related optimization strategies, thereby making the distribution of sampled cue points wider and the salient objects in the segmentation mask output by SAM more complete.
[0059] To fully utilize bimodal information, in some specific embodiments of this application, step S500 above fuses the bimodal segmentation mask pairs based on mask confidence and calculates the consistency score of each bimodal segmentation mask pair to obtain the initial pseudo-labels. As shown in Figure 4, steps S51-S53 can be used, specifically: S51, perform weighted fusion of the bimodal segmentation masks corresponding to each set of cue points.
[0060] After S300 and S400, a total of 10 sets of bimodal masks are obtained, each mask with a corresponding confidence score. In each set, the bimodal masks are fused with weights given by their confidence scores. S52, in each set, binarize the bimodal segmentation masks.
[0061] S53, calculate the intersection-over-union (IoU) of each set of binarized bimodal segmentation masks as the consistency score. The consistency score is the ratio of the intersection to the union of the RGB and HHA masks after the binarization operation has been applied to each, with a very small constant included in the formula.
[0062] In the above embodiments of this application, the bimodal segmentation masks are fused using confidence weighting, and the intersection-over-union (IoU) after binarization is calculated as the consistency score. Ten initial pseudo-labels (i.e., the masks after the weighted fusion in S51) are obtained, and each initial pseudo-label corresponds to a consistency score.
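A minimal sketch of S51 to S53 for one bimodal mask pair is given below. Two details are assumptions rather than statements from the text: the fusion weights are normalized by the sum of the two confidences, and the small constant `eps` is placed in the denominator of the IoU. The threshold of 0.5 used for binarization is likewise assumed.

```python
import numpy as np

def fuse_pair(mask_r, conf_r, mask_h, conf_h):
    """S51: confidence-weighted fusion of the RGB and HHA masks (normalized weights)."""
    return (conf_r * mask_r + conf_h * mask_h) / (conf_r + conf_h)

def consistency_score(mask_r, mask_h, thr=0.5, eps=1e-6):
    """S52-S53: binarize both masks, then compute their intersection-over-union."""
    b_r, b_h = mask_r > thr, mask_h > thr
    inter = np.logical_and(b_r, b_h).sum()
    union = np.logical_or(b_r, b_h).sum()
    return float(inter / (union + eps))

mask_r = np.array([[0.9, 0.8], [0.1, 0.0]])   # toy RGB-modality mask
mask_h = np.array([[0.7, 0.2], [0.1, 0.0]])   # toy HHA-modality mask
initial = fuse_pair(mask_r, 0.9, mask_h, 0.6)  # initial pseudo-label
score = consistency_score(mask_r, mask_h)      # one of two foreground pixels agrees
```

Here the two modalities agree on one of the two pixels that either marks as foreground, so the score is close to 0.5.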
[0063] A higher bimodal consistency score generally indicates a higher-quality bimodal segmentation mask pair, and hence a higher-quality initial pseudo-label generated through fusion. To obtain high-quality labels, in some specific embodiments of this application, step S600 above sorts and fuses the initial pseudo-labels based on the consistency score to obtain the final pixel-level pseudo-labels. As shown in Figure 4, steps S61-S63 can be used, specifically: S61, based on the consistency scores, sort the 10 initial pseudo-labels from highest to lowest to obtain the ranked initial pseudo-labels.
[0064] S62, select the 4 initial pseudo-labels with the highest consistency scores.
[0065] S63, the four selected initial pseudo-labels are fused with weights given by their consistency scores, and the result is binarized to obtain the final label. The above embodiments effectively suppress single-modal noise and segmentation bias by introducing the bimodal consistency score as the screening and weighting criterion, significantly improving the accuracy and completeness of the pseudo-labels; by fusing only high-confidence candidate results, the interference of low-quality pseudo-labels with the training process is avoided, thereby obtaining a saliency detection model with near-fully supervised performance under weak supervision conditions.
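Steps S61 to S63 amount to a top-k selection followed by a score-weighted average and a threshold. The sketch below assumes the weights are the consistency scores normalized to sum to one (so at least one selected score must be positive) and that binarization uses a threshold of 0.5; the function name and the toy inputs are illustrative only.

```python
import numpy as np

def final_pseudo_label(initial_labels, scores, k=4, thr=0.5):
    """Pick the k most consistent initial pseudo-labels, fuse them, and binarize."""
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(-scores)[:k]                  # S61-S62: rank, keep top k
    weights = scores[order] / scores[order].sum()    # assumed normalization
    fused = np.tensordot(weights, np.stack(initial_labels)[order], axes=1)
    return (fused > thr).astype(np.uint8), order     # S63: weighted fusion + binarize

labels = [np.array([[1.0, 0.0]]), np.array([[1.0, 1.0]]), np.array([[0.0, 1.0]])]
final, picked = final_pseudo_label(labels, scores=[0.9, 0.3, 0.6], k=2)
```

With k=2 the labels scoring 0.9 and 0.6 are kept; the first pixel receives weight 0.6 and the second 0.4, so only the first passes the threshold.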
[0066] Based on the same technical concept, another embodiment of this application proposes a training method for an RGB-D image saliency detection model. This method combines pixel-level pseudo-labels and graffiti labels to train existing RGB-D image saliency detection models (such as CPNet, HENet, etc.). Specifically: First, design the loss function. The total loss is composed of three terms computed on the saliency map output by the model: a partial cross-entropy loss supervised by the graffiti label, and a weighted binary cross-entropy loss and a weighted intersection-over-union (IoU) loss, both supervised by the pixel-level pseudo-labels.
[0067] Next, based on the total loss mentioned above, the AdamW optimizer is used to optimize the model parameters until convergence.
[0068] In one specific implementation, the RGB-D image saliency detection model is trained with the following steps: S71, determine the training sample set, which includes RGB image samples, depth image samples, and the corresponding graffiti labels. Specifically, the training sample set can consist of 1485 RGB-D image pairs from the NJU2K dataset and 700 RGB-D image pairs from the NLPR dataset, along with their corresponding graffiti labels.
[0069] S72, input the RGB-D image samples into an existing initial segmentation model to obtain the saliency map output by that model; S73, based on the output saliency map, the graffiti labels, and the pixel-level pseudo-labels, calculate the partial cross-entropy loss, the weighted binary cross-entropy loss, and the weighted intersection-over-union (IoU) loss, and calculate the total loss from these losses; S74, based on the total loss, use the AdamW optimizer to iteratively update the structural parameters of the initial segmentation model, yielding the RGB-D image saliency detection model.
[0070] The embodiments described above have significant practical implications and value. Compared to pixel-level labels, graffiti labels significantly reduce the cost and time of manual annotation, shortening the time to annotate a sample from minutes to seconds, making the construction of large-scale RGB-D saliency datasets feasible. Through these embodiments, graffiti labels are extended to pixel-level pseudo-labels, and high-quality, large-scale pixel-level labeled datasets further promote the development of RGB-D saliency research.
[0071] It is worth noting that this application is also generalizable and extensible. The graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation method proposed in this application is also applicable to other visual tasks that rely on dense prediction, such as semantic segmentation and video object segmentation; it can also be applied to other modality combinations, such as RGB and thermal images or multimodal medical images, when images of those modalities are available.
[0072] Based on the same technical concept, one embodiment of this application provides a graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation system 100 which, as shown in Figure 5, includes: an image acquisition module 110, which acquires RGB images, depth images, and the corresponding graffiti labels; a depth image encoding module 120, which encodes the depth image into an HHA depth image; an image-transformation-based segmentation mask generation module 130, which uses the large SAM segmentation model and, through the image transformation strategy with the graffiti labels as prompts, segments the RGB image and the HHA depth image separately to obtain one set of bimodal segmentation mask pairs and their confidence scores; a cue-extension-based segmentation mask generation module 140, which uses the large SAM segmentation model and, through the cue extension strategy with the graffiti labels and the extended labels generated from them as cues, segments the RGB image and the HHA depth image separately to obtain another set of bimodal segmentation mask pairs and their confidence scores; an initial pseudo-label generation module 150, which fuses each bimodal segmentation mask pair in the two sets according to its confidence scores and calculates the consistency score of each pair to obtain the initial pseudo-labels; and a pixel-level pseudo-label generation module 160, which sorts the initial pseudo-labels by consistency score and performs weighted fusion to obtain the final pixel-level pseudo-labels.
[0073] For the specific implementation of each module/unit in the above examples of this application, refer to the steps of the graffiti-guided pixel-level pseudo-label generation method for weakly supervised multimodal image saliency detection in the above embodiments; they are not repeated here.
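As an illustration of what the initial pseudo-label generation module 150 computes for one bimodal mask pair, the NumPy sketch below fuses the RGB and HHA masks by their confidence scores and scores their agreement with the intersection-over-union of the binarized masks. The function name, the 0.5 binarization threshold, and the handling of two empty masks are assumptions made for illustration.

```python
import numpy as np


def initial_pseudo_label(rgb_mask, hha_mask, rgb_conf, hha_conf, threshold=0.5):
    """Return (fused soft label, consistency score) for one bimodal mask pair.

    rgb_mask, hha_mask: arrays of shape (H, W) with values in [0, 1].
    rgb_conf, hha_conf: non-negative confidence scores of the two masks.
    """
    rgb_mask = np.asarray(rgb_mask, dtype=np.float64)
    hha_mask = np.asarray(hha_mask, dtype=np.float64)
    total_conf = rgb_conf + hha_conf
    if total_conf <= 0:
        raise ValueError("at least one confidence score must be positive")

    # Confidence-weighted fusion of the two modalities.
    fused = (rgb_conf * rgb_mask + hha_conf * hha_mask) / total_conf

    # Consistency score: IoU between the binarized RGB and HHA masks.
    rgb_bin = rgb_mask >= threshold
    hha_bin = hha_mask >= threshold
    union = np.logical_or(rgb_bin, hha_bin).sum()
    inter = np.logical_and(rgb_bin, hha_bin).sum()
    # Assumption: two empty masks are treated as fully consistent.
    score = 1.0 if union == 0 else inter / union
    return fused, float(score)
```

A pair whose two modalities disagree strongly receives a low score and is therefore less likely to survive the later top-k selection.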
[0074] The preferred features in the above embodiments can be used individually in any embodiment, or in any combination thereof, provided they do not conflict with each other. Furthermore, parts not described in detail in the embodiments can be implemented using existing technologies.
[0075] Optionally, the memory is used to store programs. The memory may include volatile memory, for example random-access memory (RAM) such as static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (such as application programs and functional modules that implement the above methods), computer instructions, etc., and the aforementioned computer programs and computer instructions can be partitioned and stored in one or more memories. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by the processor.
[0076] The aforementioned computer programs, computer instructions, etc., can be stored in partitions within one or more memory locations. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by a processor.
[0077] A processor is used to execute a computer program stored in memory to implement the various steps of the methods involved in the above embodiments. For details, please refer to the relevant descriptions in the preceding method embodiments.
[0078] The processor and memory can be separate structures or integrated structures. When the processor and memory are separate structures, they can be coupled together via a bus.
[0079] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0080] This application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0081] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0082] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0083] The foregoing has described some specific embodiments of this application. It should be understood that this application is not limited to the specific embodiments described above, and those skilled in the art can make various modifications or variations within the scope of the claims, which do not affect the substantive content of this application. The above-described preferred features can be used in any combination without conflict.
Claims
1. A graffiti-guided weakly supervised multi-modal pixel-level pseudo label generation method, characterized in that it includes: obtaining RGB images, depth images, and corresponding graffiti labels; encoding the depth image into an HHA depth image; using the large SAM segmentation model and, through an image transformation strategy with the graffiti label as a cue, segmenting the RGB image and the HHA depth image respectively to obtain one set of bimodal segmentation mask pairs and confidence scores; using the large SAM segmentation model and, through a cue extension strategy with the graffiti label and the extended label generated therefrom as cues, segmenting the RGB image and the HHA depth image respectively to obtain another set of bimodal segmentation mask pairs and confidence scores; fusing the two sets of bimodal segmentation mask pairs based on the two sets of confidence scores, and calculating the consistency score of the bimodal segmentation mask pairs, to obtain the initial pseudo-labels; and sorting the initial pseudo-labels and performing weighted fusion based on the consistency score to obtain pixel-level pseudo-labels.
2. The graffiti-guided weakly supervised multi-modal pixel-level pseudo label generation method according to claim 1, characterized in that the step of using the large SAM segmentation model, through an image transformation strategy with the graffiti label as a cue, to segment the RGB image and the HHA depth image respectively and obtain a set of bimodal segmentation mask pairs and their confidence scores includes: sampling a groups of points from the graffiti labels, with N points in each group, where a and N are both integers greater than or equal to 1; performing multiple image transformations on the RGB image, the HHA depth image, and the sampling points; inputting the transformed images together with the original RGB image and HHA depth image into SAM, using the corresponding sampling points as prompts, and outputting the segmentation mask pairs and the corresponding prediction confidences; and applying the inverse transformation to the segmentation masks, averaging them to fuse them, and averaging the confidences, to obtain the bimodal segmentation mask pairs and mask confidences corresponding to the a groups of sampling points.
3. The graffiti-guided weakly supervised multi-modal pixel-level pseudo label generation method according to claim 1, characterized in that the step of using the cue extension strategy, with the graffiti label and the extended label generated therefrom as cues, to perform segmentation and obtain another set of bimodal segmentation mask pairs and their confidence scores includes: performing superpixel segmentation on the RGB image and the HHA depth image respectively to obtain their respective superpixel segmentation maps; in the superpixel segmentation maps of the RGB image and the HHA depth image, assigning each superpixel that intersects a graffiti label the label of that graffiti, to obtain the foreground mask and background mask of the RGB image and of the HHA depth image respectively; performing conflict optimization on the foreground mask and the background mask to obtain optimized masks for the RGB image and the HHA depth image; performing consistency optimization on the optimized mask of the RGB image and the optimized mask of the HHA depth image to obtain the extended label mask; performing a groups of point sampling on the graffiti label and the extended label mask, where for each group N/2 points are sampled on the graffiti label and N/2 points on the extended label mask and merged into N sampling points; and inputting the RGB image and the HHA depth image into SAM with the N sampling points as prompts, to obtain the bimodal segmentation mask pairs and mask confidence scores corresponding to the a groups of sampling points.
4. The graffiti-guided weakly supervised multi-modal pixel-level pseudo label generation method according to claim 3, characterized in that the step of performing conflict optimization on the foreground mask and the background mask to obtain optimized masks for the RGB image and the HHA depth image includes: in the foreground mask of the RGB image, setting pixels that are simultaneously marked as foreground and background to gray to prevent their use in subsequent cue sampling; likewise, in the foreground mask of the HHA depth image, setting pixels that are simultaneously marked as foreground and background to gray; thereby obtaining optimized foreground masks that exclude the conflict regions for the RGB image and the HHA depth image respectively.
5. The graffiti-guided weakly supervised multi-modal pixel-level pseudo label generation method according to claim 1, characterized in that fusing the two sets of bimodal segmentation mask pairs based on the two sets of confidence scores and calculating the consistency score of the bimodal segmentation mask pairs to obtain the initial pseudo-labels includes: the image transformation strategy generates a bimodal segmentation mask pairs and their confidence scores, and the cue extension strategy generates another a bimodal segmentation mask pairs and their confidence scores; for each of the above 2a bimodal segmentation mask pairs, performing confidence-weighted fusion of the two masks in the pair and using the fused result as an initial pseudo-label; for each of the above 2a bimodal segmentation mask pairs, binarizing the masks and calculating the intersection-over-union (IoU) between the RGB mask and the HHA mask as the consistency score; thereby obtaining 2a initial pseudo-labels and their corresponding consistency scores.
6. The graffiti-guided weakly supervised multi-modal pixel-level pseudo label generation method according to claim 5, characterized in that the step of sorting the initial pseudo-labels and performing weighted fusion based on the consistency score to obtain the final pixel-level pseudo-labels includes: arranging the 2a initial pseudo-labels in descending order of consistency score; selecting the k initial pseudo-labels with the highest consistency scores, where k is an integer and 1≤k≤2a; and performing weighted fusion of the k initial pseudo-labels based on their consistency scores to obtain the final pixel-level pseudo-labels.
7. A training method for an RGB-D image saliency detection model, characterized in that it includes: determining the training sample set, which includes RGB image samples, depth image samples, and their corresponding graffiti label set; obtaining the pixel-level pseudo-label set corresponding to the RGB images by using the method described in any one of claims 1-6; inputting the training samples into the RGB-D image saliency detection model to be trained to obtain the saliency map; based on the saliency map and the pixel-level pseudo-labels, calculating the binary cross-entropy loss and the intersection-over-union (IoU) loss; based on the saliency map and the graffiti labels, calculating the partial cross-entropy loss; calculating the total loss based on the binary cross-entropy loss, the IoU loss, and the partial cross-entropy loss; and, based on the total loss, using an optimizer to iteratively update the structural parameters of the model, resulting in a trained RGB-D image saliency detection model.
8. A graffiti-guided weakly supervised multimodal pixel-level pseudo-label generation system, characterized in that it includes: an image acquisition module, which acquires RGB images, depth images, and corresponding graffiti labels; a depth image encoding module, which encodes the depth image into an HHA depth image; an image-transformation-based segmentation mask generation module, which adopts the large SAM segmentation model and uses the graffiti label as a prompt to segment the RGB image and the HHA depth image respectively, thereby obtaining one set of bimodal segmentation mask pairs and their confidence scores; a cue-extension-based segmentation mask generation module, which adopts the large SAM segmentation model and, through the cue extension strategy with the graffiti label and the extended label generated therefrom as cues, segments the RGB image and the HHA depth image respectively to obtain another set of bimodal segmentation mask pairs and their confidence scores; an initial pseudo-label generation module, which fuses the two sets of bimodal segmentation mask pairs according to the two sets of confidence scores and calculates the consistency score of the bimodal segmentation mask pairs to obtain the initial pseudo-labels; and a pixel-level pseudo-label generation module, which sorts the initial pseudo-labels and performs weighted fusion according to the consistency score to obtain the final pixel-level pseudo-labels.
9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it can be used to perform the method of any one of claims 1-7, or to run the system of claim 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the program can be used to perform the method of any one of claims 1-7, or to run the system of claim 8.
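The transform, segment, inverse-transform, and average sequence recited in claim 2 can be illustrated with a single transformation, a horizontal flip. In this NumPy sketch, `segment` is a hypothetical callable standing in for a point-prompted segmenter such as SAM: it is assumed to take an image and an array of (x, y) prompt points and to return a soft mask and a confidence score. It is not SAM's actual API, and the function names are introduced here for illustration only.

```python
import numpy as np


def hflip_points(points, width):
    """Mirror (x, y) prompt points across the vertical axis of the image."""
    pts = np.asarray(points).copy()
    pts[:, 0] = width - 1 - pts[:, 0]
    return pts


def segment_with_flip(image, points, segment):
    """Average a segmenter's output over the original and the flipped image.

    image: array of shape (H, W) or (H, W, C).
    points: array-like of shape (N, 2) holding (x, y) prompt points.
    segment: hypothetical callable (image, points) -> (mask, confidence).
    """
    width = image.shape[1]
    points = np.asarray(points)

    # Segment the original image with the original prompt points.
    mask_a, conf_a = segment(image, points)

    # Segment the flipped image with the correspondingly flipped points.
    mask_b, conf_b = segment(image[:, ::-1], hflip_points(points, width))

    # Inverse transform: flip the second mask back before averaging.
    mask_b = mask_b[:, ::-1]

    return (mask_a + mask_b) / 2.0, (conf_a + conf_b) / 2.0
```

The prompt points have to be transformed together with the image; otherwise they would no longer lie on the graffiti strokes in the flipped view.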