Weakly supervised remote sensing image object detection method based on multi-modal pseudo label guidance and adaptive fusion
Using multimodal pseudo-label guidance and adaptive fusion, the method generates pseudo-RGB-D labels for remote sensing images with a depth estimation model and a dual-stream encoder. This reduces the heavy dependence on annotations and the subjective bias of traditional methods, and enables efficient, high-precision target detection in remote sensing images.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- KUNMING UNIV OF SCI & TECH
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-23
Description
Technical Field
[0001] This application relates to the field of image target detection technology, and in particular to a weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion.
Background Technology
[0002] Weakly supervised camouflaged object detection (WS-COD) aims to reduce reliance on large amounts of pixel-level labeled data and is suited to detecting camouflaged targets in complex environments. Traditional camouflaged target detection methods require a large number of high-quality annotations, which are labor-intensive to produce and susceptible to subjective bias; weakly supervised methods instead use coarse annotations (such as scribbles or bounding boxes) to reduce annotation costs.
[0003] However, the high similarity between camouflaged targets and the background in color, texture, and boundaries poses a significant challenge to weak supervision: the targets blend into the background in all three respects, so subtle differences are difficult to capture. Recent research has focused on improving weak supervision through adaptive mechanisms, multi-scale feature extraction, and boundary guidance. Other strategies include going beyond human vision, multimodal fusion, and multi-task joint modeling. Despite these advances, effectively using high-level visual features under sparse, weakly supervised annotations remains challenging.
[0004] In view of this, this application proposes a weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion, aiming to achieve accurate detection of remote sensing images under weak supervision.
Summary of the Invention
[0005] The main purpose of this application is to provide a weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion, aiming to solve the problem of how to achieve accurate detection of remote sensing images under weak supervision.
[0006] To achieve the above objectives, this application provides a weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion, characterized in that the method includes the following steps:
[0007] S10, a depth map corresponding to the acquired remote sensing image is generated through a depth estimation model, and a dual-stream encoder is used to extract RGB features from the remote sensing image and depth features from the depth map, respectively;
[0008] S20, the RGB features and the depth features are fused through the cross-modal adaptive relationship fusion module to obtain fused features;
[0009] S30, the cross-modal pseudo-supervised label mapping module is used to generate RGB pseudo-labels and depth pseudo-labels, which are aggregated to form pseudo-RGB-D labels;
[0010] S40, based on the pseudo-RGB-D labels, a unified structure loss and a depth dual loss are used to perform supervised training on the main branch and the auxiliary branch;
[0011] S50, the detection results of camouflaged targets are output.
[0012] Optionally, S20 includes:
[0013] S21, the RGB features and the depth features are convolved to obtain the query vector Q, the key vector K, and the value vector V;
[0014] S22, based on the cross-modal spatial view attention branch, the attention score between the feature encoding blocks is calculated and position decoding is performed to obtain the spatial view features:
[0015] [formula missing]
[0016] where the per-head terms are the individual heads of the query vector, key vector, and value vector, respectively; the formula also involves the number of heads, the number of pixel tokens, and the embedding dimension D; e is the scaling factor [definition missing]; the concatenation operator joins all heads; and the output projection matrix of the spatial view fuses the heads;
[0017] S23, based on the cross-modal channel view attention branch, the similarity between different feature channels is calculated to obtain the channel view features:
[0018] [formula missing]
[0019] where the formula uses a scaling factor and the output projection matrix of the channel view, which fuses the heads;
[0020] S24, the spatial view features and the channel view features are integrated through two learnable weights to generate the cross-modal attention depth feature output:
[0021] [formula missing]
[0022] where the two terms are the output features of the cross-modal spatial view attention branch and the cross-modal channel view attention branch, and the two learnable weights adjust their respective contributions to the cross-modal attention output;
[0023] S25, similarly, the cross-modal attention RGB feature output corresponding to the RGB features is determined:
[0024] [formula missing]
[0025] where the output represents the propagation of depth information back to the RGB features; its two terms are the spatial view and channel view attention outputs that carry depth information into the RGB feature stream, and two additional learnable weights adjust their respective contributions;
[0026] S26, each original feature map is enhanced with its corresponding cross-modal attention output, and the enhanced feature maps are concatenated to generate the final fused feature:
[0027] [formula missing]
[0028] where Cat(.,.) denotes the feature map concatenation operation, and its two inputs are built from the original depth feature map and the original RGB feature map extracted by the dual-stream encoder.
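The formulas for S22 to S26 are not reproduced in this text. The LaTeX below is a hedged reconstruction from the symbol descriptions alone, not the patent's own notation. Every symbol name is an assumption: $h$ for the number of heads, $W_s$ and $W_c$ for the output projections, $e$ and $e_c$ for the two scaling factors, $\alpha, \beta, \gamma, \delta$ for the learnable weights, and primes for the branch outputs computed in the depth-to-RGB direction. The enhancement operator $\oplus$ is left abstract because the text does not say whether enhancement is additive or multiplicative.

```latex
\begin{aligned}
\mathrm{head}^{s}_i &= \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{e}\right) V_i, \qquad i = 1,\dots,h \\
F_{s} &= \mathrm{Concat}\big(\mathrm{head}^{s}_1,\dots,\mathrm{head}^{s}_h\big)\, W_{s} \\
\mathrm{head}^{c}_i &= V_i \,\mathrm{Softmax}\!\left(\frac{Q_i^{\top} K_i}{e_c}\right)^{\!\top} \\
F_{c} &= \mathrm{Concat}\big(\mathrm{head}^{c}_1,\dots,\mathrm{head}^{c}_h\big)\, W_{c} \\
F_{r \to d} &= \alpha\, F_{s} + \beta\, F_{c} \\
F_{d \to r} &= \gamma\, F'_{s} + \delta\, F'_{c} \\
F_{\mathrm{fuse}} &= \mathrm{Cat}\big(F_{d} \oplus F_{r \to d},\; F_{r} \oplus F_{d \to r}\big)
\end{aligned}
```

Here $F_d$ and $F_r$ stand for the original depth and RGB feature maps from the dual-stream encoder. The first two lines follow the standard multi-head attention form, which matches the description of per-head query, key, and value terms, a scaling factor, a concatenation, and an output projection.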
[0029] Optionally, S30 includes:
[0030] S31, the image is divided into superpixel regions of a preset size; wherein, if the number of foreground pixels in a superpixel region exceeds the number of background pixels, and the number of foreground pixels is greater than 1, then the superpixel is marked as foreground; otherwise, it is marked as background.
[0031] S32, the sparse scribble annotation Y is propagated to the superpixel regions generated from the RGB image and to the superpixel regions generated from the depth map, respectively; a region containing more foreground scribble pixels than background scribble pixels is labeled as foreground, and otherwise as background, which yields a pseudo-RGB label based on RGB texture boundaries and a pseudo-depth label based on depth structure boundaries;
[0032] S33, a pixel-adaptive refinement method is applied to smooth and refine the pseudo-RGB label and the pseudo-depth label, which are then aggregated to form the pseudo-RGB-D label:
[0033] [formula missing]
[0034] Optionally, S40 includes:
[0035] S41, a unified structure loss is employed to supervise the output of the main branch:
[0036] [formula missing]
[0037] where the first term is the binary cross-entropy loss, used to optimize pixel-level accuracy between the predicted output of the i-th layer and the aggregated pseudo-RGB-D label; the second term is the intersection-over-union loss, used to optimize region overlap, focusing on the overall degree of overlap between the predicted region and the pseudo-label that serves as the ground-truth region; the third term is the structural similarity loss, which keeps the prediction structurally consistent with the pseudo-label; and i and ∑ denote the index and the summation over the multi-scale deep supervision outputs;
[0038] S42, a depth dual loss is employed to supervise the auxiliary branch:
[0039] [formula missing]
[0040] where the first term is the binary cross-entropy loss, which uses the pseudo-depth label to distinguish the depth values of the camouflaged object from those of its surroundings; and the second term is the depth smoothing loss, which multiplies the pseudo-RGB-D label with the depth prediction to mask background pixels, keeping the internal structure of the object smooth and reducing high-frequency noise.
[0041] Specifically:
[0042] [formula missing]
[0043] [formula missing]
[0044] where the symbols denote, in turn: the final prediction map generated by the depth branch; the pseudo-depth label; the normal vector at a pixel p within the object region; the normal vectors at its neighboring pixels q; the cosine similarity between two vectors; the gradient vector in the neighborhood of pixel p; and the gradient of the prediction map at pixel p;
[0045] Specifically:
[0046] [formula missing]
[0047] [formula missing]
[0048] where the two filters are the Sobel operators, and the two values are the masked depth predictions at pixels p and q;
[0049] S43, supervised training is performed on the main branch and the auxiliary branch based on the unified structure loss and the depth dual loss.
[0050] This application has at least the following beneficial effects:
[0051] It effectively solves the problems of insufficient utilization of multimodal information, large pseudo-label noise, and insufficient fusion of global and local information under weak supervision, and achieves high-precision and high-efficiency salient target detection on multiple datasets.
Description of the Drawings
[0052] Figure 1 is a text flowchart of the weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion according to the embodiments of this application;
[0053] Figure 2 is a schematic diagram of the graphic flow of the weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion according to the embodiments of this application.
[0054] The realization of the purpose, functional features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0055] To better understand the above technical solutions, exemplary embodiments of this disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of this disclosure to those skilled in the art.
[0056] First Embodiment
[0057] Referring to Figure 1 and Figure 2, which show the text flowchart and the graphic flowchart of the method, respectively, this embodiment provides a weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion, which includes the following steps:
[0058] S10, a depth map corresponding to the acquired remote sensing image is generated through a depth estimation model, and a dual-stream encoder is used to extract RGB features from the remote sensing image and depth features from the depth map, respectively;
[0059] In this embodiment, the collected remote sensing images are first input into the Depth Anything depth estimation model. By exploiting large-scale unlabeled data and the semantic priors of a pre-trained encoder, the model is forced to actively learn additional visual knowledge, which gives it robust depth prediction capability, and it outputs the corresponding depth map. RGB features are then extracted from the remote sensing image and depth features from the depth map.
[0060] For example, Depth Anything V2, a state-of-the-art model in this field, is used to obtain an accurate depth map. Given an input RGB image of size [size missing], [details missing]. Here, H and W denote the height and width of the image. The goal is to use depth maps from multimodal data to support the modeling of pseudo-depth labels, while combining spatial information to segment camouflaged objects more accurately. Initially, the input image I is fed into a monocular depth estimation network to generate a high-precision depth map D.
[0061] I and D are processed separately by the dual-stream encoder to extract RGB features and spatial (depth) features, respectively. The RGB and spatial features are then fused and decoded to give the main branch output Fout, which is designed to extract the boundary information of the camouflaged object. At the same time, the spatial features are decoded on their own to give the auxiliary branch output Dout, which captures the position of the camouflaged object. Finally, pseudo-RGB labels and pseudo-depth labels are generated from the scribble annotation together with I and D, and are aggregated into pseudo-RGB-D labels. The pseudo-RGB-D labels supervise the entire network, while the pseudo-depth labels provide additional auxiliary supervision to improve performance.
[0062] S20, the RGB features and the depth features are fused through the cross-modal adaptive relationship fusion module to obtain fused features;
[0063] In this embodiment, the input RGB feature map and depth feature map are convolved to obtain the query vector Q, key vector K, and value vector V. Positional encoding is then applied to achieve block-level patch re-embedding; by recombining Q, K, and V, the attention mechanism operates at the block level, so that each block performs its attention calculation independently. Finally, the multi-head self-attention mechanism of the transformer is used to calculate the attention score between the feature encoding blocks, followed by positional decoding. After interaction between the two information sources, attention becomes more prominent near the edges of the camouflaged object. This can be expressed as:
[0064] [formula missing]
[0065] where the per-head terms are the individual heads of the query, key, and value, respectively; the formula also involves the number of heads, the number of pixel tokens, and the embedding dimension D; e is the scaling factor [definition missing]; the concatenation operator joins all heads; and the output projection matrix of the spatial view fuses these heads.
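The spatial view attention described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the query is assumed to come from the depth stream and the key and value from the RGB stream, the scaling factor is assumed to be the square root of the per-head dimension, and positional encoding and block-level re-embedding are omitted. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def split_heads(tokens, num_heads):
    """Reshape (N, D) tokens into (heads, N, D // heads)."""
    n, d = tokens.shape
    return tokens.reshape(n, num_heads, d // num_heads).transpose(1, 0, 2)

def spatial_view_attention(q, k, v, w_out, num_heads):
    """Token-to-token attention: each of N positions attends to all N positions."""
    n, d = q.shape
    qh, kh, vh = (split_heads(t, num_heads) for t in (q, k, v))
    scale = np.sqrt(d / num_heads)  # assumed scaling factor
    scores = softmax(qh @ kh.transpose(0, 2, 1) / scale)  # (heads, N, N)
    heads = scores @ vh                                    # (heads, N, D/heads)
    merged = heads.transpose(1, 0, 2).reshape(n, d)        # concatenate heads
    return merged @ w_out, scores                          # output projection

rng = np.random.default_rng(0)
N, D, H = 16, 8, 2
depth_tokens = rng.normal(size=(N, D))  # stand-in for convolved depth features
rgb_tokens = rng.normal(size=(N, D))    # stand-in for convolved RGB features
w_out = rng.normal(size=(D, D))
spatial_out, spatial_scores = spatial_view_attention(
    depth_tokens, rgb_tokens, rgb_tokens, w_out, H)
```

Each row of `spatial_scores` is a distribution over the N token positions, which is what makes this the "spatial" view.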
[0066] Cross-modal channel view attention is similar to cross-modal spatial view attention, the main difference being the omission of positional encoding and a shift in focus from spatial similarity to the similarity between independent feature channels. This method does not compare spatial locations within a feature sequence, but rather calculates the similarity between different feature channels, thereby capturing the complex relationships between channels in a multimodal feature map. The mathematical expression is:
[0067] [formula missing]
[0068] where the formula uses a scaling factor and the output projection matrix of the channel view, which fuses these heads.
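A matching sketch of the channel view attention is given below, again as an assumption-laden illustration rather than the patent's formula. The similarity is computed between feature channels instead of between spatial positions, so the score matrix per head has shape (channels, channels). The scaling factor (square root of the token count) and the function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def split_heads(tokens, num_heads):
    """Reshape (N, D) tokens into (heads, N, D // heads)."""
    n, d = tokens.shape
    return tokens.reshape(n, num_heads, d // num_heads).transpose(1, 0, 2)

def channel_view_attention(q, k, v, w_out, num_heads):
    """Channel-to-channel attention: similarity between feature channels."""
    n, d = q.shape
    qh, kh, vh = (split_heads(t, num_heads) for t in (q, k, v))
    scale = np.sqrt(n)  # assumed scaling factor
    # scores[h, c, c2]: similarity of query channel c to key channel c2
    scores = softmax(qh.transpose(0, 2, 1) @ kh / scale)  # (heads, C, C)
    # Each output channel c is a weighted mix of the value channels.
    heads = vh @ scores.transpose(0, 2, 1)                # (heads, N, C)
    merged = heads.transpose(1, 0, 2).reshape(n, d)       # concatenate heads
    return merged @ w_out, scores

rng = np.random.default_rng(1)
N, D, H = 16, 8, 2
depth_tokens = rng.normal(size=(N, D))
rgb_tokens = rng.normal(size=(N, D))
w_out = rng.normal(size=(D, D))
channel_out, channel_scores = channel_view_attention(
    depth_tokens, rgb_tokens, rgb_tokens, w_out, H)
```

No positional encoding appears here, consistent with the statement that the channel view omits it.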
[0069] The output features of the cross-modal spatial view attention branch and the cross-modal channel view attention branch are integrated through two learnable weights to generate a cross-modal attention output that represents the propagation of RGB information to the depth features. The propagation of depth information to the RGB features is captured in the same way: the variables in the above formulas are swapped accordingly, and two additional learnable weights are introduced. Each original feature map is then augmented with its corresponding cross-modal attention output, and the augmented feature maps are concatenated to generate the final output, expressed as:
[0070] [formula missing]
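The weighting and concatenation step can be illustrated with a few lines of NumPy. This sketch assumes that "augmenting" a feature map means residual addition, which the text does not confirm, and it uses fixed scalars where the method uses learnable weights. The function names are hypothetical.

```python
import numpy as np

def fuse_views(f_spatial, f_channel, w_spatial, w_channel):
    """Weighted sum of the spatial view and channel view attention outputs."""
    return w_spatial * f_spatial + w_channel * f_channel

def carf_output(f_rgb, f_depth, attn_to_depth, attn_to_rgb):
    """Enhance each stream with its cross-modal output, then concatenate.

    Assumption: enhancement is element-wise addition.
    """
    enhanced_depth = f_depth + attn_to_depth
    enhanced_rgb = f_rgb + attn_to_rgb
    return np.concatenate([enhanced_depth, enhanced_rgb], axis=-1)

N, D = 4, 3
f_rgb = np.ones((N, D))
f_depth = 2.0 * np.ones((N, D))
# RGB -> depth direction, with illustrative weights 0.5 and 0.5
to_depth = fuse_views(np.full((N, D), 1.0), np.full((N, D), 3.0), 0.5, 0.5)
# depth -> RGB direction, with two additional illustrative weights
to_rgb = fuse_views(np.full((N, D), 2.0), np.full((N, D), 4.0), 0.25, 0.75)
fused = carf_output(f_rgb, f_depth, to_depth, to_rgb)
```

In a trainable version the four scalars would be parameters updated by the optimizer, so the network learns how much each view contributes in each direction.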
[0071] Summarized as follows:
[0072] S21, the RGB features and the depth features are convolved to obtain the query vector Q, the key vector K, and the value vector V;
[0073] S22, based on the cross-modal spatial view attention branch, the attention score between the feature encoding blocks is calculated and position decoding is performed to obtain the spatial view features:
[0074] [formula missing]
[0075] where the per-head terms are the individual heads of the query vector, key vector, and value vector, respectively; the formula also involves the number of heads, the number of pixel tokens, and the embedding dimension D; e is the scaling factor [definition missing]; the concatenation operator joins all heads; and the output projection matrix of the spatial view fuses the heads;
[0076] S23, based on the cross-modal channel view attention branch, the similarity between different feature channels is calculated to obtain the channel view features:
[0077] [formula missing]
[0078] where the formula uses a scaling factor and the output projection matrix of the channel view, which fuses the heads;
[0079] S24, the spatial view features and the channel view features are integrated through two learnable weights to generate the cross-modal attention depth feature output:
[0080] [formula missing]
[0081] where the two terms are the output features of the cross-modal spatial view attention branch and the cross-modal channel view attention branch, and the two learnable weights adjust their respective contributions to the cross-modal attention output;
[0082] S25, similarly, the cross-modal attention RGB feature output corresponding to the RGB features is determined:
[0083] [formula missing]
[0084] where the output represents the propagation of depth information back to the RGB features; its two terms are the spatial view and channel view attention outputs that carry depth information into the RGB feature stream, and two additional learnable weights adjust their respective contributions;
[0085] S26, each original feature map is enhanced with its corresponding cross-modal attention output, and the enhanced feature maps are concatenated to generate the final fused feature:
[0086] [formula missing]
[0087] where Cat(.,.) denotes the feature map concatenation operation, and its two inputs are built from the original depth feature map and the original RGB feature map extracted by the dual-stream encoder.
[0088] S30, the cross-modal pseudo-supervised label mapping module is used to generate RGB pseudo-labels and depth pseudo-labels, which are aggregated to form pseudo-RGB-D labels;
[0089] In this embodiment, a cross-modal pseudo-supervised label mapping module (CPLM) is used to generate and aggregate RGB and depth pseudo-labels to form pseudo-RGB-D labels. This step aims to create comprehensive pseudo-RGB-D labels by fusing pseudo-RGB and pseudo-depth labels to supervise the network's output. Furthermore, considering the unique challenges posed by camouflaged objects, pseudo-depth labels are introduced to refine the final prediction results of the depth branch, thereby achieving more accurate localization of camouflaged objects.
[0090] Specifically, and optionally, this embodiment introduces a foreground-background selective expansion module to generate bimodal pseudo-labels. The image is first segmented into superpixel regions of appropriate size. If the number of foreground pixels within a superpixel exceeds the number of background pixels, and the number of foreground pixels is greater than one, the superpixel is marked as foreground; otherwise, it is marked as background. By propagating the scribble annotation Y to the RGB superpixels and to the depth superpixels in this way, the corresponding pseudo-RGB labels and pseudo-depth labels are generated, which balances efficiency and performance.
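The foreground-background rule above can be sketched as follows. The scribble encoding (0 for unlabeled, 1 for foreground, 2 for background) and the grid partition used as a stand-in for a real superpixel algorithm are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def grid_superpixels(height, width, cell):
    """Stand-in for a superpixel algorithm: label pixels by square grid cell."""
    rows = np.arange(height) // cell
    cols = np.arange(width) // cell
    n_cols = (width + cell - 1) // cell
    return rows[:, None] * n_cols + cols[None, :]

def propagate_scribbles(scribble, segments):
    """Expand sparse scribbles to whole superpixels.

    A superpixel becomes foreground only if it holds more foreground than
    background scribble pixels AND more than one foreground scribble pixel.
    """
    n_seg = int(segments.max()) + 1
    flat_seg = segments.ravel()
    fg = np.bincount(flat_seg, weights=(scribble.ravel() == 1).astype(float),
                     minlength=n_seg)
    bg = np.bincount(flat_seg, weights=(scribble.ravel() == 2).astype(float),
                     minlength=n_seg)
    is_foreground = (fg > bg) & (fg > 1)
    return is_foreground[segments].astype(np.uint8)

segments = grid_superpixels(8, 8, 4)           # four 4x4 superpixels
scribble = np.zeros((8, 8), dtype=int)
scribble[1, 1] = scribble[1, 2] = scribble[2, 2] = 1   # 3 fg pixels, top-left
scribble[1, 6] = 1                                      # lone fg pixel, top-right
scribble[5, 5] = scribble[6, 6] = 2                     # bg pixels, bottom-right
pseudo_label = propagate_scribbles(scribble, segments)
```

Running the same function once with superpixels computed from the RGB image and once with superpixels computed from the depth map would give the pseudo-RGB label and the pseudo-depth label, respectively. The lone foreground pixel in the top-right cell does not flip that cell, which shows how the "greater than one" condition suppresses isolated scribble noise.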
[0091] Furthermore, a pixel-adaptive refinement method (PAR) is used to smooth and refine the two pseudo-labels, ultimately generating an aggregated pseudo-RGB-D label. PAR defines a convolutional kernel that combines RGB and spatial information and is computed from the neighboring locations of each pixel. The kernel is applied to the average of the two pseudo-labels through iterated adaptive convolution, which effectively corrects erroneous labels.
[0092] [formula missing]
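A simplified sketch of this refinement step is shown below. It is not the PAR kernel itself: the affinity here uses only intensity differences in a 3x3 neighborhood of a single-channel guide image, whereas the described kernel also uses spatial information. The value of `sigma`, the iteration count, and the final 0.5 threshold are assumptions chosen for the example.

```python
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def neighbor_stack(x):
    """Return an (8, H, W) array holding the 8 neighbors of every pixel."""
    h, w = x.shape
    padded = np.pad(x, 1, mode="edge")
    return np.stack([padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
                     for dy, dx in OFFSETS])

def refine(label, guide, sigma=0.1, iters=5):
    """Iteratively smooth `label` with an affinity kernel derived from `guide`."""
    diff = neighbor_stack(guide) - guide[None]
    logits = -(diff ** 2) / (2.0 * sigma ** 2)
    logits -= logits.max(axis=0, keepdims=True)
    kernel = np.exp(logits)
    kernel /= kernel.sum(axis=0, keepdims=True)   # weights over the 8 neighbors
    out = label.astype(float)
    for _ in range(iters):
        out = (kernel * neighbor_stack(out)).sum(axis=0)
    return out

guide = np.zeros((6, 6))
guide[:, 3:] = 1.0                      # two flat regions in the guide image
y_rgb = (guide > 0.5).astype(float)     # pseudo-RGB label
y_depth = y_rgb.copy()
y_depth[2, 1] = 1.0                     # one spurious foreground pixel
averaged = 0.5 * (y_rgb + y_depth)      # average of the two pseudo-labels
refined = refine(averaged, guide)
pseudo_rgbd = refined > 0.5             # assumed binarization threshold
```

Because neighbors across the intensity edge receive almost zero weight, the object boundary is preserved while the isolated disagreement between the two pseudo-labels is smoothed away.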
[0093] The decoded output of the cross-modal adaptive relationship fusion module (CARF) is used as the main branch prediction, while the decoded output of the depth feature branch is used as the auxiliary prediction for supervision. A prediction head is applied to each of the four decoded outputs of the main branch, and the results are upsampled, to achieve deep supervision that also constrains the outputs of the early layers.
[0094] S40, based on the pseudo-RGB-D labels, a unified structure loss and a depth dual loss are used to perform supervised training on the main branch and the auxiliary branch;
[0095] In this embodiment, the unified structure loss is used to supervise the output of the main branch, and the depth dual loss is used to supervise the output of the auxiliary branch, which together complete the supervised training of the network.
[0096] Specifically, S41 adopts a unified structure loss to supervise the output of the main branch:
[0097] [formula missing]
[0098] where the first term is the binary cross-entropy loss, used to optimize pixel-level accuracy between the predicted output of the i-th layer and the aggregated pseudo-RGB-D label; the second term is the intersection-over-union loss, used to optimize region overlap, focusing on the overall degree of overlap between the predicted region and the ground-truth region (the pseudo-label); the third term is the structural similarity loss, which keeps the prediction structurally consistent with the pseudo-label; and i and ∑ denote the index and the summation over the multi-scale deep supervision outputs.
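The three-term structure loss can be sketched in NumPy as below. Equal weighting of the three terms is an assumption, since the formula is not reproduced in the text, and the SSIM term is a simplified global-statistics version rather than the usual windowed SSIM. The predictions are assumed to be already upsampled to the label resolution.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and a binary label."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def iou_loss(pred, target, eps=1e-7):
    """Soft intersection-over-union loss on the whole map."""
    inter = (pred * target).sum()
    union = (pred + target - pred * target).sum()
    return float(1.0 - (inter + eps) / (union + eps))

def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM using global image statistics (simplification)."""
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return float(1.0 - ssim)

def unified_structure_loss(preds, target):
    """Sum the three terms over all deeply supervised outputs (index i)."""
    return sum(bce_loss(p, target) + iou_loss(p, target) + ssim_loss(p, target)
               for p in preds)

pseudo_rgbd = np.zeros((8, 8))
pseudo_rgbd[2:6, 2:6] = 1.0
loss_good = unified_structure_loss([pseudo_rgbd, pseudo_rgbd], pseudo_rgbd)
loss_bad = unified_structure_loss([1.0 - pseudo_rgbd], pseudo_rgbd)
```

The example compares a prediction identical to the pseudo-label with its inverse: the first loss is close to zero and the second is much larger.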
[0099] In this embodiment, the depth dual loss includes two parts: (1) a binary cross-entropy (BCE) loss, which uses the pseudo-depth label to distinguish the depth values of the camouflaged object from those of its surroundings; and (2) a depth smoothing loss, which multiplies the pseudo-RGB-D label with the depth prediction to mask background pixels, keeping the internal structure of the object smooth and reducing high-frequency noise. The expression is:
[0100] [formula missing]
[0101] Specifically:
[0102] [formula missing]
[0103] [formula missing]
[0104] where the symbols denote, in turn: the final prediction map generated by the depth branch; the pseudo-depth label; the normal vector at a pixel p within the object region; the normal vectors at its neighboring pixels q; the cosine similarity between two vectors; the gradient vector in the neighborhood of pixel p; and the gradient of the prediction map at pixel p;
[0105] Specifically:
[0106] [formula missing]
[0107] [formula missing]
[0108] where the two filters are the Sobel operators, and the two values are the masked depth predictions at pixels p and q;
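One possible reading of the depth dual loss is sketched below. The exact formulas are not reproduced in the text, so several choices here are assumptions: normals are built as (-gx, -gy, 1) from Sobel gradients of the masked depth, the smoothness term averages 1 minus the cosine similarity over right and lower neighbor pairs inside the mask, and the two loss terms are added with equal weight.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """3x3 cross-correlation with edge padding."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def unit_normals(depth):
    """Per-pixel unit normals from Sobel gradients (assumed construction)."""
    gx, gy = filter3x3(depth, SOBEL_X), filter3x3(depth, SOBEL_Y)
    n = np.stack([-gx, -gy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def depth_smooth_loss(depth_pred, mask):
    """Mean (1 - cosine similarity) between neighboring normals inside the mask."""
    n = unit_normals(depth_pred * mask)          # mask out background pixels
    cos_right = (n[:, :-1] * n[:, 1:]).sum(-1)
    cos_down = (n[:-1, :] * n[1:, :]).sum(-1)
    m_right = mask[:, :-1] * mask[:, 1:]
    m_down = mask[:-1, :] * mask[1:, :]
    num = ((1 - cos_right) * m_right).sum() + ((1 - cos_down) * m_down).sum()
    return float(num / max(m_right.sum() + m_down.sum(), 1.0))

def bce_loss(pred, target, eps=1e-7):
    p = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def depth_dual_loss(depth_pred, pseudo_depth, pseudo_rgbd):
    """BCE against the pseudo-depth label plus the masked smoothness term."""
    return bce_loss(depth_pred, pseudo_depth) + depth_smooth_loss(depth_pred, pseudo_rgbd)

rng = np.random.default_rng(0)
mask = np.ones((8, 8))
flat = np.full((8, 8), 0.5)
noisy = np.clip(flat + 0.1 * rng.normal(size=(8, 8)), 0.0, 1.0)
smooth_flat = depth_smooth_loss(flat, mask)
smooth_noisy = depth_smooth_loss(noisy, mask)
dual = depth_dual_loss(noisy, mask, mask)
```

A constant depth surface gives a smoothness loss of zero, while high-frequency noise inside the object raises it, which is the behavior the depth smoothing loss is described as penalizing.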
[0109] S50, the detection results of camouflaged targets are output.
[0110] In the technical solution provided in this embodiment, the bimodal features are dynamically fused through the cross-modal adaptive relation fusion module (CARF); then, the RGB and depth pseudo-labels generated and aggregated by the cross-modal pseudo-supervised label mapping module (CPLM) are used to conduct supervised training on the main branch and auxiliary branch using a unified structural loss and depth dual loss, so as to solve the problems of insufficient utilization of multimodal information, large pseudo-label noise, and insufficient fusion of global and local information under weak supervision.
[0111] Verification of Examples
[0112] Based on the first embodiment, this embodiment verifies the detection performance of the weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion proposed in this application:
[0113] First, in the data preparation phase, the S-COD training set introduced by He et al. was used; it is the first dataset specifically tailored to COD tasks under a weakly supervised framework. For testing, four widely used datasets were adopted: CHAMELEON, CAMO, COD10K, and NC4K.
[0114] The initialization of model parameters specifically includes the parameters of the dual-stream encoder, CARF module, and CPLM module.
[0115] Then, PVT-v2-B2 was used as the backbone network, initialized with pre-trained weights, while RGB images and depth maps used separate backbone networks. Under this setup, the overall computational complexity was 182.13 GFLOPs, and the total number of parameters was 53.33M. For comprehensive comparison, we also provided an SPMCNet variant using a ResNet50 backbone. The input image size was resized to 352×352 pixels, and the number of superpixels was set to 100. To reduce overfitting, random data augmentation was employed, including scaling, flipping, rotation, and color enhancement. The model was trained for 50 epochs using the Nadam optimizer with a batch size of 8.
[0116] In this embodiment, four commonly used metrics are used for evaluation: the structure measure, the mean E-measure, the adaptive F-measure, and the mean absolute error.
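Two of these four metrics are simple enough to sketch directly. The code below follows the commonly used definitions, which are assumed here because the text does not give formulas: the adaptive F-measure binarizes the prediction at twice its mean value (capped at 1) and uses a squared beta of 0.3. The structure measure and E-measure are omitted because they are considerably longer.

```python
import numpy as np

def mean_absolute_error(pred, gt):
    """Average per-pixel absolute difference between prediction and ground truth."""
    return float(np.abs(pred - gt).mean())

def adaptive_f_measure(pred, gt, beta2=0.3):
    """F-measure at an image-adaptive threshold of 2 * mean(pred)."""
    threshold = min(2.0 * float(pred.mean()), 1.0)
    binary = pred >= threshold
    positive = gt > 0.5
    tp = np.logical_and(binary, positive).sum()
    if tp == 0:
        return 0.0
    precision = tp / binary.sum()
    recall = tp / positive.sum()
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall))

gt = np.zeros((4, 4))
gt[:2, :2] = 1.0
pred = 0.9 * gt                       # confident, correctly located prediction
mae = mean_absolute_error(pred, gt)
f_adaptive = adaptive_f_measure(pred, gt)
```

In this toy case the threshold is 0.45, so the four predicted pixels match the four ground-truth pixels exactly and the F-measure is 1, while the mean absolute error only reflects the 0.1 confidence gap on those pixels.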
[0117] The following is a quantitative comparison of the model across four benchmark tests. "F" represents a fully supervised method, "U" represents an unsupervised method, "S" represents a scribble-supervised method, and "P" represents a point-supervised method. The best results for both scribble-supervised and point-supervised methods are highlighted in bold.
[0118] Table 1. Quantitative comparison of the model across four benchmark tests
[0119] [table missing]
[0120] Table 1 (continued)
[0121] [table missing]
[0122] Experimental results show that the method proposed in this invention outperforms multiple methods on multiple datasets, verifying its effectiveness and superiority.
[0123] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0124] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A weakly supervised remote sensing image target detection method based on multimodal pseudo-label guidance and adaptive fusion, characterized in that the method includes the following steps:
S10, a depth map corresponding to the acquired remote sensing image is generated through a depth estimation model, and a dual-stream encoder is used to extract RGB features from the remote sensing image and depth features from the depth map, respectively;
S20, the RGB features and the depth features are fused through the cross-modal adaptive relationship fusion module to obtain fused features;
S30, the cross-modal pseudo-supervised label mapping module is used to generate RGB pseudo-labels and depth pseudo-labels, which are aggregated to form pseudo-RGB-D labels;
S40, based on the pseudo-RGB-D labels, a unified structure loss and a depth dual loss are used to perform supervised training on the main branch and the auxiliary branch;
S50, the camouflaged target detection results are output;
S20 includes:
S21, the RGB features and the depth features are convolved to obtain the query vector Q, the key vector K, and the value vector V;
S22, based on the cross-modal spatial view attention branch, the attention score between the feature encoding blocks is calculated and position decoding is performed to obtain the spatial view features: [formula missing]; where the per-head terms are the individual heads of the query vector, key vector, and value vector, respectively; e is the scaling factor [definition missing]; the formula also involves the number of heads, the number of pixel tokens, and the embedding dimension D; the concatenation operator joins all heads; and the output projection matrix of the spatial view fuses the heads;
S23, based on the cross-modal channel view attention branch, the similarity between different feature channels is calculated to obtain the channel view features: [formula missing]; where the formula uses a scaling factor and the output projection matrix of the channel view, which fuses the heads;
S24, the spatial view features and the channel view features are integrated through two learnable weights to generate the cross-modal attention depth feature output: [formula missing]; where the two terms are the output features of the cross-modal spatial view attention branch and the cross-modal channel view attention branch, and the two learnable weights adjust their respective contributions to the cross-modal attention output;
S25, similarly, the cross-modal attention RGB feature output corresponding to the RGB features is determined: [formula missing]; where the output represents the propagation of depth information back to the RGB features; its two terms are the spatial view and channel view attention outputs that carry depth information into the RGB feature stream, and two additional learnable weights adjust their respective contributions;
S26, each original feature map is enhanced with its corresponding cross-modal attention output, and the enhanced feature maps are concatenated to generate the final fused feature: [formula missing]; where Cat(.,.) denotes the feature map concatenation operation, and its two inputs are built from the original depth feature map and the original RGB feature map extracted by the dual-stream encoder;
S30 includes:
S31, the image is divided into superpixel regions of a preset size; wherein, if the number of foreground pixels in a superpixel region exceeds the number of background pixels, and the number of foreground pixels is greater than 1, then the superpixel is marked as foreground; otherwise, it is marked as background;
S32, the sparse scribble annotation Y is propagated to the superpixel regions generated from the RGB image and to the superpixel regions generated from the depth map, respectively; a region containing more foreground scribble pixels than background scribble pixels is labeled as foreground, and otherwise as background, which yields a pseudo-RGB label based on RGB texture boundaries and a pseudo-depth label based on depth structure boundaries;
S33, a pixel-adaptive refinement method is applied to smooth and refine the pseudo-RGB label and the pseudo-depth label, which are then aggregated to form the pseudo-RGB-D label: [formula missing].
2. The method as described in claim 1, characterized in that S40 includes:
S41, a unified structure loss is employed to supervise the output of the main branch: [formula missing]; where the first term is the binary cross-entropy loss, used to optimize pixel-level accuracy between the predicted output of the i-th layer and the aggregated pseudo-RGB-D label; the second term is the intersection-over-union loss, used to optimize region overlap, focusing on the overall degree of overlap between the predicted region and the pseudo-label that serves as the ground-truth region; the third term is the structural similarity loss, which keeps the prediction structurally consistent with the pseudo-label; and i and ∑ denote the index and the summation over the multi-scale deep supervision outputs;
S42, a depth dual loss is employed to supervise the auxiliary branch: [formula missing]; where the first term is the binary cross-entropy loss, which uses the pseudo-depth label to distinguish the depth values of the camouflaged object from those of its surroundings; and the second term is the depth smoothing loss, which multiplies the pseudo-RGB-D label with the depth prediction to mask background pixels, keeping the internal structure of the object smooth and reducing high-frequency noise; specifically: [formula missing]; [formula missing]; where the symbols denote, in turn: the final prediction map generated by the depth branch; the pseudo-depth label; the normal vector at a pixel p within the object region; the normal vectors at its neighboring pixels q; the cosine similarity between two vectors; the gradient vector in the neighborhood of pixel p; and the gradient of the prediction map at pixel p; specifically: [formula missing]; [formula missing]; where the two filters are the Sobel operators, and the two values are the masked depth predictions at pixels p and q;
S43, supervised training is performed on the main branch and the auxiliary branch based on the unified structure loss and the depth dual loss.