RGB-D saliency detection method based on progressive weighted decoding

By combining multi-scale self-attention feature enhancement and cross-modal feature fusion with progressive weighted decoding, the problem of insufficient multi-scale feature extraction and information interaction in RGB-D saliency detection is solved, thereby improving detection accuracy and completeness.

CN118587449B · Active · Publication Date: 2026-06-19 · JIANGNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: JIANGNAN UNIV
Filing Date: 2024-06-07
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing RGB-D saliency detection methods tend to ignore differences at different scales during feature extraction, making it difficult to effectively extract multi-scale features. Furthermore, the interaction of modal feature information is insufficient during the decoding process, leading to decreased detection performance and information loss.

Method used

A multi-scale self-attention feature enhancement module and a cross-modal feature fusion module are adopted, combined with a progressive weighted fusion decoder. Multi-scale features are captured through a self-attention mechanism, and different modal feature information is fused by dynamic weights and decoded layer by layer to improve detection accuracy.

Benefits of technology

It effectively extracts multi-scale features, enhances information interaction, and improves the accuracy and completeness of saliency detection, solving the problems of incomplete detection and low accuracy in existing methods.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses an RGB-D saliency detection method based on progressively weighted decoding, belonging to the field of computer vision. The method includes: extracting multi-level RGB image features and depth image features through a symmetrical dual-stream feature extraction backbone network; enhancing the extracted high-level features using a multi-scale self-attention feature enhancement module; subsequently fusing the RGB branch features and depth branch features using a cross-modal feature fusion module to enhance cross-modal information interaction between different branches; then decoding the fused features layer by layer using a progressively weighted fusion decoder to fully integrate and refine the feature information of different modalities; finally, using a hybrid loss function to supervise the model training, optimizing the initial saliency map, and completing RGB-D saliency detection based on progressively weighted decoding. Experimental results demonstrate that this invention has high target detection accuracy in various scenarios.

Description

Technical Field

[0001] This invention relates to an RGB-D saliency detection method based on progressive weighted decoding, belonging to the field of computer vision.

Background Technology

[0002] Salient Object Detection (SOD) is an important task in computer vision. SOD aims to mimic the human eye's ability to automatically focus on the most prominent objects in a scene, identifying and segmenting the most salient objects or regions in an image. It is widely used in many vision tasks, such as image segmentation, visual tracking, remote sensing, biomedical applications, and maritime and land traffic detection. Early saliency detection methods relied solely on RGB images, using the rich color and texture information they provide to extract visually salient regions. However, in scenes with complex backgrounds or low contrast between the target and the background, single-modal feature information is insufficient for saliency detection. With advances in hardware and deep learning algorithms, depth image acquisition devices have become widely available and depth information has become easier to obtain, leading to the development of RGB-D saliency detection methods. By incorporating the depth image corresponding to the RGB image, the model obtains richer spatial priors and geometric information during detection, resulting in better performance.

[0003] With the rapid development of deep learning and the significant improvement in computing power in recent years, convolutional neural networks (CNNs) have gained considerable attention in the field of computer vision. CNNs have also achieved good results in RGB-D saliency detection. However, CNNs extract feature information from the pixel neighborhood through convolution operations, making it difficult to learn global semantic information; furthermore, their commonly used pooling operations tend to lose spatial information, which can negatively impact the final detection performance.

[0004] Most existing RGB-D saliency detection methods tend to ignore differences between scales during feature extraction, making it difficult to effectively extract multi-scale features. This leads to a decline in detection performance when dealing with small-scale or multi-scale targets in the scene. Furthermore, many methods suffer from insufficient modal feature information exchange during decoding, resulting in information loss and incomplete salient target detection, as well as reduced prediction accuracy.

Summary of the Invention

[0005] To further improve the detection accuracy of salient target detection methods, this invention provides an RGB-D saliency detection algorithm based on progressive weighted decoding, the technical solution of which is as follows:

[0006] The first objective of this invention is to provide a salient target detection method, comprising:

[0007] Step 1: Obtain the RGB image and corresponding depth image of the target to be detected;

[0008] Step 2: Extract multi-level RGB image features and depth image features using a symmetrical two-stream feature extraction backbone network, denoted r_i and d_i respectively, i∈{1,2,…,N}, where i is the index of the extracted feature layer and N is the total number of feature layers extracted;

[0009] Step 3: Use the Multi-scale Self-attention Feature Enhancement (MSFE) module to process the deepest features r_N and d_N extracted in Step 2, obtaining the enhanced high-level features r_MSFE and d_MSFE;

[0010] Step 4: Use the Cross-modal Feature Fusion (CFF) module to fuse the high-level features r_MSFE and d_MSFE, and the features r_i and d_i of the first N-1 layers, i∈{1,2,…,N-1}, respectively, obtaining the fused features f_i, i∈{1,2,…,N};

[0011] Step 5: Use the Progressive Weighted Fusion Decoder (PWF) to decode the features in the RGB branch, depth branch, and fusion branch layer by layer, obtaining the predicted saliency image Sr of the RGB branch, the predicted saliency image Sd of the depth branch, and the predicted saliency image Sf of the fusion branch;

[0012] Step 6: Take the predicted saliency image Sf of the fusion branch as the result of salient target detection.

[0013] Optionally, step 5 includes:

[0014] Step 51: Using symmetrically arranged Single-modality Feature Aggregation (SFA) modules, preliminary aggregation is performed on the single-modality features of the RGB branch and of the depth branch respectively. The output of the previous SFA module becomes the input of the next SFA module, ultimately obtaining the predicted saliency image Sr of the RGB branch and the predicted saliency image Sd of the depth branch;

[0015] Step 52: The RGB branch features and depth branch features aggregated by the SFA modules, together with the fusion branch features, are uniformly fused and decoded through the Dynamic Weighted Fusion (DWF) modules. Dynamic weights k generated by the residual channel attention operation are used to combine the feature information of different modalities according to its importance. The output of each DWF module becomes the input of the next DWF module, completing the progressive fusion and obtaining the initial saliency image Sf.

[0016] Optionally, the processing of the features r_N and d_N by the MSFE module in step 3 includes:

[0017] Step 31: Process the input features r_N and d_N with a 3×3 convolutional layer with BatchNorm normalization and PReLU activation;

[0018] Step 32: Input the output of Step 31 into five parallel convolutional branches to extract multi-scale features; the first branch uses global average pooling to extract global information from the high-level features, followed by a Conv1×1 convolution and a linear upsampling function to adjust the channels and size; the last branch uses a 1×1 convolution with BatchNorm normalization and PReLU activation; in the three middle branches, spatially separable convolutions with kernel sizes of 3, 5 and 7 are followed by dilated convolutions with dilation rates of 3, 5 and 7 respectively to capture multi-scale features, and a self-attention mechanism is added after the dilated convolution to extract contextual information;

[0019] Step 33: Combine the outputs of the five branches through a concatenation operation to obtain the enhanced high-level features r_MSFE and d_MSFE.

[0020] Optionally, the processing procedure of the CFF module in step 4 includes:

[0021] Step 41: Multiply the input RGB features and depth features, and use the spatial attention mechanism (SA) to calculate the common spatial attention feature f_sa;

[0022] Step 42: Multiply the original RGB features and depth features by f_sa respectively, pass the weighted features on through residual connections, and then input the results into the channel attention mechanism (CA) respectively;

[0023] Step 43: Multiply the features obtained in Step 42 by the original RGB features and depth features respectively to obtain the RGB channel calibration feature r_ca and the depth channel calibration feature d_ca;

[0024] Step 44: Perform element-wise addition, element-wise multiplication, and channel concatenation on the RGB channel calibration feature r_ca and the depth channel calibration feature d_ca, and then obtain the fusion weight m by normalization using the Sigmoid function;

[0025] Step 45: Combine the two branch features using the fusion weight m, and obtain the fusion feature f_i, i∈{1,2,…,N}, through a skip multiplication connection with a channel attention mechanism (CA).

[0026] Optionally, the processing procedure of the SFA module includes: performing a linear upsampling operation on the output features of the previous SFA module, then concatenating them with the features of the corresponding layer using channel concatenation, and finally inputting the concatenation result into a 3×3 convolution with a BatchNorm normalization layer and a ReLU activation layer to obtain modality aggregation features.

[0027] Optionally, the processing procedure for the i-th DWF module includes:

[0028] The input RGB aggregation feature fr_i and depth aggregation feature fd_i are channel-concatenated and then processed by a Conv3×3 convolutional layer for channel dimensionality reduction, resulting in the fused RGB-D decoder feature f_rd;

[0029] The feature ff_{i+1} output by the previous DWF module is concatenated with the corresponding fusion feature f_i and input into a Conv3×3 convolutional layer to obtain the feature f_s; the features f_rd and f_s are then concatenated, and the concatenation result is input into a Conv1×1 convolution and a residual structure containing a channel attention mechanism;

[0030] A learnable dynamic weight k is obtained through a Bconv3×3 convolutional layer and a sigmoid layer; the RGB-D decoder feature f_rd and the previous-level fusion feature f_s are then weighted and combined using the dynamic weight k, and passed through two consecutive Bconv3×3 convolutional layers and a linear upsampling layer to output the result ff_i.

[0031] Optionally, the method uses a hybrid loss function to train the network, which is obtained by adding the cross-entropy loss function, the intersection-over-union (IoU) loss function, and the structural similarity loss function;

[0032] The total loss function of the network is:

[0033] Loss=L(Sr,GT)+L(Sd,GT)+L(Sf,GT)

[0034] Where L(Sr,GT), L(Sd,GT), and L(Sf,GT) represent the mixed loss functions of the RGB branch, the depth branch, and the fusion branch, respectively, GT represents the corresponding standard ground truth image, and Sr, Sd, and Sf represent the salient images predicted by the three branches, respectively.

[0035] Optionally, the dual-stream feature extraction backbone network uses the Swin Transformer network.

[0036] A second objective of this invention is to provide an electronic device, including a memory and a processor;

[0037] The memory is used to store computer programs;

[0038] The processor is configured to implement the salient target detection method as described above when executing the computer program.

[0039] A third objective of the present invention is to provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the salient target detection method as described in any of the preceding claims.

[0040] The beneficial effects of this invention are:

[0041] (1) Existing RGB-D saliency detection methods based on encoder-decoder often suffer from insufficient interaction of modal feature information during the decoding process, leading to information loss. This results in incomplete targets and blurred edges in the final prediction results. This invention proposes a progressive weighted fusion decoder that integrates features at each level from deep to shallow and from bottom to top to achieve progressive decoding. During the fusion process within the module, the decoder uses dynamic weights k generated through operations such as residual channel attention to dynamically combine feature information from different modalities according to their importance. This fully connects local and global features, which is beneficial for enhancing information interaction between different modalities and amplifying contributing features, thereby improving the final decoding effect and detection accuracy.

[0042] (2) Most existing RGB-D saliency detection methods tend to ignore differences at different scales during feature extraction, making it difficult to effectively extract multi-scale features, resulting in low detection accuracy for multi-target and small-target objects. To address this problem, this invention proposes a multi-scale self-attention feature enhancement module, which can use a parallel branch structure to capture receptive fields at different scales, simulating the effect of human visual receptive fields, effectively extracting multi-scale features of salient objects, and ultimately improving the accuracy of saliency detection.

[0043] (3) This invention utilizes a cross-modal feature fusion module that applies spatial attention and channel attention mechanisms to the RGB features and depth features extracted by the backbone network, enhancing the semantic information of the RGB features and the spatial location information of the depth features, and ultimately achieving cross-modal fusion. This enhancement-then-fusion approach leverages the advantages of the different modalities, makes fuller use of the modal features, and provides more valuable information for subsequent decoding.

[0044] (4) In one embodiment of the present invention, the Swin Transformer is used as the backbone network to extract features. It combines the advantages of the Transformer in global modeling and capturing long-distance contextual information with the translation invariance and hierarchical structure of CNNs. It can better complete feature extraction in the encoding stage and addresses the problem that existing detection methods based on convolutional neural networks (CNNs) are prone to losing spatial information due to pooling operations and have difficulty learning global long-distance semantic information.

Attached Figure Description

[0045] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0046] Figure 1 is a flowchart of the RGB-D saliency detection method based on progressive weighted decoding of the present invention.

[0047] Figure 2 is a schematic diagram of the network structure of the RGB-D saliency detection method based on progressive weighted decoding of the present invention.

[0048] Figure 3 is a schematic diagram of the structure of the Multi-Scale Self-Attention Feature Enhancement Module (MSFE).

[0049] Figure 4 is a schematic diagram of the cross-modal feature fusion module (CFF).

[0050] Figure 5 is a schematic diagram of the progressive weighted fusion decoder (PWF).

[0051] Figure 6 is a schematic diagram of the Single Modal Feature Aggregation Module (SFA).

[0052] Figure 7 is a schematic diagram of the Dynamic Weighted Fusion Module (DWF).

[0053] Figure 8 is a comparison chart of the results of this invention with other RGB-D saliency detection methods.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

[0055] Example 1:

[0056] This embodiment provides a salient target detection method, including:

[0057] Step 1: Obtain the RGB image and corresponding depth image of the target to be detected;

[0058] Step 2: Extract multi-level RGB image features and depth image features using a symmetrical two-stream feature extraction backbone network, denoted r_i and d_i respectively, i∈{1,2,…,N}, where i is the index of the extracted feature layer and N is the total number of feature layers extracted;

[0059] Step 3: Use the Multi-Scale Self-Attention Feature Enhancement (MSFE) module to process the deepest features r_N and d_N extracted in Step 2, obtaining the enhanced high-level features r_MSFE and d_MSFE;

[0060] Step 4: Use the cross-modal feature fusion module CFF to fuse the high-level features r_MSFE and d_MSFE, and the features r_i and d_i of the first N-1 layers, i∈{1,2,…,N-1}, respectively, obtaining the fused features f_i, i∈{1,2,…,N};

[0061] Step 5: Use the Progressive Weighted Fusion Decoder (PWF) to decode the features in the RGB branch, depth branch, and fusion branch layer by layer, obtaining the predicted saliency image Sr of the RGB branch, the predicted saliency image Sd of the depth branch, and the predicted saliency image Sf of the fusion branch;

[0062] Step 6: Take the predicted saliency image Sf of the fusion branch as the result of salient target detection.
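The data flow of Steps 2 to 6 can be sketched as a small orchestration function. This is only an illustrative skeleton: the function name `detect_salient`, its parameters, and the decision to pass the enhanced features to the decoder as separate arguments are assumptions made for this sketch, and every module is supplied by the caller as a plain callable rather than a trained network.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")  # any feature representation (tensor, array, ...)


def detect_salient(
    rgb: T,
    depth: T,
    backbone_rgb: Callable[[T], Sequence[T]],
    backbone_depth: Callable[[T], Sequence[T]],
    msfe_rgb: Callable[[T], T],
    msfe_depth: Callable[[T], T],
    cff: Callable[[T, T], T],
    pwf: Callable[[List[T], List[T], List[T], T, T], Tuple[T, T, T]],
) -> T:
    """Hypothetical wiring of Steps 2-6; all modules are caller-supplied."""
    # Step 2: multi-level features r_1..r_N and d_1..d_N (list index 0 is level 1).
    r = list(backbone_rgb(rgb))
    d = list(backbone_depth(depth))
    if len(r) != len(d):
        raise ValueError("the two streams must yield the same number of levels")

    # Step 3: only the deepest level N is enhanced by MSFE.
    r_msfe = msfe_rgb(r[-1])
    d_msfe = msfe_depth(d[-1])

    # Step 4: levels 1..N-1 fuse r_i with d_i; level N fuses the enhanced pair.
    f = [cff(r_i, d_i) for r_i, d_i in zip(r[:-1], d[:-1])]
    f.append(cff(r_msfe, d_msfe))

    # Step 5: the decoder returns one prediction per branch (Sr, Sd, Sf).
    s_r, s_d, s_f = pwf(r, d, f, r_msfe, d_msfe)

    # Step 6: the fusion-branch prediction is the detection result.
    return s_f
```

Calling it with string stand-ins for the modules makes the indexing visible: the fused list contains `cff(r_i, d_i)` for the first N-1 levels and `cff(msfe(r_N), msfe(d_N))` for the last.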

[0063] Example 2:

[0064] This embodiment takes the extraction of four layers of features using a dual-stream feature extraction backbone network as an example, and provides an RGB-D saliency detection method based on progressive weighted decoding. The specific implementation process is as follows:

[0065] S1: Obtain the RGB-D image dataset, including a training set and a test set consisting of RGB images and their corresponding depth images.

[0066] This embodiment uses commonly used public datasets for RGB-D saliency detection, including the NJU2K dataset, NLPR dataset, STERE dataset, SSD dataset, SIP dataset, etc. Each sample in these datasets comprises an RGB image, a depth image, and a ground truth image of the salient object.

[0067] S2: Input the RGB image and its corresponding depth image into the saliency detection model. A symmetrical dual-stream feature extraction backbone network extracts multi-level RGB image features and depth image features, denoted r_i and d_i (i∈{1,2,3,4}, where i is the index of the extracted feature layer).

[0068] This embodiment uses the Swin Transformer network as the feature extraction backbone network, specifically the Swin-Base-window12 version. The input image size is 384×384. Two symmetrically arranged Swin Transformer backbone networks form the dual-stream structure and extract RGB and depth image features from the input images, respectively. The first to fourth feature extraction layers of the Swin Transformer network yield the RGB image features r_i and depth image features d_i (i∈{1,2,3,4}) at the first to fourth scale levels.
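For orientation, the feature-map sizes of the four levels can be worked out from the 384×384 input. The patch size of 4, the base embedding width of 128, and the halving of resolution with doubling of channels between stages are taken from the standard published Swin-Base configuration, not from this text, so the numbers below should be read as an assumption about the backbone rather than as part of the disclosed method. The function name `swin_stage_shapes` is hypothetical.

```python
def swin_stage_shapes(image_size=384, patch_size=4, embed_dim=128, num_stages=4):
    """Return (channels, height, width) for the output of each backbone stage.

    Assumes the standard Swin layout: a patch embedding that divides the
    resolution by patch_size, then a patch-merging step before every later
    stage that halves the resolution and doubles the channel count.
    """
    shapes = []
    size = image_size // patch_size
    channels = embed_dim
    for stage in range(num_stages):
        if stage > 0:
            size //= 2
            channels *= 2
        shapes.append((channels, size, size))
    return shapes


for level, (c, h, w) in enumerate(swin_stage_shapes(), start=1):
    print(f"r_{level}, d_{level}: {c} x {h} x {w}")
# r_1, d_1: 128 x 96 x 96
# r_2, d_2: 256 x 48 x 48
# r_3, d_3: 512 x 24 x 24
# r_4, d_4: 1024 x 12 x 12
```

Under these assumptions the deepest features r_4 and d_4 that enter the MSFE module have a spatial size of 12×12.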

[0069] S3: The deepest features r_4 and d_4 extracted in S2 are input into the multi-scale self-attention feature enhancement module MSFE. Spatially separable convolution captures the multi-scale features of the object, and the self-attention mechanism further expands the receptive field and establishes long-distance dependencies globally, yielding the enhanced high-level features r_MSFE and d_MSFE, which contain multi-scale information.

[0070] The structure of the MSFE module is shown in Figure 3. Its operations include global average pooling, linear upsampling, spatially separable convolution, dilated convolution, and a self-attention mechanism. Its input is the high-level feature extracted by the dual-stream backbone network, namely the fourth-layer feature r_4 of the RGB branch or d_4 of the depth branch. The feature first passes through a 3×3 convolutional layer with BatchNorm normalization and PReLU activation and is then fed into five parallel convolutional branches to extract multi-scale features.

[0071] The first branch uses global average pooling to extract global information from the high-level features, followed by a Conv1×1 convolution and a linear upsampling function to adjust the channels and size. The last branch uses a 1×1 convolution with BatchNorm normalization and PReLU activation. In the three middle branches, spatially separable convolutions with kernel sizes of 3, 5, and 7 are followed by dilated convolutions with dilation rates of 3, 5, and 7, respectively, to capture multi-scale features. A self-attention mechanism is added after the dilated convolutions to extract contextual information. The outputs of the five branches are then fused through a concatenation operation to obtain the enhanced high-level features r_MSFE and d_MSFE.
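The way the three middle branches cover different scales can be illustrated with a short receptive-field calculation. The kernel size of the dilated convolution is not stated in this text; the sketch assumes a 3×3 kernel with stride 1, and the helper names are hypothetical. It uses the standard relation that a kernel of size k with dilation d spans d·(k−1)+1 input positions.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span of a dilated kernel along one axis: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def branch_receptive_field(sep_kernel: int, dilation: int, dilated_kernel: int = 3) -> int:
    """Receptive field (one axis) of a separable conv followed by a dilated conv.

    A 1xk plus kx1 separable pair covers k positions per axis. With stride 1,
    each stacked layer enlarges the receptive field by (its span - 1).
    dilated_kernel=3 is an assumption, not a value given in the text.
    """
    rf = 1
    for span in (sep_kernel, effective_kernel(dilated_kernel, dilation)):
        rf += span - 1
    return rf


for k, d in [(3, 3), (5, 5), (7, 7)]:
    print(k, d, branch_receptive_field(k, d))
# 3 3 9
# 5 5 15
# 7 7 21
```

Under this assumption the three branches see 9×9, 15×15, and 21×21 neighborhoods before self-attention is applied, which is consistent with the stated goal of capturing receptive fields at different scales.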

[0072] S4: The first three layers of RGB features and depth features extracted by the backbone network in S2, along with the high-level features enhanced by the MSFE module in S3, are sequentially input into the cross-modal feature fusion module (CFF) for fusion to obtain the fused features f_i (i∈{1,2,3,4}).

[0073] Specifically, the structure of a CFF module is shown in Figure 4. The CFF module includes operations such as element-wise multiplication, element-wise addition, spatial attention, channel attention, and channel concatenation. The inputs to the CFF module are the RGB features r_i and depth features d_i, i∈{1,2,3}, extracted from the first three layers of the backbone network, and the high-level features r_MSFE and d_MSFE enhanced by the MSFE module.

[0074] The CFF module first multiplies the input RGB features r_i and depth features d_i and uses the spatial attention mechanism SA to calculate their common spatial attention feature f_sa. The original RGB features and depth features are each multiplied by f_sa, and the weighted features are passed on through residual connections. The results are then fed into the channel attention mechanism (CA) and multiplied with the original RGB features and depth features respectively to obtain the RGB channel calibration feature r_ca and the depth channel calibration feature d_ca. After r_ca and d_ca undergo element-wise addition, element-wise multiplication, and channel concatenation, the fusion weight m is obtained by normalization using the Sigmoid function. The two branch features are then combined using the fusion weight m and passed through a skip multiplication connection with a channel attention mechanism (CA) to obtain the fusion feature f_i.

[0075] S5: Features from the RGB branch, depth branch, and fusion branch are sequentially input into the progressive weighted fusion decoder (PWF) for layer-by-layer decoding to obtain the initial predicted saliency maps. The structure of the progressive weighted fusion decoder is shown in Figure 5. It mainly consists of single-modal feature aggregation modules (SFA) and dynamic weighted fusion modules (DWF).

[0076] The specific decoding method is as follows. Symmetrically arranged SFA modules preliminarily aggregate the single-modal features of the RGB branch and of the depth branch respectively. The output of the previous SFA module becomes the input of the next SFA module, so that high-level features are progressively integrated into low-level features, providing modal features of the corresponding level for each subsequent layer of the decoder and finally outputting the initial saliency maps Sr and Sd based on the single-modal features. The dynamic weighted fusion module DWF performs unified fusion decoding of the RGB branch features and depth branch features aggregated by the SFA modules together with the fusion branch features. During the fusion process within the module, dynamic weights k generated through operations such as residual channel attention are used to dynamically combine the feature information of different modalities according to its importance. The output of each DWF module becomes the input of the next DWF module, completing progressive fusion and obtaining the initial saliency image Sf.

[0077] At the end of the RGB branch, depth branch, and dynamic weighted fusion branch of the decoder, that is, in the last two SFA modules and the last DWF module arranged symmetrically, the initial saliency images are predicted by channel-wise dimensionality reduction through Conv3×3 convolutional layers, denoted as Sr, Sd, and Sf respectively. Finally, the predicted saliency map Sf of the dynamic weighted fusion branch is selected as the final saliency prediction result.
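The deep-to-shallow control flow of the decoder can be sketched as a loop. This is a structural illustration only: `progressive_decode` and its parameters are hypothetical names, a single `sfa`, `dwf`, and `predict` callable stands in for what would be separate per-level and per-branch modules with their own parameters, and resolution handling is left to the callables.

```python
def progressive_decode(r, d, f, r_msfe, d_msfe, sfa, dwf, predict):
    """Decode from the deepest level to the shallowest one.

    r, d, f  : per-level RGB, depth, and fused features (index 0 is level 1)
    sfa(prev, cur)             -> aggregated single-modality feature
    dwf(fr_i, fd_i, f_i, prev) -> fused decoder feature (prev is None at level N)
    predict(x)                 -> saliency map from a final decoder feature
    """
    # For the deepest level, the "previous" SFA outputs are the MSFE features.
    fr_prev, fd_prev = r_msfe, d_msfe
    ff_prev = None  # there is no earlier DWF output at the deepest level
    for i in range(len(f) - 1, -1, -1):
        fr_i = sfa(fr_prev, r[i])              # RGB branch aggregation
        fd_i = sfa(fd_prev, d[i])              # depth branch aggregation
        ff_i = dwf(fr_i, fd_i, f[i], ff_prev)  # dynamic weighted fusion
        fr_prev, fd_prev, ff_prev = fr_i, fd_i, ff_i
    # One prediction per branch; the fusion-branch map is the final result.
    return predict(fr_prev), predict(fd_prev), predict(ff_prev)
```

Each DWF call consumes the SFA outputs of the same level, so the two single-modality chains and the fusion chain advance in lockstep from level N to level 1.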

[0078] The structure of the SFA module is shown in Figure 6. The inputs to the SFA module are the outputs r_i and d_i (i∈{1,2,3,4}) of each layer of the RGB feature extraction branch and the depth feature extraction branch of the backbone network, together with the output fr_{i+1} or fd_{i+1} of the previous SFA module. In particular, when i = 4, fr_{i+1} and fd_{i+1} denote the high-level features r_MSFE and d_MSFE output by the feature enhancement module. Taking the RGB feature branch as an example, the output fr_{i+1} of the previous feature aggregation module is linearly upsampled and then channel-concatenated with the feature r_i of the corresponding layer to preliminarily aggregate the feature information of adjacent layers. The result is then input into a Conv1×1 convolutional layer and a Bconv3×3 convolutional layer (a 3×3 convolution with BatchNorm and ReLU activation) to obtain the RGB modality aggregation feature fr_i.

[0079] The single-modal feature aggregation process is as follows:

[0080]

[0081] Here, the upsampling operator denotes a linear upsampling operation; cat denotes channel concatenation; BConv_{3×3} denotes a 3×3 convolution with BatchNorm and ReLU activation; r_i and d_i denote the RGB features and depth features of the i-th layer, respectively; fr_i and fd_i denote the RGB modality aggregation features and depth modality aggregation features of the i-th layer, respectively. In particular, when i = 4, fr_{i+1} and fd_{i+1} denote the high-level features r_MSFE and d_MSFE output by the feature enhancement module.

[0082] The structure of the DWF module is shown in Figure 7. Its inputs are the fused feature f_i output by the CFF module and the RGB aggregation feature fr_i and depth aggregation feature fd_i output by the corresponding SFA modules. Taking the i-th DWF module as an example, the input RGB aggregation feature fr_i and depth aggregation feature fd_i are channel-concatenated and then passed through a Conv3×3 convolutional layer for channel dimensionality reduction to obtain the fused RGB-D decoder feature f_rd. Next, the feature ff_{i+1} output by the previous DWF module is concatenated with the corresponding fusion feature f_i and input into a Conv3×3 convolutional layer to obtain the feature f_s:

[0083] f_rd = Conv_{3×3}(cat(fr_i, fd_i)), i = 4, 3, 2, 1 (3)

[0084] f_s = Conv_{3×3}(cat(f_i, ff_{i+1})), i = 4, 3, 2, 1 (4)

[0085] Here, cat denotes channel concatenation, Conv_{3×3} denotes a 3×3 convolution, ff_{i+1} denotes the output of the previous DWF module, and f_i denotes the fusion feature of the i-th layer. In particular, when i = 4, there is no ff_{i+1}.

[0086] The features f_rd and f_s are concatenated, and the concatenation result is input into a Conv1×1 convolution and a residual structure containing a channel attention mechanism; a learnable dynamic weight k is then obtained through a Bconv3×3 convolutional layer and a sigmoid layer.

[0087] f_ca = Conv_{1×1}(cat(f_rd, f_s)) (5)

[0088]

[0089] Here, Conv_{1×1} denotes a 1×1 convolution, CA denotes the channel attention operation, ⊗ denotes element-wise multiplication, and σ denotes the sigmoid activation function.

[0090] Finally, the previously obtained RGB-D decoder feature f_rd and the previous-level fusion feature f_s are weighted and combined using the dynamic weight k, and then passed through two consecutive Bconv3×3 convolutional layers and a linear upsampling layer to output the result ff_i:

[0091]

[0092] Here, the upsampling operator denotes a linear upsampling operation that doubles the spatial size of its input, and k denotes the dynamic weight.
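The role of the dynamic weight k can be illustrated with a toy calculation on flat lists of values. The exact combination formula is not reproduced in this text, so the sketch assumes one common form, a convex blend k·f_rd + (1−k)·f_s applied element by element; that form, the function names, and the use of raw `gate_logits` in place of the Bconv3×3 output are all assumptions for illustration. The subsequent convolutions and upsampling are omitted.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dynamic_weighted_combine(f_rd, f_s, gate_logits):
    """Blend two feature vectors with a per-element weight k = sigmoid(logit).

    Assumed form: k * f_rd + (1 - k) * f_s. A k near 1 favours the RGB-D
    decoder feature f_rd; a k near 0 favours the fusion feature f_s.
    """
    if not (len(f_rd) == len(f_s) == len(gate_logits)):
        raise ValueError("all inputs must have the same length")
    k = [sigmoid(z) for z in gate_logits]
    return [k_j * a + (1.0 - k_j) * b for k_j, a, b in zip(k, f_rd, f_s)]


# A logit of 0 gives k = 0.5, i.e. a plain average of the two features.
print(dynamic_weighted_combine([1.0, 2.0], [3.0, 4.0], [0.0, 0.0]))
# [2.0, 3.0]
```

Because k is produced from the features themselves, the balance between the two inputs can differ at every position and channel instead of being fixed in advance.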

[0093] S6: For the predicted salient images Sr, Sd, and Sf of the three branches of RGB, depth, and dynamic weighted fusion in the progressive weighted fusion decoder, the loss is calculated using a hybrid loss function, and supervision is implemented through the corresponding ground truth images to complete the optimized training of the network model.

[0094] The hybrid loss function L is obtained by adding the cross-entropy loss function L_bce, the intersection-over-union loss function L_iou, and the structural similarity loss function L_ssim. The specific calculation is as follows:

[0095] L(S, GT) = L_bce(S, GT) + L_iou(S, GT) + L_ssim(S, GT) (8)

[0096] Here, L denotes the hybrid loss function, L_bce denotes the BCE cross-entropy loss function, L_iou denotes the IoU intersection-over-union loss function, L_ssim denotes the SSIM structural similarity loss function, S denotes the saliency image predicted by the network, and GT denotes the corresponding standard ground truth image.

[0097] Calculating the total network loss requires applying a hybrid loss function to the final predicted image of each branch, and supervising the process using the corresponding ground truth image to accelerate network convergence. The method for calculating the total network loss is as follows:

[0098] Loss=L(Sr,GT)+L(Sd,GT)+L(Sf,GT) (9)

[0099] Here, Loss represents the total loss function of the network.
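Equations (8) and (9) can be illustrated with a minimal pure-Python sketch operating on flattened saliency maps (lists of values in [0, 1]). The function names are hypothetical. Two simplifications are assumed and should not be read as the method's actual implementation: the IoU term uses the common soft formulation, and the SSIM term is computed from global image statistics with the usual constants for a unit dynamic range, whereas SSIM is normally evaluated over local windows.

```python
import math


def bce_loss(pred, gt, eps=1e-7):
    """Mean binary cross-entropy; predictions are clamped away from 0 and 1."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)
        total -= g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
    return total / len(pred)


def iou_loss(pred, gt, eps=1e-7):
    """Soft IoU loss: 1 - intersection / union."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(pred) + sum(gt) - inter
    return 1.0 - (inter + eps) / (union + eps)


def ssim_loss(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - SSIM, using global means and variances (a simplification)."""
    n = len(pred)
    mu_p, mu_g = sum(pred) / n, sum(gt) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_g = sum((g - mu_g) ** 2 for g in gt) / n
    cov = sum((p - mu_p) * (g - mu_g) for p, g in zip(pred, gt)) / n
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)
    )
    return 1.0 - ssim


def hybrid_loss(pred, gt):
    """Equation (8): L = L_bce + L_iou + L_ssim."""
    return bce_loss(pred, gt) + iou_loss(pred, gt) + ssim_loss(pred, gt)


def total_loss(s_r, s_d, s_f, gt):
    """Equation (9): the hybrid loss summed over the three branch outputs."""
    return hybrid_loss(s_r, gt) + hybrid_loss(s_d, gt) + hybrid_loss(s_f, gt)
```

For a ground truth such as `[1.0, 0.0, 1.0, 0.0]`, a confident correct prediction like `[0.9, 0.1, 0.9, 0.1]` yields a much smaller `hybrid_loss` than an uninformative `[0.5, 0.5, 0.5, 0.5]`, since all three terms penalize the latter.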

[0100] The overall saliency detection network is then trained and optimized. The hybrid loss function is applied to the initial saliency maps output by the three branches of the decoder, and the resulting losses are summed. The model is trained with the Adam optimizer, with the batch size set to 8 and the initial learning rate set to 5e-5. The learning rate is divided by 10 every 100 epochs, and the model is trained for 150 epochs to complete RGB-D saliency detection based on progressive weighted decoding.
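The stated step schedule can be written out explicitly. The sketch below assumes zero-based epoch counting, so that the first reduction takes effect at epoch index 100; the function name and parameter names are hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 5e-5, step: int = 100, factor: float = 10.0) -> float:
    """Step decay: divide the base rate by `factor` once per `step` epochs."""
    return base_lr / (factor ** (epoch // step))


# Learning rate for each of the 150 training epochs (indices 0..149).
schedule = [learning_rate(epoch) for epoch in range(150)]
```

With 150 epochs in total, the schedule therefore contains a single reduction: the first 100 epochs run at 5e-5 and the remaining 50 at one tenth of that value.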

[0101] Example 3:

[0102] To demonstrate the effectiveness and superiority of the saliency detection method of this invention, this embodiment compares its detection results with those of eight other advanced RGB-D saliency detection algorithms: four CNN-based algorithms (D3Net, SPNet, HAINet, SPSN) and four Transformer-based algorithms (VST, TriTransNet, SwinNet, CAVER). The comparison results are shown in Figure 8.

[0103] First, (a) and (b) show cases where the foreground and background are similar and the contrast is low. In (a), the sculpture base closely resembles the background wall, which causes missed detections in existing methods and prevents them from detecting the salient target as a whole. The method proposed in this invention, however, detects the salient target completely.

[0104] (c) and (d) illustrate situations with complex backgrounds. In (c), the background is complex and colorful, and the background elements lie almost on the same plane, which leads to over-detection and false detections in most comparison methods. In (d), the salient target is small, and the background is complex and close in color to the foreground, so the comparison methods fail to capture the details of the salient target. The target detected by this invention is relatively complete.

[0105] (e), (f), and (g) illustrate cases with low-quality depth maps. In (e), the horse's tail is very fine, making it difficult for most comparison methods to detect. The depth map in (f) fails to provide useful information, and the salient target has a complex shape and is subject to background interference. The depth map in (g) provides interfering information, leading to false detections and over-detection in the comparison methods. This invention predicts the salient targets completely and accurately.

[0106] (h) and (i) illustrate the case of multiple targets. In (h), three targets appear simultaneously and the depth map contains some interference; some comparison methods detect only two of the targets or detect them incompletely. In (i), the two targets are brightly colored and occupy a large area, the background is complex and cluttered, and the water cup in the lower right corner introduces misleading information, so the comparison methods perform poorly on this case. In contrast, the present invention produces satisfactory results.

[0107] (j) and (k) illustrate cases with small target objects. Compared with the other methods, the present invention identifies the outlines of small targets more accurately.

[0108] (l) shows a case of severely uneven lighting, where the lower half of the salient figure is subject to strong lighting interference, so the comparison methods cannot detect it accurately.

[0109] (m) shows a case where the colors inside the salient object differ sharply and the contrast is high, so the comparison methods cannot detect the target completely and continuously.

[0110] In summary, the present invention produces relatively reliable detection results in all of these situations. Therefore, the salient target detection method of the present invention has higher detection accuracy and better saliency detection performance than the prior art.

[0111] Some steps in the embodiments of the present invention can be implemented using software, and the corresponding software program can be stored in a readable storage medium, such as an optical disc or a hard disk.

[0112] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A salient object detection method, characterized in that the method includes:
Step 1: obtain the RGB image and the corresponding depth image of the target to be detected;
Step 2: extract multi-level RGB image features and depth image features, respectively, using a symmetric dual-stream feature extraction backbone network, where i is the layer index of the extracted features and N is the total number of extracted feature layers;
Step 3: use a multi-scale self-attention feature enhancement module (MSFE) to process the deepest-level RGB and depth features extracted in step 2 to obtain enhanced high-level RGB and depth features;
Step 4: use the cross-modal feature fusion module (CFF) to fuse the enhanced high-level RGB and depth features, and likewise the RGB and depth features of the first N-1 layers, to obtain the fused features;
Step 5: decode the features of the RGB branch, the depth branch, and the fusion branch layer by layer using a progressive weighted fusion decoder (PWF) to obtain the predicted saliency image of the RGB branch S_r, the predicted saliency image of the depth branch S_d, and the predicted saliency image of the fusion branch S_f;
Step 6: take the predicted saliency image of the fusion branch, S_f, as the salient target detection result;
wherein Step 5 includes:
Step 51: use symmetrically arranged single-modal feature aggregation modules (SFA) to preliminarily aggregate the single-modal features in the RGB branch and in the depth branch, respectively, with the output of each SFA module serving as the input of the next SFA module, finally obtaining the predicted saliency image of the RGB branch S_r and the predicted saliency image of the depth branch S_d;
Step 52: use dynamic weighted fusion modules (DWF) to jointly fuse and decode the RGB branch features and depth branch features aggregated by the SFA modules together with the fusion branch features, where a dynamic weight k is generated through a residual channel attention operation and the feature information of the different modalities is combined according to its importance, and the output of each DWF module serves as the input of the next DWF module, thereby completing the progressive fusion and obtaining the initial saliency image S_f.

2. The salient target detection method according to claim 1, characterized in that the MSFE module in step 3 processes the deepest-level RGB and depth features as follows:
Step 31: process the input features with a 3×3 convolutional layer with BatchNorm normalization and PReLU activation;
Step 32: feed the output of Step 31 into five parallel convolutional branches to extract multi-scale features; the first branch uses global average pooling to extract global information from the high-level features, followed by a Conv1×1 convolution and linear upsampling to adjust the channel count and spatial size; the last branch uses a 1×1 convolution with BatchNorm normalization and PReLU activation; in the three middle branches, spatially separable convolutions with kernel sizes of 3, 5, and 7 are followed by dilated convolutions with dilation rates of 3, 5, and 7, respectively, to capture multi-scale features, and a self-attention mechanism is added after each dilated convolution to extract contextual information;
Step 33: combine the outputs of the five branches by concatenation to obtain the enhanced high-level RGB and depth features.

3. The salient target detection method according to claim 1, characterized in that the processing procedure of the CFF module in step 4 includes:
Step 41: multiply the input RGB features and depth features, and use the spatial attention mechanism (SA) to calculate the common spatial attention feature;
Step 42: multiply the original RGB features and depth features each by the common spatial attention feature, pass the weighted features through residual connections, and input the results into the channel attention mechanism (CA);
Step 43: multiply the features obtained in Step 42 by the original RGB features and depth features, respectively, to obtain the RGB channel calibration features and the depth channel calibration features;
Step 44: perform element-wise addition, element-wise multiplication, and channel concatenation on the RGB channel calibration features and the depth channel calibration features, then normalize with the Sigmoid function to obtain the fusion weight m;
Step 45: use the fusion weight m to combine the two branch features, and pass the result through a skip multiplication connection with a channel attention mechanism (CA) to obtain the fused features.

4. The salient target detection method according to claim 1, characterized in that the processing procedure of the SFA module includes: performing a linear upsampling operation on the output features of the previous SFA module, concatenating them with the features of the corresponding layer along the channel dimension, and inputting the concatenation result into a 3×3 convolution with a BatchNorm normalization layer and a ReLU activation layer to obtain the modality aggregation features.

5. The salient target detection method according to claim 1, characterized in that the processing procedure of the i-th DWF module includes:
concatenating the input RGB aggregated features and depth aggregated features along the channel dimension and passing the result through a Conv3×3 convolutional layer for channel dimensionality reduction, yielding the fused RGB-D decoder feature;
concatenating the features output by the previous DWF module with the corresponding fusion features and feeding the result into a Conv3×3 convolutional layer to obtain the previous-level fusion feature;
concatenating the RGB-D decoder feature and the previous-level fusion feature, inputting the concatenation result into a Conv1×1 convolution and a residual structure containing a channel attention mechanism, and obtaining a learnable dynamic weight k through Bconv3×3 convolutional layers and a sigmoid layer;
combining the RGB-D decoder feature with the previous-level fusion feature through the dynamic weight k, and then passing the result through two consecutive Bconv3×3 convolutional layers and a linear upsampling layer to produce the output of the module.

6. The salient target detection method according to claim 1, characterized in that the method employs a hybrid loss function to train the network, the hybrid loss function being obtained by adding the cross-entropy loss function, the intersection-over-union loss function, and the structural similarity loss function; the total loss function of the network is $Loss = L(S_r, GT) + L(S_d, GT) + L(S_f, GT)$, where $L(S_r, GT)$, $L(S_d, GT)$, and $L(S_f, GT)$ denote the hybrid loss functions of the RGB branch, the depth branch, and the fusion branch, respectively, $GT$ denotes the corresponding ground-truth image, and $S_r$, $S_d$, and $S_f$ denote the saliency images predicted by the three branches.

7. The salient target detection method according to claim 1, characterized in that the dual-stream feature extraction backbone network uses the Swin Transformer network.

8. An electronic device, characterized in that it comprises a memory and a processor; the memory is used to store a computer program; the processor is configured to implement the salient target detection method according to any one of claims 1 to 7 when executing the computer program.

9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the salient target detection method according to any one of claims 1 to 7.