An image instance segmentation method and system based on multi-modal fusion

By combining RGB and depth feature extraction networks, and utilizing local multi-head self-attention modules and cross-attention mechanisms, the image instance segmentation system with multimodal fusion addresses the problems of segmentation accuracy and robustness in densely stacked object scenes, achieving efficient and accurate instance segmentation.

CN122244438A · Pending · Publication Date: 2026-06-19 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HUAZHONG UNIV OF SCI & TECH
Filing Date: 2026-03-16
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing instance segmentation methods have poor segmentation accuracy in scenes with densely stacked objects, difficulty in distinguishing boundaries when objects have similar appearances, and insufficient robustness to changes in lighting and material. Traditional methods are also time-consuming and resource-intensive.

Method used

A multimodal fusion image instance segmentation system is adopted, which combines RGB and depth feature extraction networks. Through local multi-head self-attention modules and cross-attention mechanisms, depth features are extracted and enhanced, and visual and spatial information are fused to achieve more accurate instance segmentation.

Benefits of technology

It improves segmentation accuracy and robustness in dense object scenes, reduces computational resource consumption, and enhances segmentation efficiency.

✦ Generated by Eureka AI based on patent content.

[Figure: CN122244438A_ABST]

Abstract

This invention discloses a multimodal fusion image instance segmentation method and system, belonging to the field of image processing. The system utilizes a local multi-head self-attention module to focus on information with consistent distribution within local windows in the image, enhancing the features of the depth map. This strengthens the image instance segmentation network's perception of local information distribution and improves its ability to detect object edges. A feature fusion module uses depth map features as guiding information to fuse with RGB features, further enhancing the RGB features. Finally, a segmentation prediction head analyzes the fused features to obtain the segmentation result. By introducing spatial information from the depth map, this invention helps the image instance segmentation network better perceive object edges, solving problems such as mask adhesion and unclear output mask boundaries, thus improving the accuracy and effectiveness of image instance segmentation. This invention is well-suited for dense object scenes and can adapt to segmentation tasks involving objects with different appearance features.

Description

Technical Field

[0001] This invention belongs to the field of image processing and, more specifically, relates to an image instance segmentation method and system based on multimodal fusion.

Background Technology

[0002] In many tasks, such as object counting, pedestrian recognition, and pose estimation, instance segmentation methods play a crucial role as a key component. With the development of intelligent technologies, instance segmentation methods are being applied to various scenarios, such as autonomous driving and industrial automated manufacturing. Therefore, researching an effective instance segmentation method is of great significance.

[0003] Traditional instance segmentation methods use the edge information of objects and the contrast between foreground and background to achieve instance segmentation. However, stringent hyperparameter settings, poor generalization and segmentation accuracy, and slow inference speed make it difficult for traditional methods to meet large-scale segmentation needs in various scenarios. The development of deep learning technology provides a more efficient solution for instance segmentation.

[0004] However, existing deep learning-based instance segmentation methods still face many challenges. In scenes with densely stacked objects, mutual occlusion makes it difficult for the network to extract complete object features and achieve accurate segmentation. Furthermore, when these stacked objects have very similar appearances, it becomes difficult to distinguish object boundaries using appearance features, leading to issues such as mask merging. In addition, changes in scene lighting and the surface material of objects affect the appearance features of objects in the image, making it difficult for most current instance segmentation methods to achieve generalization and robustness across different scenes.

[0005] To address poor segmentation accuracy and mask adhesion in densely stacked object scenes, some instance segmentation methods employ cumbersome image preprocessing or post-processing operations to ensure uniformity across different scenes or to simulate the characteristics of different scenes. However, these methods require additional processing time and computational resources. Therefore, there is an urgent need for a simpler and more efficient instance segmentation method and system.

Summary of the Invention

[0006] In view of the above-mentioned defects or improvement needs of the existing technology, the present invention provides an image instance segmentation method and system based on multimodal fusion, which can achieve more accurate, efficient and robust image instance segmentation.

[0007] To achieve the above objectives, according to a first aspect of the present invention, an image instance segmentation system based on multimodal fusion is provided, comprising:

An RGB feature extraction network, used to extract multi-scale RGB feature maps from the RGB image of the object to be segmented.

A depth feature extraction network, used to extract multi-scale depth feature maps from the depth image of the object to be segmented. The multi-scale depth feature maps and the multi-scale RGB feature maps correspond one-to-one in scale, and the feature dimensions are consistent at the same scale.

A local multi-head self-attention module, used to process the depth feature map at each scale as the target depth feature map to obtain the final enhanced depth features at that scale. The processing includes: projecting the target depth feature map into a query matrix Q, a key matrix K, and a value matrix V, each with the same feature dimension as the target depth feature map; dividing the feature dimension of Q, K, and V into M groups according to the preset number of attention heads M, to obtain M groups of Q, K, V feature sub-maps; dividing the m-th group of Q, K, V feature sub-maps into I local windows, where the i-th local window is the window of size k×k centered on the i-th feature pixel of the m-th group of Q, K, V feature sub-maps, i = 1, 2, …, I, and I is the number of feature pixels in a feature sub-map; performing a self-attention operation on the local query matrix, local key matrix, and local value matrix within the i-th local window to obtain the local enhanced features of the i-th local window; merging the local enhanced features of the I local windows to obtain the sub-map enhanced features of the m-th group of Q, K, V feature sub-maps; and concatenating the sub-map enhanced features of the M groups of Q, K, V feature sub-maps to obtain the preliminary enhanced depth features, which are then residually connected with the target depth feature map to obtain the final enhanced depth features.

A feature fusion module, used to fuse the final enhanced depth features at each scale with the RGB feature map to obtain enhanced RGB features, and to add them to the RGB feature map to obtain the fused features at each scale.

A segmentation prediction head, including a category branch, a bounding box branch, and a mask branch, used to process the fused features at all scales to obtain the category, bounding box, and mask of each object in the object to be segmented.

[0008] According to a second aspect of the present invention, a training method for an image instance segmentation system based on multimodal fusion as described in the first aspect is provided, comprising: The RGB feature extraction network, depth feature extraction network, local multi-head self-attention module, feature fusion module, and segmentation prediction head of the image instance segmentation system are trained in a supervised manner using a training dataset.

[0009] According to a third aspect of the present invention, an image instance segmentation method based on multimodal fusion is provided, comprising: The RGB image and depth image of the object to be segmented are input into the image instance segmentation system based on multimodal fusion as described in the first aspect to obtain the category, bounding box and mask of each object in the object to be segmented.

[0010] According to a fourth aspect of the present invention, an electronic device is provided, comprising: a computer-readable storage medium and a processor; The computer-readable storage medium is used to store executable instructions; The processor is configured to read executable instructions stored in the computer-readable storage medium and execute the method described in the second or third aspect.

[0011] According to a fifth aspect of the invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to perform the method as described in the second or third aspect.

[0012] According to a sixth aspect of the invention, a computer program product is provided, comprising a computer program or instructions that, when executed by a processor, implement the method described in the second or third aspect.

[0013] In summary, compared with the prior art, the above-described technical solutions conceived by this invention can achieve the following beneficial effects: Existing methods for instance segmentation rely solely on RGB appearance features. These methods extract the RGB visual appearance features of an image and segment objects based on differences in appearance between objects and between foreground and background. However, in scenes with densely packed objects, especially when different objects have similar appearances, relying solely on visual appearance features is insufficient to distinguish object boundaries, resulting in poor instance segmentation performance. Unlike existing methods, the segmentation system provided in this invention introduces a depth map modality. By extracting and enhancing depth features, spatial information is extracted from the depth image. This spatial information is then incorporated into the instance segmentation network, enabling the network to determine object boundaries both visually and spatially, effectively addressing the problems caused by similar object appearances and dense object stacking.

[0014] Furthermore, some existing methods directly use the depth map as a new input channel, so that the instance segmentation network learns the feature projections of the RGB image and the depth map simultaneously. Although this provides some spatial enhancement, it can only achieve a simple depth feature projection. Unlike these methods, the segmentation system provided in this invention does not simply use the depth map as a new input channel to introduce the depth modality. Instead, it designs a depth feature extraction network and a local multi-head self-attention module to extract depth modal information. The depth feature extraction network extracts multi-scale features from the depth map, and the local multi-head self-attention module slides over the depth feature map and lets the depth features within each local window interact, discovering the continuously distributed spatial information in the region. This lets the segmentation system define more clearly the overall region belonging to the same object, which improves the instance segmentation effect.

[0015] As a further preferred embodiment, the segmentation system provided by this invention does not fuse RGB features and depth features through addition or concatenation. Simple addition only lets the two modal features interact in numerical amplitude and fails to reflect the characteristic differences between modalities, while concatenation doubles the number of parameters and the computational cost of subsequent operations on the concatenated features. Instead, the system uses a cross-attention mechanism to calculate the similarity between the two modal features, allowing visual and spatial information to fully interact. The attention information after interaction is then used to enhance the original RGB features, fusing visual and spatial information in a simple and efficient manner. This gives the instance segmentation network a stronger perception of object boundaries and contributes to better instance segmentation results.

Attached Figure Description

[0016] Figure 1 is a flowchart of the instance segmentation system based on multimodal fusion provided in an embodiment of the present invention;
Figure 2 is a network structure diagram of the depth feature extraction network provided in an embodiment of the present invention;
Figure 3 is a network structure diagram of the feature fusion module provided in an embodiment of the present invention;
Figure 4 is a structure diagram of the segmentation prediction head provided in an embodiment of the present invention;
Figure 5 is an overall architecture diagram of image instance segmentation based on multimodal fusion for workpiece stacking images provided by an embodiment of the present invention.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0018] This invention provides an image instance segmentation system based on multimodal fusion. As shown in Figure 1, it includes: An RGB feature extraction network, used to extract multi-scale RGB feature maps from the RGB image of the object to be segmented.

[0019] Specifically, the input to the RGB feature extraction network is an RGB image of the object to be segmented, acquired by an industrial depth camera or depth sensor, and the output is a corresponding multi-scale RGB feature map. It is understood that industrial depth cameras or depth sensors need to ensure high image quality, minimal noise, and minimal changes in image content due to environmental factors.

[0020] Various RGB feature extraction networks can be used, such as MobileNet, YOLO, and Vision Transformer. This embodiment of the invention does not limit the choice of network.

[0021] A depth feature extraction network is used to extract multi-scale depth feature maps from the depth image of the object to be segmented. The multi-scale depth feature maps and the multi-scale RGB feature maps correspond one-to-one in scale, and the feature dimensions are consistent at the same scale.

[0022] The depth feature extraction network can adopt existing structures such as MobileNet, YOLO, and Vision Transformer. Because the input of the depth feature extraction network is a single-channel depth image, a lightweight network is sufficient to extract its depth features. Based on this, the depth feature extraction network preferably includes a standard convolutional layer and N depthwise separable convolutional layers connected in sequence.

[0023] Specifically, the input to the depth feature extraction network is a single-channel depth image of the object to be segmented, acquired by a depth camera or depth sensor, and the output is a corresponding multi-scale depth feature map.

[0024] The depth feature extraction network consists of multiple layers of depthwise separable convolutions. These layers extract multi-scale features from the depth image and output depth features at multiple scales. Each scale's depth feature then needs to be fused with RGB features to obtain multi-scale fused features. Therefore, the multi-scale depth feature map output by the depth feature extraction network must maintain the same feature dimension as the multi-scale RGB feature map output by the RGB feature extraction network at each scale.

[0025] Take a depth feature extraction network that outputs depth feature maps at three scales as an example. As shown in Figure 2, the depth feature extraction network includes a standard convolutional layer (e.g., 3×3) and four depthwise separable convolutional layers connected in sequence. The 3×3 convolution performs a preliminary feature projection on the input depth image, and the stacked depthwise separable convolutions then extract multi-scale depth features. In this example, the feature dimensions output by the first to fourth depthwise separable convolutional layers are 32, 64, 128, and 256, respectively. The feature dimensions of the RGB feature maps are 64, 128, and 256, respectively, and the scales of the RGB feature maps must correspond one-to-one with the scales of the depth feature maps. Therefore, the depth feature maps with feature dimensions of 64, 128, and 256 output by the second to fourth depthwise separable convolutional layers are used as the depth feature maps of the first to third scales, respectively.
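The scale matching in this example can be sketched in a few lines of Python. This is only an illustration of the selection rule described above; the function name `select_matching_stages` is hypothetical and the matching-by-channel-count rule is an assumption drawn from the example, not a procedure stated in the patent.

```python
def select_matching_stages(depth_dims, rgb_dims):
    """Return the 0-based indices of the depth stages whose feature dimension
    matches each RGB scale, keeping the stages in increasing order."""
    selected = []
    start = 0
    for rgb_dim in rgb_dims:
        for idx in range(start, len(depth_dims)):
            if depth_dims[idx] == rgb_dim:
                selected.append(idx)
                start = idx + 1
                break
        else:
            # No remaining depth stage has the required feature dimension.
            raise ValueError(f"no depth stage with {rgb_dim} channels")
    return selected


# Feature dimensions from the example: four depthwise separable layers,
# three RGB scales.
depth_stage_dims = [32, 64, 128, 256]
rgb_scale_dims = [64, 128, 256]
stages = select_matching_stages(depth_stage_dims, rgb_scale_dims)
print(stages)  # [1, 2, 3], i.e. the second to fourth layers
```

If the RGB backbone produced a feature dimension that no depth stage offers, the sketch raises an error, which mirrors the requirement that the two networks agree at every scale.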

[0026] The local multi-head self-attention module is used to process the depth feature map at each scale as the target depth feature map to obtain the final enhanced depth features at that scale. The processing includes: projecting the target depth feature map into a query matrix Q, a key matrix K, and a value matrix V, each with the same feature dimension as the target depth feature map; dividing the feature dimension of Q, K, and V into M groups according to the preset number of attention heads M, to obtain M groups of Q, K, V feature sub-maps; dividing the m-th group of Q, K, V feature sub-maps into I local windows, where the i-th local window is the window of size k×k centered on the i-th feature pixel of the m-th group of Q, K, V feature sub-maps, i = 1, 2, …, I, and I is the number of feature pixels in a feature sub-map; performing a self-attention operation on the local query matrix, local key matrix, and local value matrix within the i-th local window to obtain the local enhanced features of the i-th local window; merging the local enhanced features of the I local windows to obtain the sub-map enhanced features of the m-th group of Q, K, V feature sub-maps; and concatenating the sub-map enhanced features of the M groups of Q, K, V feature sub-maps to obtain the preliminary enhanced depth features, which are then residually connected with the target depth feature map to obtain the final enhanced depth features.

[0027] The local multi-head self-attention module takes the multi-scale depth feature maps output by the depth feature extraction network as input. For the depth feature map at each scale, it slides over the map in the form of local windows and performs multi-head self-attention on the elements within each local window. This lets depth features interact within local regions, uncovering continuously distributed spatial information in those regions and enhancing the spatial perception capability of the depth features. Specifically, the module processes the depth feature map at each scale as the target depth feature map. The processing flow includes five steps: feature projection, multi-head division, window division, self-attention calculation, and multi-head splicing.

1. Feature projection: The target depth feature map is processed by three 1×1 convolutional layers and projected into three feature matrices, denoted as the query matrix Q, the key matrix K, and the value matrix V. The feature dimensions of Q, K, and V are the same as those of the target depth feature map.

[0028] Q, K, and V are calculated as follows:

$$Q = W_Q F_d + b_Q; \quad K = W_K F_d + b_K; \quad V = W_V F_d + b_V$$

where $F_d$ is the target depth feature map, $W_Q$, $W_K$, and $W_V$ are learnable weights, and $b_Q$, $b_K$, and $b_V$ are learnable biases.
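As a rough illustration of the feature projection step, a 1×1 convolution can be written as the same linear map applied at every feature pixel. The NumPy sketch below assumes a channel-first layout `(C, H, W)` and random placeholder weights; the helper name `conv1x1` and all sizes are hypothetical and chosen only for the example.

```python
import numpy as np


def conv1x1(x, weight, bias):
    """Apply a 1x1 convolution to a (C_in, H, W) feature map.

    weight has shape (C_out, C_in) and bias has shape (C_out,); the same
    linear map is applied independently at every feature pixel.
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


rng = np.random.default_rng(0)
C, H, W = 8, 5, 5                      # illustrative sizes only
f_d = rng.standard_normal((C, H, W))   # stand-in for the target depth feature map

# One (weight, bias) pair per projection; values are random placeholders.
params = {
    name: (rng.standard_normal((C, C)) * 0.1, rng.standard_normal(C) * 0.1)
    for name in ("q", "k", "v")
}
q, k, v = (conv1x1(f_d, *params[name]) for name in ("q", "k", "v"))
print(q.shape, k.shape, v.shape)       # each keeps the shape of f_d
```

Because the output channel count equals the input channel count, the three projections keep the feature dimension of the target depth feature map, as the text requires.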

[0029] 2. Multi-head division: The feature dimension of each part (i.e., each feature matrix) is divided equally according to the number of attention heads, so that each attention head can learn to perform a different operation.

[0030] For example, assuming the feature dimension of the target depth feature map is 256 and the number of attention heads is M=4, the feature dimension 256 of Q, K, and V is divided equally by the number of attention heads, giving 4 groups of Q, K, V feature sub-maps, each with a feature dimension of 64.
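The division in this example amounts to a reshape along the channel axis. The sketch below assumes a channel-first `(C, H, W)` layout and contiguous channel groups per head; the function name `split_heads` and the 10×10 spatial size are illustrative assumptions.

```python
import numpy as np


def split_heads(x, num_heads):
    """Split a (C, H, W) feature matrix into num_heads sub-maps of C // num_heads channels."""
    c, h, w = x.shape
    if c % num_heads != 0:
        raise ValueError("feature dimension must be divisible by the number of heads")
    return x.reshape(num_heads, c // num_heads, h, w)


# Feature dimension 256 and M = 4 heads, as in the example; 10 x 10 is arbitrary.
q = np.arange(256 * 10 * 10, dtype=float).reshape(256, 10, 10)
q_heads = split_heads(q, 4)
print(q_heads.shape)  # (4, 64, 10, 10)
```

The same call would be applied to the key and value matrices so that each head receives matching Q, K, V sub-maps.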

[0031] 3. Window division: Local windows are divided within the feature sub-maps of each attention head, with each feature pixel corresponding to one k×k local window.

[0032] Continuing with the example above, assume each of the 4 groups of Q, K, V feature sub-maps (feature dimension 64) has 100 feature pixels. The first group of Q, K, V feature sub-maps is then divided into 100 local windows, where the first local window is the k×k region centered on the first feature pixel in the first group of Q, K, V feature sub-maps, and the other local windows are defined similarly. The other groups of Q, K, V feature sub-maps are divided in the same way.
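A minimal sketch of the window division follows. The text does not say how windows that extend past the border are handled, so zero padding is assumed here, and k is assumed odd so that each window has a center pixel; `extract_windows` is a hypothetical helper name.

```python
import numpy as np


def extract_windows(x, k):
    """Return one k x k window per feature pixel of a (C, H, W) sub-map.

    Output shape is (H * W, k * k, C): window i is centered on pixel i
    (row-major order). Borders are zero padded, which is an assumption.
    """
    if k % 2 == 0:
        raise ValueError("k is assumed odd so every window has a center pixel")
    c, h, w = x.shape
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    windows = np.empty((h * w, k * k, c), dtype=x.dtype)
    for row in range(h):
        for col in range(w):
            patch = padded[:, row:row + k, col:col + k]   # (C, k, k)
            windows[row * w + col] = patch.reshape(c, k * k).T
    return windows


# One head's sub-map from the example: 64 channels, 100 feature pixels.
sub_map = np.arange(64 * 10 * 10, dtype=float).reshape(64, 10, 10)
windows = extract_windows(sub_map, k=3)
print(windows.shape)  # (100, 9, 64): 100 windows of 3 x 3 pixels
```

With 100 feature pixels the sketch produces 100 windows, matching the count in the example; the center entry of each window is the feature pixel itself.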

[0033] 4. Self-attention calculation: A self-attention operation is performed within each local window to obtain the local enhanced features of that window. The calculation formula is:

$$\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d}}\right) V_i$$

where $Q_i$ is the feature matrix of the local window region in Q, $K_i$ and $V_i$ are the feature matrices of the local window region at the corresponding position in K and V, respectively, $d$ is the feature dimension of each attention head, and $T$ denotes the matrix transpose operation.
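For a single local window, the operation can be sketched as standard scaled dot-product attention. The softmax normalization and the square-root scaling by the per-head feature dimension are assumptions based on the usual Transformer formulation; the names `softmax` and `window_self_attention` and the window size are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def window_self_attention(q_i, k_i, v_i):
    """Scaled dot-product self-attention inside one local window.

    q_i, k_i, v_i each have shape (k * k, d): one row per pixel in the
    window, d features per attention head.
    """
    d = q_i.shape[-1]
    scores = q_i @ k_i.T / np.sqrt(d)        # (k*k, k*k) pairwise similarities
    return softmax(scores, axis=-1) @ v_i    # (k*k, d) local enhanced features


rng = np.random.default_rng(0)
q_i, k_i, v_i = (rng.standard_normal((9, 64)) for _ in range(3))  # 3 x 3 window, d = 64
local_enhanced = window_self_attention(q_i, k_i, v_i)
print(local_enhanced.shape)  # (9, 64)
```

Each output row is a weighted average of the value rows in the window, so pixels with similar depth features contribute most to one another.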

[0034] 5. Multi-head splicing: After the self-attention calculation is completed for each local window, the outputs of the attention heads are concatenated along the feature dimension to restore the original feature size, giving the preliminary enhanced depth features. Finally, the preliminary enhanced depth features are residually connected with the target depth feature map to obtain the final enhanced depth features. This introduces the enhanced information from local-region interactions while retaining the original information.

[0035] Continuing with the example above, in the multi-head splicing step, the local enhanced features of the 100 local windows in the first group of Q, K, V feature sub-maps are merged to obtain the sub-map enhanced features of the first group; the local enhanced features of the 100 local windows in the second group of Q, K, V feature sub-maps are merged to obtain the sub-map enhanced features of the second group; and so on for the third and fourth groups. The sub-map enhanced features of groups 1 to 4 are then concatenated to obtain the preliminary enhanced depth features. Finally, the preliminary enhanced depth features are residually connected with the target depth feature map to obtain the final enhanced depth features.
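The five steps can be put together in one NumPy sketch. Several choices here are assumptions rather than details given in the text: zero padding at the borders, an odd window size k, scaled softmax attention, and, in particular, the merge rule, which keeps only the attention output at the center pixel of each window so that every feature pixel receives exactly one vector. The function names and all sizes are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def extract_windows(x, k):
    """(C, H, W) -> (H*W, k*k, C), one zero-padded k x k window per pixel."""
    c, h, w = x.shape
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    view = sliding_window_view(padded, (k, k), axis=(1, 2))   # (C, H, W, k, k)
    return view.transpose(1, 2, 3, 4, 0).reshape(h * w, k * k, c)


def local_multihead_self_attention(f_d, weights, biases, num_heads, k):
    """Sketch of the module: project, split heads, window, attend, splice, add residual."""
    c, h, w = f_d.shape
    d = c // num_heads
    center = (k * k) // 2
    # Step 1: 1x1 projections into query, key, and value matrices.
    q, key, v = (
        np.einsum("oc,chw->ohw", weights[n], f_d) + biases[n][:, None, None]
        for n in ("q", "k", "v")
    )
    head_outputs = []
    for m in range(num_heads):
        channels = slice(m * d, (m + 1) * d)                  # Step 2: head m
        q_w, k_w, v_w = (extract_windows(t[channels], k) for t in (q, key, v))  # Step 3
        scores = q_w @ k_w.transpose(0, 2, 1) / np.sqrt(d)    # Step 4: (N, k*k, k*k)
        out = softmax(scores, axis=-1) @ v_w                  # (N, k*k, d)
        # Assumed merge rule: keep the center-pixel output of each window.
        head_outputs.append(out[:, center, :].T.reshape(d, h, w))
    preliminary = np.concatenate(head_outputs, axis=0)        # Step 5: splice heads
    return preliminary + f_d                                  # residual connection


rng = np.random.default_rng(0)
C, H, W, M, K = 16, 6, 6, 4, 3
f_d = rng.standard_normal((C, H, W))
weights = {n: rng.standard_normal((C, C)) * 0.1 for n in ("q", "k", "v")}
biases = {n: np.zeros(C) for n in ("q", "k", "v")}
enhanced = local_multihead_self_attention(f_d, weights, biases, M, K)
print(enhanced.shape)  # (16, 6, 6), same as the input map
```

Other merge rules, such as averaging the overlapping window outputs that cover a pixel, would fit the same description; the center-pixel rule is used here only because it is the simplest.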

[0036] In the aforementioned local multi-head self-attention module, the sliding window concept of convolution and the self-attention mechanism in Transformer are combined. A local window is used to slide through the depth feature map, searching for the neighborhood of each feature pixel. Through self-attention operations within the neighborhood, spatial information with similar distributions is discovered. This spatial information can represent continuous regions belonging to the same object, which helps the image instance segmentation network perceive object regions and improves the instance segmentation effect.

[0037] The feature fusion module is used to fuse the final enhanced depth features at each scale with the RGB feature map to obtain enhanced RGB features, and then add them to the RGB feature map to obtain the fused features at each scale.

[0038] Specifically, the feature fusion module can employ methods such as element-wise addition, averaging, and channel-wise concatenation.

[0039] Preferably, the feature fusion module achieves information interaction between the two modal features through cross-attention operations, thereby obtaining final features that combine visual perception and depth perception. That is, the feature fusion module performs cross-attention calculations on the final enhanced depth features and the RGB feature map at each scale to obtain enhanced RGB features, and adds them to the RGB feature map to obtain the fused features at each scale.

[0040] Specifically, feature fusion applies a cross-attention operation to the RGB features and the final enhanced depth features at each scale to obtain the enhanced RGB features. The formula for the cross-attention operation is: [formula missing in source]; the formula uses the number of feature channels of the RGB features, and T denotes the matrix transpose operation.

[0041] After the cross-attention operation is completed, the enhanced RGB features at each scale are added to the original RGB features, and the sum serves as the final output feature at that scale. The final output features across all scales form the multi-scale features. This preserves the original visual appearance information while introducing depth information to guide the features' perception of space.
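One possible reading of the fusion step is sketched below. The text does not state which modality supplies the queries, so this sketch assumes queries from the RGB features and keys and values from the enhanced depth features, attention over all spatial positions, no extra learned projections, and scaling by the square root of the RGB channel count. The function name `cross_attention_fusion` and the sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention_fusion(f_rgb, f_depth):
    """Fuse RGB and enhanced depth features of identical shape (C, H, W).

    Assumption: RGB features act as queries, depth features as keys and
    values. The attention output (enhanced RGB features) is added back to
    the original RGB features.
    """
    if f_rgb.shape != f_depth.shape:
        raise ValueError("RGB and depth features must match at each scale")
    c, h, w = f_rgb.shape
    queries = f_rgb.reshape(c, h * w).T            # (N, C), one row per pixel
    keys_values = f_depth.reshape(c, h * w).T      # (N, C)
    attn = softmax(queries @ keys_values.T / np.sqrt(c), axis=-1)   # (N, N)
    enhanced_rgb = (attn @ keys_values).T.reshape(c, h, w)
    return f_rgb + enhanced_rgb                    # residual sum = fused features


rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((64, 8, 8))
f_depth = rng.standard_normal((64, 8, 8))
fused = cross_attention_fusion(f_rgb, f_depth)
print(fused.shape)  # (64, 8, 8)
```

Unlike channel concatenation, the fused output keeps the channel count of the RGB features, so the layers that follow do not grow in size.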

[0042] The segmentation prediction head includes a category branch, a bounding box branch, and a mask branch, which are used to process multi-scale features to obtain the category, bounding box, and mask of each object in the object to be segmented.

[0043] Specifically, the input to the segmentation prediction head is multi-scale features. By performing dimensionality reduction and prediction on these features, the output is the category, bounding box, and mask of each object in the object to be segmented, thus achieving instance segmentation.

[0044] Each branch of the segmentation prediction head predicts relevant information according to different output dimension requirements: the category prediction head (i.e., the category branch) predicts the category of the target, the bounding box prediction head (i.e., the bounding box branch) predicts the position and size of the target bounding box, and the mask prediction head (i.e., the mask branch) predicts the pixel region of the target mask.

[0045] Preferably, both the category branch and the bounding box branch include a standard convolutional layer, a BN normalization layer, and a ReLU activation function layer connected in sequence; The mask branch includes a standard convolutional layer, an upsampling layer, and a standard convolutional layer connected in sequence.

[0046] For example, both the category branch and the bounding box branch can perform dimensionality reduction mapping on the features using 1×1 convolutional layers, normalization layers, and activation function layers. The category branch maps the output dimension to the number of categories, while the bounding box branch maps the output dimension to 4, covering the coordinates, width, and height of the object's bounding box. The mask branch uses 3×3 convolutional layers and upsampling layers to upsample the feature map to 1/4 of the original image size and extracts the mask coefficients. The original features are multiplied by the mask coefficients to obtain the predicted mask, and bilinear interpolation upsampling then restores the mask to the original image size, giving the final output mask.
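The mask computation can be illustrated with a short NumPy sketch. It assumes a prototype-style layout in which a small stack of low-resolution feature maps is combined linearly by one object's mask coefficients, followed by a sigmoid, bilinear upsampling by a factor of 4, and a 0.5 threshold. None of these specifics (the sigmoid, the threshold, the pixel-center sampling convention, or the names `bilinear_resize` and `assemble_mask`) are stated in the text.

```python
import numpy as np


def bilinear_resize(img, out_h, out_w):
    """Bilinear interpolation of a 2-D map (pixel centers aligned, borders clamped)."""
    in_h, in_w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def assemble_mask(feature_maps, coefficients, out_h, out_w, threshold=0.5):
    """Combine (P, h, w) feature maps with (P,) coefficients into one object mask."""
    logits = np.tensordot(coefficients, feature_maps, axes=1)   # (h, w)
    prob = 1.0 / (1.0 + np.exp(-logits))                        # sigmoid
    prob_full = bilinear_resize(prob, out_h, out_w)             # back to image size
    return prob_full, prob_full > threshold


rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((4, 8, 8))   # maps at 1/4 of a 32 x 32 image
coefficients = rng.standard_normal(4)           # one object's mask coefficients
prob_full, mask = assemble_mask(feature_maps, coefficients, 32, 32)
print(prob_full.shape, mask.dtype)
```

The low-resolution combination keeps the per-object cost small, and only the final probability map is interpolated to the image size.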

[0047] Preferably, the target category, target bounding box, and target mask output by the segmentation prediction head module can be output to the user interface in a visual form (such as outputting the mask pixel range, marking the object mask region in the image, etc.) through the result output module, or output to the downstream system for subsequent processing and task execution.

[0048] This invention provides a training method for an image instance segmentation system based on multimodal fusion as described in any of the above embodiments, comprising: The RGB feature extraction network, depth feature extraction network, local multi-head self-attention module, feature fusion module, and segmentation prediction head of the image instance segmentation system are trained in a supervised manner using a training dataset.

[0049] Specifically, before training the image instance segmentation system, it is preferable to pre-train the RGB feature extraction network. After preprocessing the image samples, training is performed according to conventional training methods. To ensure that the positions of the RGB image samples and the depth image samples remain consistent, the preprocessing operation only includes normalization and does not include operations such as cropping, rotation, or flipping.

[0050] The pre-training process of the RGB feature extraction network includes:

1. Training data preparation: For example, a simulation dataset is used as the training set and a real dataset as the test set. Industrial datasets generally include simulation datasets and real datasets. Simulation datasets are relatively inexpensive and easy to produce, whereas real datasets require considerable manpower and time.

2. Constructing the network architecture: Taking the backbone of a classic instance segmentation network (YOLO11-seg) as the RGB feature extraction network as an example, the network takes the RGB image as input and uses the backbone outputs at three different scales as the output features of the RGB feature extraction network.

3. Network training: The training weights of the RGB feature extraction network are obtained by retraining it on the task training dataset. Some feature extraction networks have pre-trained weights obtained from training on large-scale datasets, and these weights can help the network extract general features in different task scenarios. Therefore, one can retain the pre-trained weights and perform feature extraction directly, fine-tune the pre-trained weights on the task dataset, or discard the pre-trained weights and retrain from the initial weights on the task dataset to extract more targeted features. This invention does not limit the choice among these training strategies.

[0051] The overall training process of the image instance segmentation system includes: joint supervised training of all modules of the image instance segmentation system, including the RGB feature extraction network, depth feature extraction network, local multi-head self-attention module, feature fusion module, and segmentation prediction head, according to a unified training strategy. Except for the RGB feature extraction network, which can use pre-trained weights, all other modules need to be trained from scratch.

[0052] This invention provides an image instance segmentation method based on multimodal fusion, comprising: inputting the RGB image and the depth image of the objects to be segmented into the image instance segmentation system based on multimodal fusion as described in any of the above embodiments, to obtain the category, bounding box, and mask of each of the objects to be segmented.
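As a rough sketch of the data flow this method implies, the five modules can be composed as below. Every callable passed in is a hypothetical stand-in (the toy `pyramid` function and the strides 8, 16, 32 are assumptions made only so the example runs); the sketch shows the order of operations, not the actual networks.

```python
import numpy as np


def segment(rgb, depth, rgb_net, depth_net, enhance_depth, fuse, head):
    """Compose the five modules; every callable here is a hypothetical stand-in."""
    rgb_feats = rgb_net(rgb)          # list of multi-scale RGB feature maps
    depth_feats = depth_net(depth)    # list of multi-scale depth feature maps
    assert len(rgb_feats) == len(depth_feats), "scales must correspond one-to-one"
    fused = []
    for f_rgb, f_depth in zip(rgb_feats, depth_feats):
        assert f_rgb.shape == f_depth.shape, "feature dimensions must match per scale"
        # Enhance the depth features first, then fuse them with the RGB features.
        fused.append(fuse(f_rgb, enhance_depth(f_depth)))
    return head(fused)                # expected to yield categories, boxes, masks


def pyramid(img, channels=4):
    """Toy feature extractor: constant maps at three assumed strides."""
    return [np.ones((img.shape[0] // s, img.shape[1] // s, channels)) for s in (8, 16, 32)]


result = segment(
    rgb=np.zeros((64, 64, 3)),
    depth=np.zeros((64, 64, 1)),
    rgb_net=pyramid,
    depth_net=pyramid,
    enhance_depth=lambda d: d,                       # identity placeholder
    fuse=lambda r, d: r + d,                         # additive placeholder
    head=lambda feats: [f.shape for f in feats],     # reports shapes only
)
print(result)
```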

[0053] This invention provides an electronic device, including: a computer-readable storage medium and a processor; The computer-readable storage medium is used to store executable instructions; The processor is used to read executable instructions stored in the computer-readable storage medium and execute the training method or segmentation method as described in any of the above embodiments.

[0054] This invention provides a computer-readable storage medium storing computer instructions for causing a processor to execute the training method or segmentation method as described in any of the above embodiments.

[0055] This invention provides a computer program product, including a computer program or instructions, which, when executed by a processor, implement the training method or segmentation method as described in any of the above embodiments.

[0056] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. An image instance segmentation system based on multimodal fusion, characterized in that it comprises: an RGB feature extraction network, used to extract multi-scale RGB feature maps from the RGB image of the object to be segmented; a depth feature extraction network, used to extract multi-scale depth feature maps from the depth image of the object to be segmented, wherein the multi-scale depth feature maps and the multi-scale RGB feature maps correspond one-to-one in scale and have the same feature dimension at each scale; a local multi-head self-attention module, used to process the depth feature map at each scale as a target depth feature map to obtain the final enhanced depth features at each scale, the processing including: projecting the target depth feature map into a query matrix Q, a key matrix K, and a value matrix V, wherein the feature dimension of Q, K, and V is the same as the feature dimension of the target depth feature map; according to a preset number of attention heads M, dividing Q, K, and V along the feature dimension into M groups of Q, K, V feature sub-maps, M>1; dividing the m-th group of Q, K, V feature sub-maps into I local windows, the i-th local window being a window of size k×k centered on the i-th feature pixel of the m-th group of Q, K, V feature sub-maps, where m=1,2,…,M, k>0, i=1,2,…,I, and I is the number of feature pixels in a feature sub-map; performing a self-attention operation based on the local query matrix, local key matrix, and local value matrix within the i-th local window to obtain the local enhanced features of the i-th local window; merging the local enhanced features of the I local windows to obtain the sub-map enhanced features of the m-th group of Q, K, V feature sub-maps; concatenating the sub-map enhanced features of the M groups of Q, K, V feature sub-maps to obtain preliminary enhanced depth features, and applying a residual connection between the preliminary enhanced depth features and the target depth feature map to obtain the final enhanced depth features; a feature fusion module, used to fuse the final enhanced depth features at each scale with the RGB feature map to obtain enhanced RGB features, and to add the enhanced RGB features to the RGB feature map to obtain the fused features at each scale; and a segmentation prediction head, including a category branch, a bounding box branch, and a mask branch, used to process the fused features at all scales to obtain the category, bounding box, and mask of each object in the object to be segmented.
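The local multi-head self-attention processing recited in claim 1 can be illustrated with a minimal NumPy sketch. Several details are assumptions, because the claim does not fix them: the projections are plain matrices `wq`, `wk`, `wv`; scores are scaled by the square root of the per-head dimension; windows are clipped at the image border; `k` is odd; and the merging of overlapping windows keeps only the output at each window's center pixel.

```python
import numpy as np


def local_multihead_self_attention(x, wq, wk, wv, num_heads, k):
    """x: (H, W, C) depth feature map; wq/wk/wv: (C, C) projections; k: odd window size."""
    h, w, c = x.shape
    assert c % num_heads == 0 and k % 2 == 1
    d = c // num_heads                      # channels per head
    r = k // 2                              # window radius
    # Project, then split the channel axis into M heads: (H, W, M, d).
    q = (x @ wq).reshape(h, w, num_heads, d)
    key = (x @ wk).reshape(h, w, num_heads, d)
    val = (x @ wv).reshape(h, w, num_heads, d)
    out = np.zeros_like(q)
    for i in range(h):
        for j in range(w):
            # k x k window centered on pixel (i, j), clipped at the border (assumption).
            i0, i1 = max(i - r, 0), min(i + r + 1, h)
            j0, j1 = max(j - r, 0), min(j + r + 1, w)
            k_loc = key[i0:i1, j0:j1].reshape(-1, num_heads, d)   # (n, M, d)
            v_loc = val[i0:i1, j0:j1].reshape(-1, num_heads, d)   # (n, M, d)
            # Scaled dot-product scores of the center query, per head: (M, n).
            scores = np.einsum("md,nmd->mn", q[i, j], k_loc) / np.sqrt(d)
            scores -= scores.max(axis=1, keepdims=True)           # stable softmax
            attn = np.exp(scores)
            attn /= attn.sum(axis=1, keepdims=True)
            out[i, j] = np.einsum("mn,nmd->md", attn, v_loc)
    # Concatenate the heads, then add the residual connection to the input.
    return out.reshape(h, w, c) + x


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
y = local_multihead_self_attention(x, wq, wk, wv, num_heads=2, k=3)
print(y.shape)
```

With `k=1` each window contains only its center pixel, so the attention weight is 1 and the output reduces to `x @ wv + x`, which is a convenient sanity check.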

2. The system as described in claim 1, characterized in that the feature fusion module fuses the final enhanced depth features with the RGB feature map at each scale based on a cross-attention mechanism to obtain the enhanced RGB features.
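A minimal single-head sketch of such cross-attention fusion at one scale follows. It assumes that queries come from the RGB features and keys and values from the enhanced depth features, that attention is global over all pixels, and that the projections are plain matrices; the claim itself does not specify these choices.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(rgb, depth, wq, wk, wv):
    """rgb, depth: (H, W, C) feature maps at one scale; wq/wk/wv: (C, C) projections."""
    h, w, c = rgb.shape
    q = rgb.reshape(-1, c) @ wq        # queries from RGB features (assumption)
    k = depth.reshape(-1, c) @ wk      # keys from enhanced depth features (assumption)
    v = depth.reshape(-1, c) @ wv      # values from enhanced depth features (assumption)
    attn = softmax(q @ k.T / np.sqrt(c))           # (H*W, H*W) attention weights
    enhanced_rgb = (attn @ v).reshape(h, w, c)     # depth-guided RGB enhancement
    return rgb + enhanced_rgb          # additive step recited in claim 1


rng = np.random.default_rng(1)
rgb = rng.normal(size=(4, 4, 8))
depth = rng.normal(size=(4, 4, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
fused = cross_attention_fuse(rgb, depth, wq, wk, wv)
print(fused.shape)
```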

3. The system as described in claim 1 or 2, characterized in that the depth feature extraction network includes a standard convolutional layer and N depthwise separable convolutional layers connected in sequence.
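To give a sense of why depthwise separable layers keep such a network light, the sketch below counts weights for both layer types. The kernel size, channel widths, and the omission of biases are illustrative assumptions, not values from the patent.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def depth_net_params(c_in: int, width: int, n: int, k: int = 3) -> int:
    """One standard convolution followed by n depthwise separable layers of equal width."""
    return standard_conv_params(c_in, width, k) + n * depthwise_separable_params(width, width, k)


print(standard_conv_params(64, 128, 3))        # 73728
print(depthwise_separable_params(64, 128, 3))  # 8768
print(depth_net_params(1, 64, 3))              # 14592
```

For the 64-to-128-channel example, the separable layer needs roughly one eighth of the weights of the standard layer.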

4. The system as described in claim 1 or 2, characterized in that the category branch and the bounding box branch each include a standard convolutional layer, a batch normalization (BN) layer, and a ReLU activation layer connected in sequence, and the mask branch includes a standard convolutional layer, an upsampling layer, and a standard convolutional layer connected in sequence.
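The layer order of the two branch types can be sketched in NumPy as follows. The sketch makes simplifying assumptions that the claim leaves open: convolutions are 1×1, batch normalization uses the statistics of the current feature map without learned scale or shift, and upsampling is nearest-neighbor by a factor of 2.

```python
import numpy as np


def conv1x1(x, w):
    """Pointwise convolution: x (H, W, C_in), w (C_in, C_out)."""
    return x @ w


def batch_norm(x, eps=1e-5):
    """Per-channel normalization over spatial positions (no learned scale/shift)."""
    return (x - x.mean(axis=(0, 1))) / np.sqrt(x.var(axis=(0, 1)) + eps)


def relu(x):
    return np.maximum(x, 0.0)


def conv_bn_relu_branch(x, w):
    """Category or bounding box branch: convolution, then BN, then ReLU."""
    return relu(batch_norm(conv1x1(x, w)))


def mask_branch(x, w1, w2, scale=2):
    """Mask branch: convolution, then upsampling, then convolution."""
    y = conv1x1(x, w1)
    y = y.repeat(scale, axis=0).repeat(scale, axis=1)   # nearest-neighbor upsampling
    return conv1x1(y, w2)


rng = np.random.default_rng(2)
feat = rng.normal(size=(4, 4, 8))
cls_out = conv_bn_relu_branch(feat, rng.normal(size=(8, 3)))
mask_out = mask_branch(feat, rng.normal(size=(8, 8)), rng.normal(size=(8, 2)))
print(cls_out.shape, mask_out.shape)
```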

5. A training method for the image instance segmentation system based on multimodal fusion as described in any one of claims 1-4, characterized in that it comprises: training the RGB feature extraction network, the depth feature extraction network, the local multi-head self-attention module, the feature fusion module, and the segmentation prediction head of the image instance segmentation system in a supervised manner using a training dataset.

6. An image instance segmentation method based on multimodal fusion, characterized in that it comprises: inputting the RGB image and the depth image of the object to be segmented into the image instance segmentation system based on multimodal fusion as described in any one of claims 1-4, to obtain the category, bounding box, and mask of each object in the object to be segmented.

7. An electronic device, characterized in that it comprises: a computer-readable storage medium and a processor; the computer-readable storage medium is used to store executable instructions; the processor is configured to read the executable instructions stored in the computer-readable storage medium and execute the training method as described in claim 5 or the segmentation method as described in claim 6.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a processor to perform the training method as described in claim 5 or the segmentation method as described in claim 6.

9. A computer program product, comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the training method as described in claim 5 or the segmentation method as described in claim 6.