Asymmetric fusion based multi-modal semantic segmentation method, device and medium
By employing an asymmetric fusion-based multimodal semantic segmentation method, and utilizing a dual-stream feature extraction module and a gated complementary feature fusion module, unidirectional collaboration between the dominant and auxiliary stream branches is achieved. This solves the problem of auxiliary modal noise contaminating RGB modal features, thereby improving the accuracy and robustness of image semantic segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WUYI UNIV
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-23
Figure CN122265643A_ABST
Abstract
Description
Technical Field
[0001] The embodiments of this application relate to, but are not limited to, the field of image processing technology, and particularly to a multimodal semantic segmentation method, device, and medium based on asymmetric fusion.
Background Technology
[0002] Image semantic segmentation requires pixel-level classification and understanding of scenes. In the development of deep learning technology, segmentation methods based on convolutional neural networks and Transformer architectures achieve good results in normal environments. However, in complex scenes such as low-light conditions, adverse weather, and visual camouflage, relying solely on information provided by the RGB modality is insufficient for stable perception tasks. To compensate for the lack of information in a single RGB modality, existing technologies typically introduce auxiliary modalities (X-modalities) such as thermal imaging or depth information to form multimodal segmentation schemes. Most current multimodal semantic segmentation methods employ symmetrical feature fusion. These methods do not consider the actual situation where there are differences in signal-to-noise ratios between different modalities. Background noise present in the auxiliary modality can be transmitted to the RGB modality through bidirectional interaction, causing contamination of RGB semantic features and thus affecting the overall segmentation effect. Furthermore, conventional linear fusion methods cannot effectively filter auxiliary information, making it difficult to fully utilize effective auxiliary information while protecting the dominant modality features. This results in segmentation accuracy and robustness in complex scenes failing to meet practical requirements.
Summary of the Invention
[0003] The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
[0004] This application provides a multimodal semantic segmentation method, device, and medium based on asymmetric fusion, which can effectively suppress auxiliary modal noise transmission, protect dominant features, and improve the accuracy and robustness of image semantic segmentation in complex environments.
[0005] This application provides a multimodal semantic segmentation method based on asymmetric fusion, comprising: acquiring a color modal image to be processed and a corresponding X modal image; inputting the color modal image and the X modal image into a pre-trained multimodal semantic segmentation network, wherein the multimodal semantic segmentation network is provided with a dual-stream feature extraction module and a gated complementary feature fusion module, the dual-stream feature extraction module is provided with a dominant stream branch and an auxiliary stream branch, the dominant stream branch is provided with a first feature extraction layer and a first output layer connected in sequence, and the auxiliary stream branch is provided with a second feature extraction layer and a second output layer connected in sequence; one end of the gated complementary feature fusion module is unidirectionally connected to the second feature extraction layer, and the other end is bidirectionally connected to the first feature extraction layer; performing feature extraction on the color modal image through the first feature extraction layer to obtain the original dominant features; performing feature extraction on the X modal image through the second feature extraction layer to obtain the original auxiliary features; processing the original dominant features through the gated complementary feature fusion module to obtain the dominant channel descriptor vector, and processing the original auxiliary features to obtain the auxiliary channel descriptor vector; calculating the enhanced dominant features based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant features, and the original auxiliary features; performing feature segmentation processing on the enhanced dominant features through the first output layer to obtain the first segmentation result; performing feature segmentation processing on the original auxiliary features through the second output layer to obtain the second segmentation result; and obtaining a multimodal semantic segmentation map based on the first segmentation result and the second segmentation result.
[0006] In one embodiment of this application, the gated complementary feature fusion module includes a first feature processing unit, a second feature processing unit, a gated weight generation unit, and a feature enhancement unit. The first feature processing unit includes a first pooling layer and a first multilayer perceptron connected in sequence. The second feature processing unit includes a second pooling layer and a second multilayer perceptron connected in sequence. The output terminals of the first multilayer perceptron and the second multilayer perceptron are connected to the gated weight generation unit, and the gated weight generation unit is connected to the feature enhancement unit.
[0007] In one embodiment of this application, the step of performing feature processing on the original dominant feature through the gated complementary feature fusion module to obtain the dominant channel descriptor vector includes: performing pooling processing on the original dominant feature through the first pooling layer to obtain dominant pooled features; and performing classification processing on the dominant pooled features through the first multilayer perceptron to obtain the dominant channel descriptor vector.
[0008] In one embodiment of this application, the step of performing feature processing on the original auxiliary features to obtain an auxiliary channel descriptor vector includes: performing pooling processing on the original auxiliary features through a second pooling layer to obtain auxiliary pooled features; and performing classification processing on the auxiliary pooled features through a second multilayer perceptron to obtain an auxiliary channel descriptor vector.
[0009] In one embodiment of this application, the step of calculating the enhanced dominant feature based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant feature, and the original auxiliary feature includes: calculating a gating weight based on the dominant channel descriptor vector and the auxiliary channel descriptor vector; calculating a preliminary fusion feature based on the gating weight, the original dominant feature, and the original auxiliary feature; performing spatial attention calculation on the preliminary fusion feature to obtain a spatial attention map; and calculating the enhanced dominant feature based on the spatial attention map and the preliminary fusion feature.
[0010] In one embodiment of this application, the training steps of the multimodal semantic segmentation network include: constructing an initial multimodal semantic segmentation model, the initial multimodal semantic segmentation model including an initial dual-stream feature extraction module and a gated complementary feature fusion module, the initial dual-stream feature extraction module including an initial dominant stream branch and an initial auxiliary stream branch, the initial dominant stream branch having an initial first feature extraction layer and an initial first output layer connected in sequence, and the initial auxiliary stream branch having an initial second feature extraction layer and an initial second output layer connected in sequence; obtaining a training sample set, the training sample set including color modal image samples and corresponding X modal image samples; performing feature extraction on the color modal image samples through the initial first feature extraction layer to obtain training dominant features; performing feature extraction on the X modal image samples through the initial second feature extraction layer to obtain training auxiliary features; processing the training dominant features through the gated complementary feature fusion module to obtain a training dominant channel descriptor vector, and processing the training auxiliary features to obtain a training auxiliary channel descriptor vector; calculating training enhancement dominant features based on the training dominant channel descriptor vector, the training auxiliary channel descriptor vector, the training dominant features, and the training auxiliary features; processing the training enhancement dominant features through the initial first output layer to obtain a first training segmentation result; processing the training auxiliary features through the initial second output layer to obtain a second training segmentation result; and adjusting the parameters of the initial dominant stream branch and the initial auxiliary stream branch according to a preset target loss function, the first training segmentation result, and the second training segmentation result, to obtain the trained multimodal semantic segmentation network.
[0011] In one embodiment of this application, the target loss function is calculated according to the following steps: calculating the auxiliary modality components contained in the training dominant features based on the training auxiliary channel descriptor vector and the training enhancement dominant features; calculating a first loss function based on the auxiliary modality components and the training auxiliary features; generating a semantic mask based on the first training segmentation result; calculating a second loss function based on the semantic mask, the first training segmentation result, and the second training segmentation result; and obtaining the target loss function based on the first loss function, the second loss function, and the segmentation cross-entropy loss.
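The way the three loss terms are combined is not spelled out in this passage. As a minimal sketch, assuming a simple weighted sum, the target loss could be assembled as follows. The weights `w1` and `w2`, the function names, and the toy values for the first and second loss functions are hypothetical and serve only to illustrate the structure.

```python
import math


def cross_entropy(probs, label):
    """Cross-entropy of one pixel given its class probabilities and true label."""
    return -math.log(max(probs[label], 1e-12))


def target_loss(seg_ce, first_loss, second_loss, w1=0.1, w2=0.1):
    """Assumed combination: segmentation cross-entropy plus weighted auxiliary terms.

    w1 and w2 are hypothetical balancing weights, not values from the source.
    """
    return seg_ce + w1 * first_loss + w2 * second_loss


# Toy example: two pixels, three classes.
pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
seg_ce = sum(cross_entropy(p, y) for p, y in zip(pred, labels)) / len(labels)

# Placeholder values stand in for the first and second loss functions.
total = target_loss(seg_ce, first_loss=0.5, second_loss=0.2)
```

In a real training loop the two placeholder values would come from the auxiliary-component term and the semantic-mask term described above.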
[0012] In one embodiment of this application, the first output layer includes a first decoder and a first segmentation head; the step of performing feature segmentation processing on the enhanced dominant features through the first output layer to obtain the first segmentation result includes: using the first decoder to perform feature decoding processing on the enhanced dominant features to obtain dominant modality decoded features; and using the first segmentation head to perform predictive segmentation processing on the dominant modality decoded features to obtain the first segmentation result.
[0013] On the other hand, embodiments of this application provide an electronic device, including: at least one processor; and at least one memory for storing at least one program; wherein, when the at least one program is executed by the at least one processor, the multimodal semantic segmentation method as described above is implemented.
[0014] On the other hand, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions for performing the multimodal semantic segmentation method as described above.
[0015] This application provides a multimodal semantic segmentation method, electronic device, and computer-readable storage medium based on asymmetric fusion. The method first acquires a color modal image to be processed and its corresponding X modal image. Then, these two modal images are input into a pre-trained multimodal semantic segmentation network for image processing to obtain the corresponding multimodal semantic segmentation map. The multimodal semantic segmentation network includes a dual-stream feature extraction module and a gated complementary feature fusion module. The dual-stream feature extraction module has a dominant stream branch and an auxiliary stream branch. The dominant stream branch has a first feature extraction layer and a first output layer connected in sequence, and the auxiliary stream branch has a second feature extraction layer and a second output layer connected in sequence. One end of the gated complementary feature fusion module is unidirectionally connected to the second feature extraction layer, and the other end is bidirectionally connected to the first feature extraction layer. In the image processing process of this multimodal semantic segmentation network, features are first extracted from the color modality image through a first feature extraction layer to obtain the original dominant features, and features are extracted from the X modality image through a second feature extraction layer to obtain the original auxiliary features. Then, the original dominant features are processed by a gated complementary feature fusion module to obtain the dominant channel descriptor vector, and the original auxiliary features are processed to obtain the auxiliary channel descriptor vector. Next, based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant features, and the original auxiliary features, the enhanced dominant features are calculated. 
Subsequently, the enhanced dominant features are segmented by a first output layer to obtain the first segmentation result, and the original auxiliary features are segmented by a second output layer to obtain the second segmentation result. Based on the first and second segmentation results, the corresponding multimodal semantic segmentation map can be obtained. In this embodiment, the connection between the gated complementary feature fusion module and the first and second feature extraction layers establishes a unidirectional collaborative working mode between the dominant and auxiliary features. The gated complementary feature fusion module extracts features from the X modality, which can enhance the dominant features with auxiliary features while ensuring that the dominant features are not destroyed. At the same time, it suppresses the transmission of noise in the X modality, thereby improving the accuracy and robustness of image semantic segmentation in complex environments.
Attached Figure Description
[0016] Figure 1 is a flowchart of a multimodal semantic segmentation method provided in one embodiment of this application;
Figure 2 is an overall architecture diagram of a multimodal semantic segmentation network provided in one embodiment of this application;
Figure 3 is a structural diagram of a gated complementary feature fusion module provided in one embodiment of this application;
Figure 4 is a detailed flowchart of step 150 in Figure 1, provided in one embodiment of this application;
Figure 5 is a flowchart of the training process of a multimodal semantic segmentation network provided in one embodiment of this application;
Figure 6 is a flowchart illustrating the construction of the target loss function according to one embodiment of this application;
Figure 7 is a flowchart illustrating the training logic of a multimodal semantic segmentation network provided in one embodiment of this application.
Detailed Implementation
[0017] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0018] It should be noted that although the flowchart shows a logical order, in some cases, the steps shown or described may be performed in a different order than that shown in the flowchart. The terms "first," "second," etc., used in the specification, claims, and the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the structures, proportions, sizes, etc., depicted in the drawings are only used to complement the content disclosed in the specification for those skilled in the art to understand and read, and are not intended to limit the implementation conditions of this application. Therefore, they have no substantial technical significance. Any modifications to the structure, changes in proportions, or adjustments to size, without affecting the effects and purposes achieved by this application, should still fall within the scope of the technical content disclosed in this application. Similarly, the terms such as "upper," "lower," "left," "right," "middle," and "one" used in this specification are only for clarity of description and are not used to limit the scope of implementation of this application. Changes or adjustments in their relative relationships, without substantially altering the technical content, should also be considered within the scope of implementation of this application.
[0019] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0020] Image semantic segmentation requires pixel-level classification and understanding of scenes. While segmentation methods based on convolutional neural networks and Transformer architectures have achieved good results in conventional environments during the development of deep learning technology, they struggle to perform stably in complex scenes such as low-light conditions, adverse weather, and visual camouflage, where relying solely on information provided by the red-green-blue (RGB) modality is insufficient for stable perception tasks. To compensate for the lack of information in a single RGB modality, existing technologies typically introduce auxiliary modalities (X-modalities) such as thermal imaging or depth information to form multimodal segmentation schemes. Most current multimodal semantic segmentation methods employ symmetrical feature fusion, which fails to consider the differences in signal-to-noise ratios between different modalities. Background noise in the auxiliary modality can be transmitted to the RGB modality through bidirectional interaction, contaminating RGB semantic features and affecting the overall segmentation performance. Furthermore, conventional linear fusion methods cannot effectively filter auxiliary information, making it difficult to fully utilize effective auxiliary information while protecting the dominant modality features. This results in segmentation accuracy and robustness in complex scenes failing to meet practical requirements.
[0021] In view of this, embodiments of this application provide a multimodal semantic segmentation method, electronic device, and computer-readable storage medium based on asymmetric fusion. The method first acquires the color modality image to be processed and the corresponding X modality image, and then inputs these two modality images into a pre-trained multimodal semantic segmentation network for image processing to obtain the corresponding multimodal semantic segmentation map. This multimodal semantic segmentation network includes a dual-stream feature extraction module and a gated complementary feature fusion module. The dual-stream feature extraction module has a dominant stream branch and an auxiliary stream branch. The dominant stream branch has a first feature extraction layer and a first output layer connected in sequence, and the auxiliary stream branch has a second feature extraction layer and a second output layer connected in sequence. One end of the gated complementary feature fusion module is unidirectionally connected to the second feature extraction layer, and the other end is bidirectionally connected to the first feature extraction layer. In the image processing process of this multimodal semantic segmentation network, features are first extracted from the color modality image through a first feature extraction layer to obtain the original dominant features, and features are extracted from the X modality image through a second feature extraction layer to obtain the original auxiliary features. Then, the original dominant features are processed by a gated complementary feature fusion module to obtain the dominant channel descriptor vector, and the original auxiliary features are processed to obtain the auxiliary channel descriptor vector. 
Next, based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant features, and the original auxiliary features, the enhanced dominant features are calculated. Subsequently, the enhanced dominant features are segmented by a first output layer to obtain the first segmentation result, and the original auxiliary features are segmented by a second output layer to obtain the second segmentation result. Based on the first and second segmentation results, the corresponding multimodal semantic segmentation map can be obtained. In this embodiment, the connection between the gated complementary feature fusion module and the first and second feature extraction layers establishes a unidirectional collaborative working mode between the dominant and auxiliary features. The gated complementary feature fusion module extracts features from the X modality, which can enhance the dominant features with auxiliary features while ensuring that the dominant features are not destroyed. At the same time, it suppresses the transmission of noise in the X modality, thereby improving the accuracy and robustness of image semantic segmentation in complex environments.
[0022] The embodiments of this application will be further described below with reference to the accompanying drawings.
[0023] Referring to Figure 1, which is a flowchart of a multimodal semantic segmentation method provided in one embodiment of this application, the process may include, but is not limited to, steps 110 to 170.
[0024] Step 110: Obtain the color modal image to be processed and the corresponding X modal image;
Step 120: Input the color modal image and the X modal image into a pre-trained multimodal semantic segmentation network. The multimodal semantic segmentation network has a dual-stream feature extraction module and a gated complementary feature fusion module. The dual-stream feature extraction module has a dominant stream branch and an auxiliary stream branch. The dominant stream branch has a first feature extraction layer and a first output layer connected in sequence. The auxiliary stream branch has a second feature extraction layer and a second output layer connected in sequence. One end of the gated complementary feature fusion module is unidirectionally connected to the second feature extraction layer, and the other end is bidirectionally connected to the first feature extraction layer;
Step 130: Extract features from the color modal image through the first feature extraction layer to obtain the original dominant features; extract features from the X modal image through the second feature extraction layer to obtain the original auxiliary features;
Step 140: Process the original dominant features through the gated complementary feature fusion module to obtain the dominant channel descriptor vector, and process the original auxiliary features to obtain the auxiliary channel descriptor vector;
Step 150: Calculate the enhanced dominant features based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant features, and the original auxiliary features;
Step 160: Perform feature segmentation on the enhanced dominant features through the first output layer to obtain the first segmentation result; perform feature segmentation on the original auxiliary features through the second output layer to obtain the second segmentation result;
Step 170: Based on the first segmentation result and the second segmentation result, obtain the multimodal semantic segmentation map.
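The data flow of steps 130 to 170 can be sketched as below. This is only an illustration under stated assumptions: the dictionary keys, the function names, and the toy stand-in network are hypothetical, and averaging the two per-pixel score sets in step 170 is an assumed rule, since this passage does not specify how the two segmentation results are combined.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def combine_segmentations(first_scores, second_scores, aux_weight=0.5):
    """Step 170 (assumed rule): mix per-pixel class scores, then take the argmax."""
    labels = []
    for row1, row2 in zip(first_scores, second_scores):
        out_row = []
        for s1, s2 in zip(row1, row2):
            mixed = [(1 - aux_weight) * a + aux_weight * b for a, b in zip(s1, s2)]
            out_row.append(argmax(mixed))
        labels.append(out_row)
    return labels


def segment(rgb, x_img, net):
    """Run steps 130-170 with a network given as a dict of callables (hypothetical)."""
    dom = net["extract_dominant"](rgb)            # step 130, dominant stream
    aux = net["extract_auxiliary"](x_img)         # step 130, auxiliary stream
    enhanced = net["fuse"](dom, aux)              # steps 140-150, gated fusion
    first = net["first_output"](enhanced)         # step 160, first segmentation result
    second = net["second_output"](aux)            # step 160, second segmentation result
    return combine_segmentations(first, second)   # step 170


def to_scores(feat):
    """Toy output layer: turn each scalar pixel v into two class scores [1 - v, v]."""
    return [[[1.0 - v, v] for v in row] for row in feat]


# Toy stand-ins (hypothetical): 1x2 "images" whose pixels are single floats.
toy_net = {
    "extract_dominant": lambda img: img,
    "extract_auxiliary": lambda img: img,
    "fuse": lambda dom, aux: dom,
    "first_output": to_scores,
    "second_output": to_scores,
}
label_map = segment([[0.9, 0.1]], [[0.8, 0.4]], toy_net)
```

Note that the second output layer receives the original auxiliary features rather than the enhanced ones, matching step 160.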
[0025] As can be understood, color modal images refer to conventional visible light images (i.e., RGB images) formed by combining the red, green, and blue channels, such as everyday photos of natural scenes and visible light images captured by surveillance equipment. The corresponding X-modal images refer to scene feature images obtained through non-visible light sensing methods, such as thermal images reflecting the temperature distribution of objects, depth images reflecting spatial distance information, LiDAR point cloud images characterizing three-dimensional spatial structures, and event camera data images recording changes in light.
[0026] In one feasible embodiment, when the acquired color modal image is an outdoor scene visible light image, its corresponding X-modal image can be a thermal imaging image reflecting the temperature distribution of the scene. When the acquired color modal image is an indoor environment visible light image, its corresponding X-modal image can be a depth image reflecting the spatial distance information of the scene. When the acquired color modal image is a road driving visible light image, its corresponding X-modal image can be a LiDAR point cloud image characterizing the three-dimensional structure of the road surface and obstacles. When the acquired color modal image is a high-speed moving scene visible light image, its corresponding X-modal image can be an event camera data image recording changes in scene brightness.
[0027] In one feasible embodiment, as shown in Figure 2, the overall architecture of the multimodal semantic segmentation network mainly includes a dual-stream feature extraction module and a gated complementary feature fusion module. The dual-stream feature extraction module has a physically parallel but logically asymmetric dual-stream structure, namely a dominant stream branch and an auxiliary stream branch. The dominant stream branch is used to process high signal-to-noise ratio RGB images and is responsible for constructing the basic semantic skeleton of the scene; the auxiliary stream branch is used to process X-modal images to capture specific physical cues. Furthermore, the dominant stream branch has a first feature extraction layer and a first output layer connected in sequence, and the auxiliary stream branch has a second feature extraction layer and a second output layer connected in sequence. Each stage of the first and second feature extraction layers has a corresponding gated complementary feature fusion module. One end of this module is bidirectionally connected to the corresponding stage of the first feature extraction layer, and the other end is unidirectionally connected to the same stage of the second feature extraction layer. Through this connection architecture, the network establishes a one-way information flow channel between the two branches: information is allowed to flow only from the auxiliary stream branch to the dominant stream branch, and the semantic information of the dominant stream is strictly prohibited from being transmitted back to the auxiliary stream. This ensures that the auxiliary stream branch can independently extract physical features and avoids being assimilated by the strong semantics of the dominant stream branch.
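The stage-wise, one-way information flow described above can be illustrated with a short sketch. The stage functions and the additive fusion rule are hypothetical placeholders; the point is only that the auxiliary features are read by the fusion step but never overwritten by it, so the auxiliary output does not depend on the dominant input.

```python
def run_stages(dom, aux, dom_stages, aux_stages, fusers):
    """Run paired encoder stages with auxiliary-to-dominant injection only."""
    for dom_stage, aux_stage, fuse in zip(dom_stages, aux_stages, fusers):
        dom = dom_stage(dom)   # dominant stream stage
        aux = aux_stage(aux)   # auxiliary stream stage, sees auxiliary data only
        dom = fuse(dom, aux)   # auxiliary features are read here, never modified
    return dom, aux


# Hypothetical toy stages operating on flat feature lists.
def double(feat):
    return [2 * v for v in feat]


def shift(feat):
    return [v + 1 for v in feat]


def inject(dom, aux):
    return [d + 0.5 * a for d, a in zip(dom, aux)]


stages = ([double, double], [shift, shift], [inject, inject])
dom_a, aux_a = run_stages([1.0, 2.0], [0.0, 1.0], *stages)
dom_b, aux_b = run_stages([10.0, 20.0], [0.0, 1.0], *stages)
# aux_a and aux_b are identical even though the dominant inputs differ.
```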
[0028] In a feasible embodiment, both the first feature extraction layer and the second feature extraction layer can use a hybrid Transformer as the backbone network.
[0029] In one feasible embodiment, the first output layer includes a first decoder and a first segmentation head, and the second output layer includes a second decoder and a second segmentation head. The first decoder and the first segmentation head work together to generate a first segmentation result based on the dominant features output by the first feature extraction layer; the second decoder and the second segmentation head work together to generate a second segmentation result based on the auxiliary features output by the second feature extraction layer.
[0030] In one feasible embodiment, as shown in Figure 3, the gated complementary feature fusion module comprises a first feature processing unit, a second feature processing unit, a gated weight generation unit, and a feature enhancement unit. The first feature processing unit includes a first pooling layer and a first multilayer perceptron connected in sequence. The second feature processing unit includes a second pooling layer and a second multilayer perceptron connected in sequence. Both the first and second pooling layers include interconnected max-pooling and average-pooling layers. The output of the average-pooling layer is connected to the input of the corresponding multilayer perceptron. The outputs of the first and second multilayer perceptrons are connected to the gated weight generation unit. The feature enhancement unit includes a 1×1 convolutional layer and a sigmoid function layer, and the gated weight generation unit is connected to this feature enhancement unit.
[0031] In a feasible embodiment, in the process of performing feature processing on the original dominant features to obtain the dominant channel descriptor vector, the original dominant features can first be pooled through a first pooling layer to obtain dominant pooled features; then, the dominant pooled features can be classified through a first multilayer perceptron to obtain the dominant channel descriptor vector.
[0032] In a feasible embodiment, during the process of performing feature processing on the original auxiliary features to obtain the auxiliary channel descriptor vector, the original auxiliary features can first be pooled through a second pooling layer to obtain auxiliary pooled features; then, the auxiliary pooled features can be classified through a second multilayer perceptron to obtain the auxiliary channel descriptor vector.
[0033] It should be understood that a channel descriptor vector refers to vector data obtained after global information aggregation and feature representation of features, which can be used to represent image features or global contextual information contained in feature maps.
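As a minimal sketch of how a channel descriptor vector might be computed, the code below pools a C×H×W feature map globally and passes the pooled vectors through a small perceptron. Applying the same perceptron to the average-pooled and max-pooled vectors and summing the outputs is an assumption borrowed from common channel-attention designs, and the weight matrices `w1` and `w2` are hypothetical; the exact wiring is defined by the source's own formulas.

```python
def global_avg_pool(feature):
    """Average each channel of a C x H x W nested list down to one value."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def global_max_pool(feature):
    """Take the maximum of each channel of a C x H x W nested list."""
    return [max(max(row) for row in ch) for ch in feature]


def mlp(vec, w1, w2):
    """Two-layer perceptron with ReLU; w1 is hidden x C, w2 is C x hidden."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_descriptor(feature, w1, w2):
    """Assumed wiring: perceptron applied to both pooled vectors, outputs summed."""
    avg_out = mlp(global_avg_pool(feature), w1, w2)
    max_out = mlp(global_max_pool(feature), w1, w2)
    return [a + m for a, m in zip(avg_out, max_out)]


# Toy feature map: 2 channels, 2 x 2 pixels each.
feature = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, 0.0], [0.0, 1.0]],
]
w1 = [[1.0, -1.0]]     # hypothetical weights: 1 hidden unit, 2 input channels
w2 = [[1.0], [-1.0]]   # hypothetical weights: 2 output channels, 1 hidden unit
descriptor = channel_descriptor(feature, w1, w2)
```

The resulting vector has one signed entry per channel, which is what the later sign-correlation step operates on.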
[0034] In one feasible embodiment, the dominant modality input with a high signal-to-noise ratio (such as an RGB image) and the auxiliary modality input (i.e., the X-modal image, such as a thermal image or depth image) are fed into physically parallel but logically asymmetric encoders (i.e., the feature extraction layers). Specifically, the dominant modality input is fed into the first feature extraction layer, and the auxiliary modality input is fed into the second feature extraction layer. At each stage of feature extraction, the auxiliary stream features can be injected unidirectionally into the dominant stream, subsequently generating enhanced dominant features. The auxiliary stream itself does not receive any feedback, which ensures that the auxiliary features maintain an independent distribution of physical characteristics.
[0035] In one feasible embodiment, after the features of the dominant stream branch (i.e., the original dominant features) and the features of the auxiliary stream branch (i.e., the original auxiliary features) are input into the gated complementary feature fusion module, the first pooling layer of this module can perform global average pooling and global max pooling on the original dominant features, and the first multilayer perceptron can then process the pooled features to extract the dominant channel descriptor vector, which represents the global contextual information contained in the dominant features. Similarly, the second pooling layer can perform global average pooling and global max pooling on the original auxiliary features, and the second multilayer perceptron can then process the pooled features to extract the auxiliary channel descriptor vector, which represents the global contextual information contained in the auxiliary features.
[0036] In one feasible embodiment, the dominant channel descriptor vector is calculated as shown in equation (1):
[0037] The auxiliary channel descriptor vector is calculated as shown in equation (2):
[0038] In one feasible embodiment, after the dominant channel descriptor vector and the auxiliary channel descriptor vector have been calculated, the enhanced dominant features can be calculated based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant features, and the original auxiliary features. These enhanced dominant features improve the network's ability to extract key semantic information. For example, as shown in Figure 4, the execution process of step 150 may include, but is not limited to, steps 410 to 440.
[0039] Step 410: Calculate the gating weights based on the dominant channel descriptor vector and the auxiliary channel descriptor vector;
Step 420: Calculate the preliminary fusion features based on the gating weights, the original dominant features, and the original auxiliary features;
Step 430: Perform spatial attention calculation on the preliminary fusion features to obtain a spatial attention map;
Step 440: Calculate the enhanced dominant features based on the spatial attention map and the preliminary fusion features.
[0040] In one feasible embodiment, the element-wise product of the dominant channel descriptor vector and the auxiliary channel descriptor vector can first be calculated to determine their sign correlation. If the product for a certain channel is negative, the activation directions of the two modalities in that dimension are opposite, that is, the modalities are complementary in that channel; otherwise, the channel is regarded as exhibiting feature conflict or redundancy.
[0041] In one feasible embodiment, the Sigmoid function can be applied to the element-wise product to generate dynamic gating weights, so that the weights of complementary features approach 1 while the weights of noisy features approach 0. The calculation formula is shown in equation (3):
[0042] where σ represents the Sigmoid activation function, and ⊙ represents element-wise multiplication.
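The sign-correlation gating can be illustrated with a short sketch. The exact form of equation (3) is not recoverable from the text; since a negative product is said to indicate complementarity and complementary channels are to receive weights near 1, the sketch assumes the product is negated before the Sigmoid, i.e. W = σ(−s · (d ⊙ a)). The negation, the `scale` parameter s, and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gating_weights(dominant_desc, auxiliary_desc, scale=1.0):
    """Per-channel gate, assumed form: W_c = sigmoid(-scale * d_c * a_c).

    Opposite signs (negative product) -> weight near 1 (complementary).
    Same signs (positive product)     -> weight near 0 (conflict or redundancy).
    """
    return [sigmoid(-scale * d * a) for d, a in zip(dominant_desc, auxiliary_desc)]
```

For instance, with descriptors `[2.0, 2.0]` and `[-3.0, 3.0]`, the first channel has a negative product and is gated open, while the second has a positive product and is largely suppressed.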
[0043] In one feasible embodiment, the original auxiliary features, after being filtered by the gating weights, are injected into the dominant space to obtain the preliminary fusion features. The calculation formula is shown in equation (4):
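A minimal sketch of the gated injection follows. The exact form of equation (4) is not recoverable from the text; the sketch assumes additive injection, in which each auxiliary channel is scaled by its gate and added to the matching dominant channel. The function name is hypothetical.

```python
def preliminary_fusion(dominant_feat, auxiliary_feat, gate):
    """Assumed form: fused[c][i][j] = dominant[c][i][j] + gate[c] * auxiliary[c][i][j].

    Features are nested lists (channels x height x width); gate has one value per channel.
    The inputs are not modified, so the auxiliary features stay untouched.
    """
    return [
        [[d + w * a for d, a in zip(d_row, a_row)]
         for d_row, a_row in zip(d_ch, a_ch)]
        for d_ch, a_ch, w in zip(dominant_feat, auxiliary_feat, gate)
    ]
```

A channel whose gate is 0 passes the dominant feature through unchanged, which is how noisy auxiliary channels would be kept out of the dominant stream under this assumption.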
[0044] In one feasible embodiment, to enhance spatial saliency, spatial attention can be calculated on the preliminary fusion features to obtain a spatial attention map. Subsequently, the enhanced dominant features can be calculated according to the spatial attention map and the preliminary fusion features. The calculation formulas are shown in equations (5) and (6):
[0045]
[0046] Through the above formulas (1) to (6), automatic filtering of auxiliary noise and spatial enhancement of effective information can be achieved.
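The spatial attention step can be sketched as follows. Equations (5) and (6) are not recoverable from the text, so this sketch assumes a common design: the per-pixel mean and maximum across channels are mixed (here by two scalar weights standing in for a learned convolution), passed through a Sigmoid to form the attention map, and multiplied back onto the preliminary fusion features. The mixing weights and function names are hypothetical.

```python
import math

def spatial_attention_map(fused, w_avg=1.0, w_max=1.0, bias=0.0):
    """Return an H x W map in (0, 1) computed from channel-wise mean and max."""
    channels = len(fused)
    height, width = len(fused[0]), len(fused[0][0])
    attn = []
    for i in range(height):
        row = []
        for j in range(width):
            values = [fused[c][i][j] for c in range(channels)]
            # Scalar mix stands in for a learned convolution (assumption).
            mixed = w_avg * sum(values) / channels + w_max * max(values) + bias
            row.append(1.0 / (1.0 + math.exp(-mixed)))
        attn.append(row)
    return attn

def enhance(fused, attn):
    """Assumed form: enhanced[c][i][j] = attn[i][j] * fused[c][i][j]."""
    return [[[a * v for a, v in zip(a_row, v_row)]
             for a_row, v_row in zip(attn, ch)] for ch in fused]
```

Whether the actual method adds a residual connection after the multiplication cannot be determined from the text; the sketch omits one.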
[0047] In a feasible embodiment, after calculating the enhanced dominant features, feature processing can be performed on the enhanced dominant features through a first output layer to obtain a first segmentation result. This process includes: firstly, using a first decoder to perform feature decoding processing on the enhanced dominant features to obtain dominant modality decoded features; then, using a first segmentation head to perform predictive segmentation processing on the dominant modality decoded features to obtain the first segmentation result. Specifically, the enhanced dominant features are first upsampled and reconstructed using the first decoder to obtain a high-level semantic feature map (i.e., dominant modality decoded features), and then the high-level semantic feature map is classified and pixel mapped using the first segmentation head to obtain the first segmentation result.
[0048] In a feasible embodiment, during the process of obtaining the second segmentation result by performing feature processing on the original auxiliary features through the second output layer, the original auxiliary features can first be decoded using the second decoder to obtain auxiliary modality decoded features; then, the auxiliary modality decoded features can be predicted and segmented using the second segmentation head to obtain the second segmentation result. Specifically, the original auxiliary features are first upsampled and reconstructed using the second decoder to obtain an auxiliary semantic feature map (i.e., auxiliary modality decoded features), and then the auxiliary semantic feature map is classified and pixel mapped using the second segmentation head to obtain the second segmentation result.
[0049] In a feasible embodiment, during the process of obtaining a multimodal semantic segmentation map based on the first segmentation result and the second segmentation result, an additive integration strategy can be adopted to superimpose the first segmentation result output by the dominant flow branch and the second segmentation result output by the auxiliary flow branch at the pixel level, and obtain the final multimodal semantic segmentation map through the information complementarity of the two types of segmentation results.
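The additive integration strategy can be illustrated with a small sketch. It assumes that the two segmentation results are per-class score maps of identical shape, and that the final label at each pixel is the class with the highest summed score; the text specifies only pixel-level superposition, so the final arg-max step and the function name are assumptions.

```python
def additive_ensemble(first_scores, second_scores):
    """Sum two K x H x W score maps pixel by pixel and return an H x W label map."""
    num_classes = len(first_scores)
    height, width = len(first_scores[0]), len(first_scores[0][0])
    label_map = []
    for i in range(height):
        row = []
        for j in range(width):
            summed = [first_scores[k][i][j] + second_scores[k][i][j]
                      for k in range(num_classes)]
            # Pick the class index with the largest summed score.
            row.append(max(range(num_classes), key=summed.__getitem__))
        label_map.append(row)
    return label_map
```

In the example used in the test below, the second branch overturns the first branch's weak preference at one pixel, which is the kind of complementarity the paragraph describes.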
[0050] Referring to Figure 5, the training process of the multimodal semantic segmentation network in this embodiment may include, but is not limited to, steps 510 to 590.
[0051] Step 510: Construct the initial multimodal semantic segmentation model;
Step 520: Obtain the training sample set, which includes color modality image samples and corresponding X-modality image samples;
Step 530: Extract features from the color modality image samples through the initial first feature extraction layer to obtain the training dominant features;
Step 540: Extract features from the X-modality image samples through the initial second feature extraction layer to obtain the training auxiliary features;
Step 550: Perform feature processing on the training dominant features through the gated complementary feature fusion module to obtain the training dominant channel descriptor vector, and perform feature processing on the training auxiliary features to obtain the training auxiliary channel descriptor vector;
Step 560: Calculate the training enhanced dominant features based on the training dominant channel descriptor vector, the training auxiliary channel descriptor vector, the training dominant features, and the training auxiliary features;
Step 570: Perform feature processing on the training enhanced dominant features through the initial first output layer to obtain the first training segmentation result;
Step 580: Perform feature processing on the training auxiliary features through the initial second output layer to obtain the second training segmentation result;
Step 590: Based on the preset target loss function, the first training segmentation result, and the second training segmentation result, adjust the parameters of the initial dominant flow branch and the initial auxiliary flow branch to obtain the trained multimodal semantic segmentation network.
[0052] In a feasible embodiment, the initial multimodal semantic segmentation model includes an initial dual-stream feature extraction module and a gated complementary feature fusion module. The initial dual-stream feature extraction module includes an initial dominant stream branch and an initial auxiliary stream branch. The initial dominant stream branch has an initial first feature extraction layer and an initial first output layer connected in sequence, and the initial auxiliary stream branch has an initial second feature extraction layer and an initial second output layer connected in sequence. Each stage of the initial first feature extraction layer and the initial second feature extraction layer is provided with a corresponding gated complementary feature fusion module. One end of the module is bidirectionally connected to the corresponding stage of the initial first feature extraction layer, and the other end is unidirectionally connected to the same stage of the initial second feature extraction layer.
[0053] In a feasible embodiment, after the training sample set (i.e., color modal image samples and corresponding X modal image samples) is input into the initial multimodal semantic segmentation model, the processing in steps 530 to 580 is similar to the processing in steps 130 to 160 described above. The specific processing flow can be referred to the relevant embodiments above, and will not be repeated here.
[0054] In a feasible embodiment, the target loss function is primarily used to ensure that, during the training of the multimodal semantic segmentation model, the dominant flow branch can effectively guide the auxiliary flow branch, while its own parameter updates are not disturbed by the uncertainty of the auxiliary information. For example, as shown in Figure 6, the target loss function can be calculated according to steps 610 to 650.
[0055] Step 610: Calculate the auxiliary modality components contained in the training dominant features based on the training auxiliary channel descriptor vector and the training enhanced dominant features;
Step 620: Calculate the first loss function based on the auxiliary modality components and the training auxiliary features;
Step 630: Generate a semantic mask based on the first training segmentation result;
Step 640: Calculate the second loss function based on the semantic mask, the first training segmentation result, and the second training segmentation result;
Step 650: Obtain the target loss function based on the first loss function, the second loss function, and the segmentation cross-entropy loss.
[0056] In one feasible embodiment, the process mainly involves constructing a first loss function using an asymmetric feature alignment (AFA) strategy and a second loss function using a region-aware decision alignment (RDA) strategy.
[0057] In this embodiment, the sign consistency between the training auxiliary channel descriptor vector and the training enhanced dominant features is first used to generate a mask, and the latent auxiliary modality components in the dominant features are extracted as the learning target for the auxiliary flow branch, which is forced to fit this target. The calculation process is shown in equation (7):
[0058] Then, the first loss function is calculated based on the auxiliary modality components and the training auxiliary features, forcing the training auxiliary features to fit the target while blocking the gradient of the target term. The calculation process is shown in equation (8):
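The asymmetric feature alignment (AFA) computation can be sketched numerically as follows. Equations (7) and (8) are not recoverable from the text, so the sketch assumes that the mask keeps an element of the enhanced dominant features when its sign agrees with the sign of the auxiliary channel descriptor for that channel, and that the loss is the mean squared (L2) difference between the auxiliary features and the masked target. The function names are hypothetical. In an automatic differentiation framework the target would additionally be excluded from back-propagation (for example with a stop-gradient or detach operation); plain Python has no gradients, so this is noted only in a comment.

```python
def afa_target(aux_desc, enhanced):
    """Keep enhanced[c][i][j] where it has the same sign as aux_desc[c], else 0."""
    return [[[v if d * v > 0 else 0.0 for v in row] for row in ch]
            for d, ch in zip(aux_desc, enhanced)]

def afa_loss(aux_feat, target):
    """Mean squared error between auxiliary features and the target.

    The target is treated as a constant: in a training framework its gradient
    would be blocked so that only the auxiliary branch is updated by this loss.
    """
    sq_sum, count = 0.0, 0
    for a_ch, t_ch in zip(aux_feat, target):
        for a_row, t_row in zip(a_ch, t_ch):
            for a, t in zip(a_row, t_row):
                sq_sum += (a - t) ** 2
                count += 1
    return sq_sum / count
```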
[0059] Next, a semantic mask is generated using the high-confidence prediction results of the dominant flow branch (i.e., the first training segmentation result). The loss of the auxiliary flow branch is calculated only in regions that the dominant flow branch considers reliable (such as vehicles and roads), while its prediction is ignored in background regions (such as the sky), thereby suppressing the background noise of the auxiliary modality. The index of the maximum predicted probability of the dominant flow branch is used to generate the high-confidence semantic mask. The calculation process is shown in equation (9):
[0060] This mask is used to constrain the prediction of the auxiliary flow branch (i.e., the second training segmentation result), and the L1 loss is calculated only in reliable regions to obtain the second loss function:
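The region-aware decision alignment (RDA) computation can be sketched as follows. The text does not state how reliable regions are selected, so the sketch assumes a pixel is reliable when the dominant branch's most probable class is not in a hypothetical set of ignored background classes and its probability reaches a hypothetical confidence threshold. The L1 loss is then averaged over the masked pixels only. All names, the threshold, and the ignored class index are assumptions.

```python
def semantic_mask(dominant_probs, ignored_classes=(0,), threshold=0.9):
    """Return an H x W 0/1 mask from K x H x W dominant-branch probabilities."""
    num_classes = len(dominant_probs)
    mask = []
    for i in range(len(dominant_probs[0])):
        row = []
        for j in range(len(dominant_probs[0][0])):
            probs = [dominant_probs[k][i][j] for k in range(num_classes)]
            best = max(range(num_classes), key=probs.__getitem__)
            reliable = best not in ignored_classes and probs[best] >= threshold
            row.append(1 if reliable else 0)
        mask.append(row)
    return mask

def rda_loss(mask, dominant_probs, auxiliary_probs):
    """Mean absolute difference between the two predictions inside the mask."""
    total, count = 0.0, 0
    for dom_ch, aux_ch in zip(dominant_probs, auxiliary_probs):
        for m_row, d_row, a_row in zip(mask, dom_ch, aux_ch):
            for m, d, a in zip(m_row, d_row, a_row):
                if m:
                    total += abs(a - d)
                    count += 1
    return total / count if count else 0.0
```

Pixels outside the mask contribute nothing, so auxiliary predictions in background regions are neither rewarded nor penalized by this term.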
[0061] Subsequently, the target loss function can be obtained by performing a weighted calculation on the first loss function, the second loss function, and the basic segmentation cross-entropy loss:
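The weighted combination can be sketched as below. The exact formula is not recoverable from the text; the sketch assumes the cross-entropy term has unit weight and the two alignment terms each have their own coefficient (named `lambda_afa` and `lambda_rda` here, both hypothetical names). A simple pixel-wise cross-entropy over predicted probabilities is included for completeness.

```python
import math

def segmentation_cross_entropy(probs, labels):
    """probs: K x H x W class probabilities; labels: H x W ground-truth class indices."""
    total, count = 0.0, 0
    for i, row in enumerate(labels):
        for j, k in enumerate(row):
            total -= math.log(probs[k][i][j])
            count += 1
    return total / count

def target_loss(ce_loss, afa_loss_value, rda_loss_value,
                lambda_afa=1.0, lambda_rda=1.0):
    """Assumed form: L = L_ce + lambda_afa * L_afa + lambda_rda * L_rda."""
    return ce_loss + lambda_afa * afa_loss_value + lambda_rda * rda_loss_value
```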
[0062] Referring to Figure 7, which is a flowchart illustrating the training logic of the multimodal semantic segmentation network provided in one embodiment of this application, the process connects in sequence the complete steps from data input, feature extraction, gated fusion, and alignment training to final inference output. Specifically, the visible light image and the corresponding X-modality image are first fed as input into the dominant flow branch and the auxiliary flow branch, respectively. During the feature extraction stage, each stage of both branches is provided with a corresponding gated complementary feature fusion module. One end of this module is bidirectionally connected to the corresponding stage of the dominant feature extraction layer, and the other end is unidirectionally connected to the same stage of the auxiliary feature extraction layer. Through unidirectional injection, high-value features of the auxiliary modality are injected into the dominant flow branch, while the dominant features are protected from interference by auxiliary modality noise. Subsequently, the dominant and auxiliary features, after feature extraction and gated fusion, enter the first output layer and the second output layer, respectively. During the decoding stage, through the asymmetric feature alignment and region-aware decision alignment mechanisms, the auxiliary flow branch and the dominant flow branch collaborate in the feature space and the decision space, ensuring that auxiliary information can effectively enhance the dominant semantic features while the uncertainty of the auxiliary modality does not interfere with the parameter updates of the dominant branch. Finally, the first and second output layers generate the first and second segmentation results, respectively. An additive integration strategy is then used to superimpose the two results at the pixel level, yielding the final multimodal semantic segmentation map.
The entire process utilizes a dominant-auxiliary unidirectional collaborative mechanism, preserving the integrity of the RGB dominant features while fully leveraging the complementary information of the X modality, thereby effectively improving the semantic segmentation accuracy and robustness in complex environments.
[0063] The method is illustrated below with a specific embodiment.
[0064] This embodiment constructs a semantic segmentation system for nighttime perception in autonomous driving, aiming to verify the robustness and accuracy of the invention under extreme lighting conditions. In terms of system configuration and training, this embodiment selects a pre-trained Mix Transformer-B2 (MiT-B2) as the feature extraction backbone for both the dominant and auxiliary streams, with the number of channels increasing stage by stage: 64, 128, 320, and 512. The decoder employs a lightweight multilayer perceptron (MLP) structure, and the number of output classes is 9, consistent with the multispectral semantic segmentation dataset (MFNet dataset). The input image size is uniformly set to 480×640 pixels. Training is based on the PyTorch framework and an NVIDIA RTX 3090 computing platform, using the MFNet dataset containing 1569 pairs of RGB-thermal images. The optimization strategy employs the AdamW optimizer combined with Poly learning rate decay, with the initial learning rate set to 6×. The total number of training epochs is 150. The loss function is designed as a weighted sum of the standard cross-entropy loss, the asymmetric feature alignment loss, and the region-aware decision alignment loss, with the weight coefficients all set to 1.0 to balance the basic supervision and the alignment constraints.
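For reference, Poly learning rate decay is commonly implemented as a polynomial schedule that lowers the rate from its initial value to zero over training. The sketch below shows that common form. The decay power is not stated in the text, so the default of 0.9 is an assumption, and `base_lr` is left as a parameter because the initial learning rate is incomplete in the text.

```python
def poly_learning_rate(base_lr, step, total_steps, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - step / total_steps) ** power.

    base_lr is whatever initial rate the training run uses (not fixed here);
    power=0.9 is a commonly used default and an assumption for this embodiment.
    """
    return base_lr * (1.0 - step / total_steps) ** power
```

With `power=1.0` the schedule reduces to linear decay, so the rate at the halfway point is exactly half of `base_lr`.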
[0065] In actual inference, when the system receives a typical nighttime driving scene image in which strong headlight glare interferes with the RGB camera's field of view, along with the corresponding thermal image, the network performs forward propagation. The gated complementary feature fusion module detects that the RGB branch (i.e., the dominant flow branch) exhibits disordered features in the glare region, while the thermal imaging branch (i.e., the auxiliary flow branch) retains a clear outline of human thermal radiation in the same region. Based on the principle of sign correlation, the gated complementary feature fusion module calculates a gating weight W close to 1 in this region, so the thermal imaging features are strongly activated and unidirectionally injected into the dominant flow, filling the semantic blind spot of the RGB modality. Finally, the system outputs a superimposed and corrected prediction map that clearly segments the pedestrian target obscured by strong light, while the background area unaffected by glare retains high-resolution RGB texture details.
[0066] The test results of this embodiment demonstrate the beneficial effects of the proposed method. The asymmetric architecture improves the system's anti-interference capability: even with blurred or low-quality auxiliary modality images, it can still effectively filter noise. Compared to traditional methods that directly concatenate features, the segmentation accuracy (mIoU) is improved by approximately 4.4%, and poor-quality data does not cause performance degradation. In extreme environments where RGB images are completely ineffective, the system adaptively adjusts feature dependency weights through the dynamic injection mechanism of the gated complementary feature fusion module, reducing missed pedestrian detections at night and demonstrating stable performance. Regarding computational efficiency, this embodiment, using only the lightweight MiT-B2 backbone network, achieves the same results as similar technologies that require a large backbone network such as MiT-B4. It has high parameter utilization and can be directly deployed on in-vehicle embedded devices.
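For readers unfamiliar with the metric, mean intersection over union (mIoU) averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The following is a minimal single-image sketch; benchmark evaluations usually accumulate counts over the whole test set, and the convention of skipping classes absent from both maps is an assumption.

```python
def mean_iou(pred, target, num_classes):
    """pred, target: H x W label maps (nested lists of class indices)."""
    ious = []
    for k in range(num_classes):
        inter = union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                inter += (p == k and t == k)
                union += (p == k or t == k)
        if union:  # skip classes that appear in neither map
            ious.append(inter / union)
    return sum(ious) / len(ious)
```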
[0067] It should be understood that, without departing from the core technical ideas of this application, the technical solutions described in this application can also have various alternative implementations. For example, in the selection of the backbone network, the Mix Transformer (MiT) used in the embodiments can be replaced by other mainstream feature extraction networks such as Swin Transformer or ResNet. In terms of the specific form of the alignment loss functions, the L2 norm loss used in asymmetric feature alignment can be replaced by a cosine similarity loss or a KL divergence loss, and the L1 loss used in region-aware decision alignment can be replaced by a cross-entropy loss. In addition, in terms of modality expansion, the input X modality of the auxiliary flow branch is not limited to the thermal imaging or depth data of the embodiments, but can also be extended to event camera or LiDAR data, and multiple kinds of auxiliary data can even be input simultaneously through channel concatenation, with the network structure remaining applicable.
[0068] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A multimodal semantic segmentation method based on asymmetric fusion, characterized in that it comprises:
obtaining a color modality image to be processed and a corresponding X-modality image;
inputting the color modality image and the X-modality image into a pre-trained multimodal semantic segmentation network, wherein the multimodal semantic segmentation network includes a dual-stream feature extraction module and a gated complementary feature fusion module, the dual-stream feature extraction module has a dominant stream branch and an auxiliary stream branch, the dominant stream branch has a first feature extraction layer and a first output layer connected in sequence, the auxiliary stream branch has a second feature extraction layer and a second output layer connected in sequence, one end of the gated complementary feature fusion module is unidirectionally connected to the second feature extraction layer, and the other end is bidirectionally connected to the first feature extraction layer;
performing feature extraction on the color modality image through the first feature extraction layer to obtain original dominant features;
performing feature extraction on the X-modality image through the second feature extraction layer to obtain original auxiliary features;
performing feature processing on the original dominant features through the gated complementary feature fusion module to obtain a dominant channel descriptor vector, and performing feature processing on the original auxiliary features to obtain an auxiliary channel descriptor vector;
calculating enhanced dominant features based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant features, and the original auxiliary features;
performing feature processing on the enhanced dominant features through the first output layer to obtain a first segmentation result;
performing feature processing on the original auxiliary features through the second output layer to obtain a second segmentation result;
obtaining a multimodal semantic segmentation map based on the first segmentation result and the second segmentation result.
2. The multimodal semantic segmentation method according to claim 1, characterized in that, The gated complementary feature fusion module includes a first feature processing unit, a second feature processing unit, a gated weight generation unit, and a feature enhancement unit. The first feature processing unit has a first pooling layer and a first multilayer perceptron connected in sequence. The second feature processing unit has a second pooling layer and a second multilayer perceptron connected in sequence. The output terminals of the first multilayer perceptron and the second multilayer perceptron are connected to the gated weight generation unit, and the gated weight generation unit is connected to the feature enhancement unit.
3. The multimodal semantic segmentation method according to claim 2, characterized in that, The process of performing feature processing on the original dominant features through the gated complementary feature fusion module to obtain the dominant channel descriptor vector includes: The original dominant feature is pooled through the first pooling layer to obtain the dominant pooled feature; The dominant pooling features are classified using the first multilayer perceptron to obtain the dominant channel descriptor vector.
4. The multimodal semantic segmentation method according to claim 2, characterized in that, The step of performing feature processing on the original auxiliary features to obtain the auxiliary channel descriptor vector includes: The original auxiliary features are pooled using the second pooling layer to obtain auxiliary pooled features; The auxiliary pooling features are classified using the second multilayer perceptron to obtain the auxiliary channel descriptor vector.
5. The multimodal semantic segmentation method according to claim 1, characterized in that, The enhanced dominant feature is calculated based on the dominant channel descriptor vector, the auxiliary channel descriptor vector, the original dominant feature, and the original auxiliary feature, including: The gating weights are calculated based on the dominant channel descriptor vector and the auxiliary channel descriptor vector. Based on the gating weights, the original dominant features, and the original auxiliary features, preliminary fusion features are calculated; Spatial attention is calculated on the preliminary fusion features to obtain a spatial attention map; Based on the spatial attention map and the preliminary fusion features, the enhanced dominant features are calculated.
6. The multimodal semantic segmentation method according to claim 1, characterized in that, The training steps of the multimodal semantic segmentation network include: An initial multimodal semantic segmentation model is constructed, which includes an initial dual-stream feature extraction module and a gated complementary feature fusion module. The initial dual-stream feature extraction module includes an initial dominant stream branch and an initial auxiliary stream branch. The initial dominant stream branch has an initial first feature extraction layer and an initial first output layer connected in sequence. The initial auxiliary stream branch has an initial second feature extraction layer and an initial second output layer connected in sequence. Obtain a training sample set, which includes color modality image samples and corresponding X-modality image samples; The color modality image samples are subjected to feature extraction through the initial first feature extraction layer to obtain the training dominant features; The X-modality image samples are subjected to feature extraction through the initial second feature extraction layer to obtain the training auxiliary features; The training dominant features are processed by the gated complementary feature fusion module to obtain the training dominant channel descriptor vector, and the training auxiliary features are processed to obtain the training auxiliary channel descriptor vector. Based on the training dominant channel descriptor vector, the training auxiliary channel descriptor vector, the training dominant features, and the training auxiliary features, the training enhanced dominant features are calculated.
The training enhancement dominant features are processed through the initial first output layer to obtain the first training segmentation result; The training auxiliary features are processed by the initial second output layer to obtain the second training segmentation result; Based on the preset target loss function, the first training segmentation result, and the second training segmentation result, the parameters of the initial dominant flow branch and the initial auxiliary flow branch are adjusted to obtain the trained multimodal semantic segmentation network.
7. The multimodal semantic segmentation method according to claim 6, characterized in that, The target loss function is calculated according to the following steps: Based on the training auxiliary channel descriptor vector and the training enhancement dominant feature, calculate the auxiliary modality components contained in the training dominant feature; Based on the auxiliary modal components and the training auxiliary features, the first loss function is calculated; Generate a semantic mask based on the first training segmentation result; Based on the semantic mask, the first training segmentation result, and the second training segmentation result, the second loss function is calculated; The target loss function is obtained based on the first loss function, the second loss function, and the segmentation cross-entropy loss.
8. The multimodal semantic segmentation method according to claim 1, characterized in that, The first output layer includes a first decoder and a first segmentation head; The enhanced dominant features are processed by the first output layer to obtain a first segmentation result, including: The enhanced dominant feature is processed by the first decoder to obtain the dominant modality decoded feature. The first segmentation head is used to perform prediction and segmentation processing on the dominant modality decoding features to obtain the first segmentation result.
9. An electronic device, characterized in that it comprises: at least one processor; and at least one memory for storing at least one program; wherein the multimodal semantic segmentation method as described in any one of claims 1 to 8 is implemented when the at least one program is executed by the at least one processor.
10. A computer-readable storage medium storing computer-executable instructions, characterized in that, The computer-executable instructions are used to execute the multimodal semantic segmentation method as described in any one of claims 1 to 8.