An RGB-T image multi-modal semantic segmentation method based on a state space
By fusing features from RGB and Thermal images in state space and utilizing cross-modal alternating scanning and multi-scale decoders, the high computational complexity and poor robustness of existing technologies are addressed, enabling real-time and efficient semantic segmentation on edge devices, especially accurate target recognition in complex nighttime scenes.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGDONG UNIV OF TECH
- Filing Date
- 2025-12-10
- Publication Date
- 2026-06-26
Figure CN121305091B_ABST (abstract drawing)
Description
Technical Field
[0001] This invention belongs to the field of semantic segmentation technology, and in particular relates to a multimodal semantic segmentation method for RGB-T images based on state space.
Background Technology
[0002] Semantic segmentation is crucial for object recognition and localization in autonomous driving and robotic vision systems. However, traditional segmentation methods based solely on visible light images often fail in low-light, high-reflection, or hazy environments. In contrast, thermal imaging provides stable radiometric information under complex lighting and adverse weather conditions, enhancing segmentation robustness. Fusing RGB and thermal images can significantly improve segmentation performance, especially under adverse environmental conditions. Early RGB and thermal data fusion methods relied primarily on classic strategies such as feature concatenation, modal superposition, and channel weighting. While these methods can fuse information, they often ignore inherent differences between modalities, leading to poor results. For example, feature concatenation does not adequately consider intermodal relationships, and modal superposition and channel weighting cannot effectively capture nonlinear relationships. To overcome these problems, recent research has introduced customized fusion modules to improve the accuracy of information fusion. However, most of these methods still rely on convolutional attention mechanisms, resulting in high computational complexity and incomplete information fusion. Transformer-based models can better capture cross-modal dependencies, but their high computational cost limits practical applications. To overcome these limitations, Mamba-based methods have made initial attempts at visible light and thermal image fusion by leveraging a global receptive field and dynamic weighting with linear complexity. Built on the VMamba architecture, they achieved initial results by fusing modalities in state space. However, most of them still process each modality independently and depend on modality-specific parameters, resulting in weak intermodal correlations and over-reliance on a single modality.
[0003] In summary, the existing technology has the following drawbacks: 1. The computational cost of using neural network models based on traditional Transformer architecture and dense convolutional architecture is too high, making them unsuitable for deployment in real-world scenarios.
[0004] 2. Existing cross-modal Mamba fusion processes each modality independently to obtain modality-specific state parameters C, and then exchanges them to construct state associations through the state equations, which makes it overly dependent on a single modality.
[0005] 3. Existing semantic segmentation decoders lack effective cross-level information interaction and fail to fully utilize contextual information at different levels, resulting in poor robustness of the model in complex scenarios.
[0006] 4. Existing methods fail to fully exploit the potential nonlinear relationships across modalities in their feature fusion layers, resulting in incomplete information fusion and affecting the final segmentation performance.
[0007] Therefore, this invention proposes a method that integrates RGB and thermal image features more effectively and fully utilizes the interaction between modalities during the fusion process. This method overcomes the problems of information loss and excessive computational complexity in existing methods and is of great significance for improving the accuracy and efficiency of semantic segmentation of visible light and thermal images.
Summary of the Invention
[0008] To address the aforementioned technical problems, this invention proposes a state-space-based RGB-T image multimodal semantic segmentation method, which overcomes the issues of information loss and excessive computational complexity in existing methods.
[0009] To achieve the above objectives, this invention provides a state-space-based RGB-T image multimodal semantic segmentation method, comprising:
[0010] Acquire visible light RGB images and thermal images;
[0011] The RGB image and Thermal image are input into a semantic segmentation model to obtain semantic segmentation results. The processing of the RGB image and Thermal image using the semantic segmentation model includes:
[0012] The RGB image and Thermal image are input into a twin encoder with shared parameters to extract multi-level features, resulting in RGB feature maps and Thermal feature maps.
[0013] The RGB feature map and Thermal feature map are input into the feature fusion module. A state space sequence is constructed through a cross-modal alternating scanning mechanism, and frequency domain and spatial domain supplementary parameters are introduced to generate a fused feature map.
[0014] The fused feature map is input into a multi-scale decoder, and a semantic segmentation prediction map is generated through cross-level feature interaction and context aggregation mechanisms.
[0015] Optionally, the shared parameter twin encoder includes: a plurality of VSSB downsampling blocks;
[0016] The VSSB downsampling block is used to extract the spatial features of the RGB and Thermal images layer by layer, and output the corresponding RGB feature maps and Thermal feature maps.
[0017] Optionally, the feature fusion module includes: a feature processing unit, a cross-modal representation unit, and a feature fusion unit;
[0018] The feature processing unit is used to perform linear projection and convolution operations on the input RGB feature map and Thermal feature map respectively to generate the corresponding first intermediate feature map and second intermediate feature map;
[0019] The cross-modal representation unit is used to obtain a state-space representation containing alternating RGB and Thermal sequences based on the first intermediate feature map and the second intermediate feature map.
[0020] The feature fusion unit is used to fuse the state space representation to obtain the fused feature map.
[0021] Optionally, the cross-modal representation unit includes: an alternating cross-modal selective scanning layer, a frequency-space information generation layer, a state-space modeling layer, and a scan merging layer;
[0022] The alternating cross-modal selective scanning layer is used to input the first intermediate feature map and the second intermediate feature map into the alternating scanning process in four directions to obtain a cross-modal visual sequence, wherein the cross-modal visual sequence includes a global sequence and a local sequence;
[0023] The frequency-spatial information generation layer is used to introduce frequency domain and spatial domain supplementary parameters to process the first intermediate feature map and the second intermediate feature map to obtain spatial parameters and frequency state parameters.
[0024] The state space modeling layer is used to obtain a state space sequence based on the cross-modal visual sequence, spatial parameters, and frequency state parameters. The state space sequence is reorganized into a four-directional perception sequence, wherein the four directions are a global positive row, a global positive column, a window positive row, and a window positive column.
[0025] The scan merging layer is used to merge the state space sequence to obtain a state space representation containing alternating RGB and Thermal sequences.
[0026] Optionally, introducing frequency domain and spatial domain supplementary parameters to process the first intermediate feature map and the second intermediate feature map to obtain spatial parameters and frequency state parameters includes:
[0027] Multi-scale dilation convolution operations are performed on the first intermediate feature map and the second intermediate feature map respectively to extract local spatial information and generate spatial parameters.
[0028] Fast Fourier Transform and linear attention operation are performed on the first intermediate feature map and the second intermediate feature map respectively to extract frequency domain information and generate frequency state parameters.
[0029] Optionally, obtaining the state-space sequence based on the cross-modal visual sequence, spatial parameters, and frequency state parameters includes:
[0030] ;
[0031] ;
[0032] where the symbols in the formulas denote, in order: the RGB hidden state; the discretized cross-modal state transition matrix; the previous Thermal hidden state; the discretized input coupling matrix; the current RGB modality input features; the state-space sequence; the observation matrix; the frequency state parameters; the direct feedthrough matrix; and the spatial parameters.
[0033] Optionally, the multi-scale decoder includes: a plurality of upsampling blocks;
[0034] The upsampling block is used to receive fused feature maps from the current level and the previous level, and generate the decoded feature map of the current level through resolution alignment, channel dimension aggregation and convolution operations.
[0035] Optionally, the resolution alignment is achieved through average pooling and upsampling operations;
[0036] The channel dimension aggregation is achieved through 1×1 convolution and element-wise multiplication operations;
[0037] The convolution operation includes 3×3 convolution for feature fusion.
[0038] Optionally, the multi-scale decoder concatenates the decoded feature maps from all levels and inputs them into a multilayer perceptron to generate a final semantic segmentation prediction map with the same resolution as the input image.
[0039] Compared with the prior art, the present invention has the following advantages and technical effects:
[0040] (1) This invention unifies the encoder, fusion block and decoder in the state space. Thanks to the linear computational complexity of Mamba, a good balance is achieved between model complexity and accuracy, making it particularly suitable for application on resource-constrained edge devices.
[0041] (2) The cross-modal state space alternating scan algorithm designed in this invention overcomes the over-reliance on a single modality caused by constructing state associations with modality-specific parameters C, and overcomes the inherent defect of the SSM mechanism, which builds global associations only gradually.
[0042] (3) This invention can perform depth estimation in real-world scenarios by training or fine-tuning the model, and can achieve real-time performance.
Description of the Drawings
[0043] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0044] Figure 1 is a flowchart of an RGB-T image multimodal semantic segmentation method based on state space according to an embodiment of the present invention;
[0045] Figure 2 is a framework diagram of an RGB-T image multimodal semantic segmentation method based on state space according to an embodiment of the present invention;
[0046] Figure 3 is a schematic diagram of the feature fusion module according to an embodiment of the present invention;
[0047] Figure 4 is a schematic diagram of a local spatial and global frequency parameter generation unit for feature fusion according to an embodiment of the present invention.
Detailed Implementation
[0048] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0049] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0050] This embodiment proposes a state-space-based RGB-T image multimodal semantic segmentation method. As shown in Figure 1, the specific steps include:
[0051] Acquire visible light RGB images and thermal images;
[0052] The RGB image and Thermal image are input into a semantic segmentation model to obtain semantic segmentation results. The processing of the RGB image and Thermal image using the semantic segmentation model includes:
[0053] The RGB image and Thermal image are input into a twin encoder with shared parameters to extract multi-level features, resulting in RGB feature maps and Thermal feature maps.
[0054] The RGB feature map and Thermal feature map are input into the feature fusion module. A state space sequence is constructed through a cross-modal alternating scanning mechanism, and frequency domain and spatial domain supplementary parameters are introduced to generate a fused feature map.
[0055] The fused feature map is input into a multi-scale decoder, and a semantic segmentation prediction map is generated through cross-level feature interaction and context aggregation mechanisms.
[0056] Specifically, as shown in Figure 2, this embodiment proposes a state-space-based RGB-T image multimodal semantic segmentation architecture that explores the application of state-space models in multimodal environments. Considering the complex characteristics of daytime and nighttime scenes, this embodiment designs a feature fusion module. This module combines cross-modal alternating scanning with supplementary state parameters from the spatial and frequency domains to construct alternating state equations for efficient fusion. To effectively utilize the fused features, this embodiment designs a multi-scale upsampling block, achieving progressive decoding through scale refinement and context awareness. To verify practical performance, recognition was performed on real-world daytime and nighttime scenes. With the model deployed on edge devices, a speed of 35 frames per second was achieved, enabling real-time computation. It also demonstrated optimal performance in terms of segmentation mIoU. A series of experiments shows the effectiveness of the proposed ACNet and its balance between computational complexity and accuracy.
[0057] This embodiment collects a dataset that reflects the complex characteristics of daytime and nighttime scenes. The dataset is divided into daytime and nighttime subsets and is annotated with 9 label categories. The output is a mask image in JPG format, in which the label of each pixel is a category number (0 represents background, 1 represents car, 2 represents pedestrian, 3 represents bicycle, 4 represents curb, 5 represents parking spot, 6 represents guardrail, 7 represents traffic cone, and 8 represents speed bump).
[0058] Specifically, as shown in Figure 3 and Figure 4, alternating cross-modal selective scanning is employed in the feature fusion layer, supplemented with frequency and local information. A composite module consisting of an alternating cross-modal selective scanning layer, a frequency-spatial information generation layer, a state-space modeling layer, and a scan merging layer is constructed. This structure mitigates differences in modal representation through alternating scanning and combines local and frequency information to suppress large-scale semantic confusion and improve the segmentation of small-scale targets. It can effectively segment the contours and boundaries of both distant small-scale objects and nearby large-scale objects, thereby significantly improving the segmentation effect.
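As a small illustration of the label scheme listed above, the following sketch maps category numbers to names and counts pixels per class in a decoded mask. The helper name `class_histogram` and the toy mask are hypothetical; only the nine category numbers and names come from the text.

```python
# Category numbers and names exactly as listed in the text above.
CLASS_NAMES = {
    0: "background",
    1: "car",
    2: "pedestrian",
    3: "bicycle",
    4: "curb",
    5: "parking spot",
    6: "guardrail",
    7: "traffic cone",
    8: "speed bump",
}


def class_histogram(mask):
    """Count pixels per class name in a 2D mask of integer category numbers.

    Hypothetical helper for illustration; `mask` is a list of rows.
    """
    counts = {name: 0 for name in CLASS_NAMES.values()}
    for row in mask:
        for label in row:
            counts[CLASS_NAMES[label]] += 1
    return counts


# Toy 2 x 3 mask (made up for the example).
mask = [[0, 0, 1],
        [2, 1, 1]]
hist = class_histogram(mask)
```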
[0059] Overall, the advantage of this embodiment lies in achieving a good balance between model complexity and accuracy, making it particularly suitable for application on resource-constrained edge devices, and capable of achieving good recognition results in specific nighttime scenes.
[0060] Furthermore, the shared parameter twin encoder includes: a plurality of VSSB downsampling blocks;
[0061] The VSSB downsampling block is used to extract the spatial features of the RGB image and the Thermal image layer by layer, and output the RGB feature map and Thermal feature map of the corresponding level.
[0062] Specifically, this embodiment designs a Siamese shared network. To speed up inference and save parameters, the encoders for the RGB and Thermal inputs share parameters. Given an RGB image and a Thermal image, downsampling is performed layer by layer using VSSB downsampling blocks, which employ the VSSB blocks in VMamba, ultimately yielding two sets of features, one for RGB and one for Thermal. Here C, H and W, and the layer index denote the channel dimension, spatial resolution, and encoder layer index, respectively.
[0063] Furthermore, the feature fusion module includes: a feature processing unit, a cross-modal representation unit, and a feature fusion unit;
[0064] The feature processing unit is used to perform linear projection and convolution operations on the input RGB feature map and Thermal feature map respectively to generate the corresponding first intermediate feature map and second intermediate feature map;
[0065] The cross-modal representation unit is used to obtain a state-space representation containing alternating RGB and Thermal sequences based on the first intermediate feature map and the second intermediate feature map.
[0066] The feature fusion unit is used to fuse the state space representation to obtain the fused feature map.
[0067] Specifically, as shown in Figure 3, the RGB and Thermal features obtained at a given encoder layer are first linearly projected and then processed in four stages to enhance cross-modal representation: (i) alternating cross-modal selective scanning, (ii) frequency-space information generation, (iii) state-space modeling, and (iv) scan merging. The features obtained from linear projection are:
[0068] ;
[0069] ;
[0070] where the symbols in the formulas denote, in order, the SiLU activation function, a linear fully connected layer, and a convolution whose subscript indicates the kernel size.
[0071] Finally, to enhance feature representation, this embodiment introduces global feature enhancement based on residual connections to obtain the global representations of the RGB and Thermal modalities:
[0072] ;
[0073] ;
[0074] where LN represents layer normalization, FC represents a linear fully connected layer, and the product symbol represents the multiplication operation. The two global representations are then fused by a convolution, and the SiLU activation function is used to generate the final representation:
[0075] ;
[0076] Furthermore, the cross-modal representation unit includes: an alternating cross-modal selective scanning layer, a frequency-space information generation layer, a state-space modeling layer, and a scan merging layer;
[0077] The alternating cross-modal selective scanning layer is used to input the first intermediate feature map and the second intermediate feature map into the alternating scanning process in four directions to obtain a cross-modal visual sequence, wherein the cross-modal visual sequence includes a global sequence and a local sequence;
[0078] The frequency-spatial information generation layer is used to introduce frequency domain and spatial domain supplementary parameters to process the first intermediate feature map and the second intermediate feature map to obtain spatial parameters and frequency state parameters.
[0079] The state space modeling layer is used to obtain a state space sequence based on the cross-modal visual sequence, spatial parameters, and frequency state parameters, wherein the state space sequence is reorganized into a four-directional perception sequence.
[0080] The scan merging layer is used to merge the state space sequence to obtain a state space representation containing alternating RGB and Thermal sequences.
[0081] Specifically, in the alternating cross-modal selective scanning layer, the generated RGB and Thermal features are input into alternating cross-modal selective scan blocks to jointly construct a state equation that establishes cross-modal correlations. The RGB and Thermal features are input into the four-direction alternating scan process designed in this embodiment, wherein the four directions are global positive rows, global positive columns, window positive rows, and window positive columns, to construct a cross-modal visual sequence. The scan is divided into global and local sequences: two of the four sequences are global sequences and two are local sequences, with a window size of [missing information]. Each sequence is composed of alternating RGB and Thermal patches. Then, from each sequence, the learnable parameters B, C, and Δ are obtained through linear projection, while A and D are defined as learnable state parameters. Δ is the time-scale parameter, which discretizes the continuous parameters A and B into a discrete state space, as shown in the following formula:
[0082] .
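The four scan orders and the discretization step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the per-position interleaving of RGB and Thermal patches, the row-major traversal of windows, and the zero-order-hold rule (the rule used in Mamba and VMamba, shown here for a scalar state) are all assumptions, since the formula and window size are not reproduced in this text. All function names are hypothetical.

```python
import math


def global_order(h, w, by_column=False):
    """Positions of an h x w grid in forward row-major or column-major order."""
    if by_column:
        return [(r, c) for c in range(w) for r in range(h)]
    return [(r, c) for r in range(h) for c in range(w)]


def window_order(h, w, win, by_column=False):
    """Visit win x win windows in row-major order and scan inside each window."""
    order = []
    for r0 in range(0, h, win):
        for c0 in range(0, w, win):
            rows = range(r0, min(r0 + win, h))
            cols = range(c0, min(c0 + win, w))
            if by_column:
                order += [(r, c) for c in cols for r in rows]
            else:
                order += [(r, c) for r in rows for c in cols]
    return order


def alternate(rgb, thermal, order):
    """Interleave RGB and Thermal patches along one scan order (assumed scheme)."""
    seq = []
    for r, c in order:
        seq.append(("rgb", rgb[r][c]))
        seq.append(("thermal", thermal[r][c]))
    return seq


def zoh_discretize(a, b, delta):
    """Zero-order hold for a scalar state with a != 0: returns (A_bar, B_bar)."""
    a_bar = math.exp(delta * a)
    b_bar = (a_bar - 1.0) / a * b
    return a_bar, b_bar


# Toy 2 x 2 "feature maps" with one value per patch.
rgb = [[1, 2], [3, 4]]
thermal = [[10, 20], [30, 40]]
seq_col = alternate(rgb, thermal, global_order(2, 2, by_column=True))
a_bar, b_bar = zoh_discretize(a=-1.0, b=1.0, delta=0.1)
```

Calling `alternate` once per scan order (two global, two windowed) yields the four cross-modal sequences that the later stages consume.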
[0083] Furthermore, introducing frequency domain and spatial domain supplementary parameters to process the first intermediate feature map and the second intermediate feature map to obtain spatial parameters and frequency state parameters includes:
[0084] Multi-scale dilation convolution operations are performed on the first intermediate feature map and the second intermediate feature map respectively to extract local spatial information and generate spatial parameters.
[0085] Fast Fourier Transform and linear attention operation are performed on the first intermediate feature map and the second intermediate feature map respectively to extract frequency domain information and generate frequency state parameters.
[0086] Specifically, frequency-spatial information generation: To alleviate the local spatial misalignment and disorder caused by flattening image patches into a one-dimensional sequence in the SSM, and to address the gradual weakening of inherent structural information in Mamba global context modeling, this embodiment introduces a local spatial information generation block to capture local neighborhood correlations and thereby generate state parameters. Given the RGB and Thermal features, both modalities follow the same processing flow, so this embodiment uses the RGB modality as an example. Dilated convolutions with different dilation rates d are used to capture multi-scale local context and obtain the spatial parameters:
[0087] ;
[0088] ;
[0089] ;
[0090] Here, Flatten represents the flattening operation. The same operation is applied to the Thermal modality. The spatial parameters of the two modalities are then added together to obtain the final spatial parameters:
[0091] ;
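The local spatial branch described above (dilated convolutions at several rates, flattening, then adding the two modalities) can be sketched as below. The single-channel input, the 3×3 kernel, the dilation rates (1, 2, 3), and summing the responses across rates are assumptions made for illustration; the original formulas are not reproduced in this text.

```python
def dilated_conv3x3(x, kernel, d):
    """3x3 convolution with dilation d and zero padding (same output size)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    rr, cc = r + i * d, c + j * d
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[i + 1][j + 1] * x[rr][cc]
            out[r][c] = acc
    return out


def spatial_parameter(x, kernel, rates=(1, 2, 3)):
    """Sum dilated responses over several rates, then flatten row-major."""
    h, w = len(x), len(x[0])
    total = [[0.0] * w for _ in range(h)]
    for d in rates:
        y = dilated_conv3x3(x, kernel, d)
        for r in range(h):
            for c in range(w):
                total[r][c] += y[r][c]
    return [v for row in total for v in row]


# Identity kernel keeps the example easy to check by hand.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
s_rgb = spatial_parameter([[1, 2], [3, 4]], identity)
s_thermal = spatial_parameter([[4, 3], [2, 1]], identity)
# The two modalities are added to form the shared spatial parameters.
s = [a + b for a, b in zip(s_rgb, s_thermal)]
```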
[0092] Similarly, considering that Mamba builds global correlations only gradually, this embodiment introduces a global frequency information generation block, whose outputs are obtained through a Fast Fourier Transform, a linear attention mechanism, and an inverse Fast Fourier Transform:
[0093] ;
[0094] ;
[0095] ;
[0096] ;
[0097] where Atten represents linear attention, fft represents the fast Fourier transform, and ifft represents the inverse fast Fourier transform. The resulting features are then concatenated and used to generate the frequency state parameters through a multilayer perceptron:
[0098] ;
[0099] where Cat represents concatenation and FC represents a linear fully connected layer.
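The FFT, linear attention, inverse FFT pipeline of the global frequency branch can be sketched with NumPy as follows. How the original handles complex spectra is not stated in this text, so treating the real and imaginary parts as a two-dimensional token, using the `elu(t) + 1` feature map for linear attention, and using the spectrum itself as query, key, and value are assumptions. The function names are hypothetical.

```python
import numpy as np


def _phi(t):
    """Positive feature map elu(t) + 1, written to avoid overflow in exp."""
    return np.where(t > 0, t + 1.0, np.exp(np.minimum(t, 0.0)))


def linear_attention(q, k, v, eps=1e-12):
    """Kernelized attention with cost linear in the number of tokens N.

    q, k: (N, d); v: (N, d_v). Returns (N, d_v).
    """
    q, k = _phi(q), _phi(k)
    kv = k.T @ v                 # (d, d_v), computed once for all queries
    z = q @ k.sum(axis=0)        # (N,), per-query normaliser
    return (q @ kv) / (z[:, None] + eps)


def frequency_branch(x):
    """x: (H, W) real feature map -> (H, W) real map mixed in the frequency domain."""
    spec = np.fft.fft2(x)
    tokens = np.stack([spec.real.ravel(), spec.imag.ravel()], axis=1)  # (N, 2)
    mixed = linear_attention(tokens, tokens, tokens)
    spec_out = (mixed[:, 0] + 1j * mixed[:, 1]).reshape(x.shape)
    return np.fft.ifft2(spec_out).real


x = np.arange(16, dtype=float).reshape(4, 4)
f_rgb = frequency_branch(x)
```

In the described method the RGB and Thermal outputs of this branch would then be concatenated and passed through a fully connected layer to produce the frequency state parameters; that learned projection is omitted here.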
[0100] Further, the method for obtaining the state-space sequence based on the cross-modal visual sequence, spatial parameters, and frequency state parameters includes:
[0101] ;
[0102] ;
[0103] where the symbols in the formulas denote, in order: the RGB hidden state; the discretized cross-modal state transition matrix; the previous Thermal hidden state; the discretized input coupling matrix; the current RGB modality input features; the state-space sequence; the observation matrix; the frequency state parameters; the direct feedthrough matrix; and the spatial parameters.
[0104] Specifically, in the state-space modeling layer, the acquired state parameters and their supplementary components are used to jointly establish state equations that generate the states of a specific modality. This process dynamically switches between modalities: when the RGB modality is active, the RGB hidden state is updated from the current RGB input and the previous Thermal hidden state, preserving cross-modal interactions and temporal memory. Because the standard SSM construction process builds global and local dependencies only gradually, this embodiment further injects the frequency and spatial information into the discrete SSM equations.
[0105] Scan merging layer: After state-space modeling, the RGB features and Thermal features are recombined into four-directional perception sequences, where each sequence corresponds to a given scanning direction. The sequences from all directions are then summed to form the output features.
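The alternating update described above, in which the hidden state of one modality is computed from the previous hidden state of the other, falls out naturally when a standard recurrence is run over an interleaved sequence. The sketch below uses scalar parameters for readability. The exact way the frequency parameters F and spatial parameters S enter the equations is not reproduced in this text, so adding F to the observation matrix C and S to the feedthrough matrix D is an assumption, as are the function names.

```python
def alternating_ssm(seq, a_bar, b_bar, c, d, f=0.0, s=0.0):
    """Run a scalar discrete SSM over interleaved (modality, value) tokens.

    Because tokens alternate between modalities, each hidden state is updated
    from the hidden state of the other modality. Injecting f and s additively
    is an assumption made for this sketch.
    """
    h = 0.0
    outputs = []
    for modality, x in seq:
        h = a_bar * h + b_bar * x          # cross-modal state update
        y = (c + f) * h + (d + s) * x      # readout with injected F and S
        outputs.append((modality, y))
    return outputs


def merge_directions(outputs_per_direction):
    """Sum per-position outputs from all scan directions.

    Each inner list must already be restored to a common spatial order.
    """
    return [sum(vals) for vals in zip(*outputs_per_direction)]


seq = [("rgb", 1.0), ("thermal", 2.0), ("rgb", 3.0)]
plain = alternating_ssm(seq, a_bar=0.5, b_bar=1.0, c=1.0, d=0.0)
with_freq = alternating_ssm(seq, a_bar=0.5, b_bar=1.0, c=1.0, d=0.0, f=1.0)
merged = merge_directions([[1, 2], [3, 4]])
```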
[0106] Furthermore, the multi-scale decoder includes: a plurality of upsampling blocks;
[0107] The upsampling block is used to receive fused feature maps from the current level and the previous level, and generate the decoded feature map of the current level through resolution alignment, channel dimension aggregation and convolution operations.
[0108] Furthermore, the resolution alignment is achieved through average pooling and upsampling operations;
[0109] The channel dimension aggregation is achieved through 1×1 convolution and element-wise multiplication operations;
[0110] The convolution operation includes 3×3 convolution for feature fusion.
[0111] Furthermore, the multi-scale decoder concatenates the decoded feature maps from all levels and inputs them into the multilayer perceptron to generate a final semantic segmentation prediction map with the same resolution as the input image.
[0112] Specifically, the multi-scale decoder: To mitigate semantic bias and feature degradation caused by the lack of global context during progressive decoding, this embodiment enhances global semantic understanding through cross-level feature interaction and multi-scale context aggregation. The fused features serve as the input to the first layer of the MSCD. The input of each subsequent layer is the semantic features generated by the previous layers, and these layers gradually integrate the features to construct a hierarchical semantic feature map. The feature mapping is defined as follows:
[0113] ;
[0114] where j represents the layer index of the upsampling block, and the remaining two symbols denote, respectively, the feature map generated by the corresponding upsampling block and, except for the first layer, the input feature set of the j-th upsampling block. Including the outputs of the first (j-1) layers enables progressive multi-semantic interaction and aggregation. This design helps to construct globally consistent and semantically rich feature representations. The overall process adapts to the layer index, and this embodiment takes one layer as an example. First, the input features of that layer are aligned to a common resolution. Similarly, for the third layer, its three input features are aligned to a common resolution. This alignment operation is performed through average pooling and upsampling. Then all features are aggregated: a convolution aligns each feature map along the channel dimension, the aligned features are combined through element-wise multiplication, and finally high-level semantics are preserved through residual connections. This aggregation is applied recursively to all features to generate the aggregated features:
[0115] ;
[0116] ;
[0117] GF stands for aggregation operation.
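The aggregation steps described above (resolution alignment by average pooling or upsampling, channel alignment by 1×1 convolution, element-wise multiplication, and a residual connection) can be sketched with NumPy as follows. Nearest-neighbour upsampling, integer scale factors, square feature maps, and taking the residual from the last feature map in the list are assumptions; the original formulas are not reproduced in this text, and all names are hypothetical.

```python
import numpy as np


def avg_pool(x, k):
    """Average-pool a (C, H, W) map by an integer factor k."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))


def upsample(x, k):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)


def align(x, size):
    """Bring a square (C, H, H) map to (C, size, size); assumes integer ratios."""
    h = x.shape[1]
    if h > size:
        return avg_pool(x, h // size)
    if h < size:
        return upsample(x, size // h)
    return x


def conv1x1(x, weight):
    """1x1 convolution; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def aggregate(features, weights, size):
    """Align all maps, multiply them element-wise, then add a residual."""
    aligned = [conv1x1(align(f, size), w) for f, w in zip(features, weights)]
    product = aligned[0]
    for a in aligned[1:]:
        product = product * a
    return product + aligned[-1]  # residual from the last map (assumed choice)


f1 = np.full((2, 4, 4), 2.0)      # higher-resolution map with 2 channels
f2 = np.full((3, 2, 2), 3.0)      # lower-resolution map with 3 channels
w1 = np.eye(2)                    # keeps the 2 channels unchanged
w2 = np.full((2, 3), 1.0 / 3.0)   # averages 3 channels into 2
out = aggregate([f1, f2], [w1, w2], size=4)
```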
[0118] To improve computational efficiency and stabilize global multi-scale aggregation, all features are first projected onto a common spatial resolution and channel dimension via convolution, resulting in:
[0119] ;
[0120] The remaining features are obtained in the same way. Then, in this embodiment, all features are concatenated and, using a multilayer perceptron (MLP), upsampled to a common resolution, finally generating a prediction map:
[0121] ;
[0122] Finally, in this embodiment, the loss between the prediction map (Predict) and the ground truth (GT) can be calculated using cross-entropy loss.
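For reference, a per-pixel cross-entropy loss of the kind mentioned above can be written in a few lines. This is a generic sketch of the standard loss, averaged over pixels; the function name and the flat list layout are choices made for the example, not details from the original.

```python
import math


def pixel_cross_entropy(logits, target):
    """Mean cross-entropy over pixels.

    logits: list of per-pixel score lists (one score per class).
    target: list of ground-truth category numbers, one per pixel.
    """
    total = 0.0
    for scores, label in zip(logits, target):
        m = max(scores)  # subtract the max to keep exp() numerically stable
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum - scores[label]
    return total / len(target)


# Two pixels with uniform scores over the 9 categories.
uniform_loss = pixel_cross_entropy([[0.0] * 9, [0.0] * 9], [0, 8])
```

With uniform scores over 9 categories the loss equals ln 9, which is a convenient sanity check for an untrained model.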
[0123] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for multimodal semantic segmentation of RGB-T images based on state space, characterized in that it comprises: acquiring visible light RGB images and Thermal images; inputting the RGB image and Thermal image into a semantic segmentation model to obtain semantic segmentation results, wherein the processing of the RGB image and Thermal image by the semantic segmentation model includes: the RGB image and Thermal image are input into a twin encoder with shared parameters to extract multi-level features, resulting in RGB feature maps and Thermal feature maps; the RGB feature map and Thermal feature map are input into the feature fusion module, a state space sequence is constructed through a cross-modal alternating scanning mechanism, and frequency domain and spatial domain supplementary parameters are introduced to generate a fused feature map; the feature fusion module includes: a feature processing unit, a cross-modal representation unit, and a feature fusion unit; the feature processing unit is used to perform linear projection and convolution operations on the input RGB feature map and Thermal feature map respectively to generate the corresponding first intermediate feature map and second intermediate feature map; the cross-modal representation unit includes: an alternating cross-modal selective scanning layer, a frequency-space information generation layer, a state-space modeling layer, and a scan merging layer; the alternating cross-modal selective scanning layer is used to input the first intermediate feature map and the second intermediate feature map into the alternating scanning process in four directions to obtain a cross-modal visual sequence, wherein the cross-modal visual sequence includes a global sequence and a local sequence, and the four directions are global positive rows, global positive columns, window positive rows and window positive columns; the frequency-spatial information generation layer is used to introduce frequency domain and spatial domain supplementary parameters to process the first intermediate feature map and the second intermediate feature map to obtain spatial parameters and frequency state parameters; the state space modeling layer is used to obtain a state space sequence based on the cross-modal visual sequence, spatial parameters, and frequency state parameters, wherein the state space sequence is reorganized into a four-directional perception sequence; the scan merging layer is used to merge the state space sequence to obtain a state space representation containing alternating RGB and Thermal sequences; obtaining the state space sequence based on the cross-modal visual sequence, spatial parameters, and frequency state parameters includes: ; ; ; ; where the symbols in the formulas denote, in order: the RGB hidden state; the discretized cross-modal state transition matrix; the previous Thermal hidden state; the discretized input coupling matrix; the current RGB modality input features; the state-space sequence; the observation matrix; the frequency state parameters; the direct feedthrough matrix; and the spatial parameters; the fused feature map is input into a multi-scale decoder, and a semantic segmentation prediction map is generated through cross-level feature interaction and context aggregation mechanisms.
2. The method for multimodal semantic segmentation of RGB-T images based on state space according to claim 1, characterized in that the shared parameter twin encoder includes a plurality of VSSB downsampling blocks; the VSSB downsampling block is used to extract the spatial features of the RGB image and the Thermal image layer by layer, and output the RGB feature map and Thermal feature map of the corresponding level.
3. The method for multimodal semantic segmentation of RGB-T images based on state space according to claim 1, characterized in that the cross-modal representation unit is used to obtain a state-space representation containing alternating RGB and Thermal sequences based on the first intermediate feature map and the second intermediate feature map; the feature fusion unit is used to fuse the state space representation to obtain the fused feature map.
4. The method for multimodal semantic segmentation of RGB-T images based on state space according to claim 1, characterized in that introducing frequency domain and spatial domain supplementary parameters to process the first intermediate feature map and the second intermediate feature map to obtain spatial parameters and frequency state parameters includes: multi-scale dilation convolution operations are performed on the first intermediate feature map and the second intermediate feature map respectively to extract local spatial information and generate spatial parameters; Fast Fourier Transform and linear attention operations are performed on the first intermediate feature map and the second intermediate feature map respectively to extract frequency domain information and generate frequency state parameters.
5. The method for multimodal semantic segmentation of RGB-T images based on state space according to claim 1, characterized in that the multi-scale decoder includes several upsampling blocks; the upsampling block is used to receive fused feature maps from the current level and the previous level, and generate the decoded feature map of the current level through resolution alignment, channel dimension aggregation and convolution operations.
6. The method for multimodal semantic segmentation of RGB-T images based on state space according to claim 5, characterized in that the resolution alignment is achieved through average pooling and upsampling operations; the channel dimension aggregation is achieved through 1×1 convolution and element-wise multiplication operations; the convolution operation includes 3×3 convolution for feature fusion.
7. The method for multimodal semantic segmentation of RGB-T images based on state space according to claim 6, characterized in that the multi-scale decoder concatenates the decoded feature maps from all levels and inputs them into the multilayer perceptron to generate a final semantic segmentation prediction map with the same resolution as the input image.