A remote sensing image semantic segmentation method based on state space modeling and hybrid expert mechanism
The MambaMoENet model combines a state-space model with a hybrid expert (mixture-of-experts) mechanism to address both global context modeling and local detail capture in remote sensing images, achieving efficient semantic segmentation of high-resolution remote sensing images with improved accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- YUNNAN NORMAL UNIV
- Filing Date
- 2026-05-13
- Publication Date
- 2026-06-19
Figure CN122244870A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a remote sensing image semantic segmentation method based on state-space modeling and a hybrid expert mechanism, belonging to the field of remote sensing image semantic segmentation technology.
Background Technology
[0002] With the continuous deployment of sub-meter resolution next-generation satellite sensors, the number of high-resolution remote sensing images acquired has increased exponentially, posing challenges to the current intelligent interpretation of remote sensing semantic segmentation. On the one hand, the blurred boundaries of ground features and mutual occlusion of targets caused by complex scenes in high-resolution images make semantic segmentation models more prone to semantic ambiguity. On the other hand, the high heterogeneity of multi-scale targets in spatial distribution places higher demands on the feature representation capabilities of the models.
[0003] Traditional semantic segmentation methods are constrained by the limited expressiveness of handcrafted features, making them ill-suited to complex remote sensing scenes. For example, the Gray-Level Co-occurrence Matrix (GLCM) provides only a shallow description of texture, and Random Forest falls short in modeling nonlinear decision boundaries, so neither handles the rich details and complex scale variations of high-resolution remote sensing images effectively. With the continued development of computer vision, deep learning, represented by Convolutional Neural Networks (CNNs), has substantially improved remote sensing semantic segmentation through end-to-end learning. Fully Convolutional Networks (FCNs) achieve pixel-wise semantic segmentation of images of arbitrary size by replacing fully connected layers with convolutional layers. SegNet optimizes feature reconstruction by recording the positions selected by its pooling layers and reusing them for unpooling; U-Net adopts a symmetric U-shaped structure with skip connections that fuse encoder and decoder features, significantly improving the representation of detail. The DeepLab series then continued to evolve: DeepLabv1 introduced dilated (atrous) convolutions, DeepLabv2 proposed the Atrous Spatial Pyramid Pooling (ASPP) module, which improves performance through multi-scale feature extraction, and DeepLabv3 augmented the ASPP module with global pooling, improving both the receptive field and the representation of detail. PSPNet's pyramid pooling strategy, RefineNet's cascaded residual design, and DenseASPP's dense application of dilated convolutions have all further enhanced spatial feature extraction.
[0004] Although CNN-based methods perform well in local feature extraction, they are limited by their feature fusion mechanism and local receptive field. As a result, they handle long-range dependencies poorly and struggle to meet the multi-scale semantic segmentation requirements of complex scenes in high-resolution remote sensing images.
[0005] To overcome the limitations of CNNs, visual Transformer-based methods have been introduced into the field of remote sensing image semantic segmentation. With its powerful ability to capture global contextual information, the Transformer has gradually become a research hotspot in this field. Transformer-based remote sensing image semantic segmentation methods have shown significant advantages in capturing global contextual information and long-range feature dependencies. However, their inherent limitations—such as large parameter count, high computational cost, and loss of local details—severely restrict their application in complex scenes of high-resolution remote sensing images. Especially in remote sensing scenes with significant target scale variations and complex backgrounds, a single feature extraction mechanism cannot simultaneously meet the needs of global modeling and local detail capture. Therefore, how to accurately capture local details and reduce computational complexity while maintaining efficient global contextual modeling capabilities has become a core scientific problem that urgently needs to be solved in the current field of remote sensing semantic segmentation.
[0006] Recently, Mamba, a novel architecture based on the State-Space Model (SSM), has provided a new technical approach to the aforementioned problems. Mamba achieves high training and inference efficiency by combining the State-Space Model with hardware-aware optimization. Taking UNetMamba as an example, which combines the Mamba architecture with UNet, it achieves efficient global context modeling and significantly reduces the quadratic computational burden of traditional Transformers. However, Mamba-based methods still have shortcomings in multi-scale feature fusion and local detail capture, making it difficult for them to fully adapt to remote sensing images with significant target scale variations and complex edge textures.
[0007] Hybrid expert systems, known as Mixture of Experts (MoE), are an advanced model architecture that achieves high computational efficiency by decomposing a complex task into multiple sub-tasks, which are processed in parallel by dedicated "expert" modules and dynamically allocated by a "gating network". The core principle is sparsity: only a few experts, rather than all parameters, are activated at a time, which significantly reduces computational cost while retaining a large number of model parameters. This provides a direction for building remote sensing interpretation models that possess both high expressive power and high computational efficiency.
[0008] To overcome the limitations of existing Mamba methods in multi-scale feature fusion and local detail capture, and to effectively solve problems such as unclear feature boundaries, mutual occlusion of targets, and insufficient multi-scale target segmentation accuracy in remote sensing image segmentation, this invention proposes an innovative MambaMoENet model. This model deeply integrates the efficient global modeling capability of the State Space Model (SSM) with the dynamic expert collaboration mechanism of MoE, providing a new solution for semantic segmentation of high-resolution remote sensing images that combines efficiency, precision, and robustness.
Summary of the Invention
[0009] The purpose of this invention is to provide a semantic segmentation method for remote sensing images based on state-space modeling and hybrid expert mechanisms, which aims to solve the technical problems of unclear feature boundaries, mutual occlusion of targets, and insufficient multi-scale target segmentation accuracy in existing technologies.
[0010] To achieve the above objectives, the technical solution of this invention is: a remote sensing image semantic segmentation method based on state-space modeling and a hybrid expert mechanism (MoE). This method combines SSM with a hybrid expert system (MoE) and proposes an innovative MambaMoENet model, comprising the following steps:
[0011] Step 1: Input the image to be segmented into the encoder network for stepwise feature extraction to obtain a multi-scale encoded feature set; wherein, the encoder uses a ResT network as the feature extraction backbone, and the ResT network includes a Stem module and four stages for stepwise extraction;
[0012] Step 2: Input the multi-scale encoded feature set into the decoder for stage-by-stage decoding to obtain the highest resolution decoded features. Then, upsample the highest resolution decoded features by four times to obtain the final decoded features. The decoder includes the MMFusion module, VSSLayer module, PatchExpand module, and FinalPatchExpand module.
[0013] Step 3: Input the final decoded features into the trained semantic segmentation model to obtain the semantic segmentation result output by the semantic segmentation model.
[0014] Optionally, Step 1 specifically includes:
[0015] The Stem module is used to extract low-level features from the image to be segmented. Four stages are then used to progressively downsample and increase the channel dimension of the image to be segmented, resulting in four encoded feature maps at different scales. The expression is as follows:
[0016] $F_1 = \mathrm{Stage}_1(\mathrm{Stem}(X))$
[0017] $F_2 = \mathrm{Stage}_2(F_1)$
[0018] $F_3 = \mathrm{Stage}_3(F_2)$
[0019] $F_4 = \mathrm{Stage}_4(F_3)$
[0020] where $X$ is the image to be segmented, $B$ is the number of images, and $H$ and $W$ are the height and width of the image, respectively; $\mathrm{Stage}_1$, $\mathrm{Stage}_2$, $\mathrm{Stage}_3$, $\mathrm{Stage}_4$ are the feature extraction stages of the backbone network; $F_1$, $F_2$, $F_3$, $F_4$ are four encoded feature maps with progressively decreasing spatial resolution and progressively enhanced semantic information. $F_1$ is the encoded feature map with the highest spatial resolution and the shallowest semantic information, and $F_4$ is the encoded feature map with the lowest spatial resolution and the deepest semantic information, thus giving the multi-scale encoded feature set $\{F_1, F_2, F_3, F_4\}$.
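As a rough illustration of the "progressively decreasing spatial resolution" described above, the following sketch computes the spatial size of each encoded feature map. The cumulative downsampling factors 4, 8, 16, and 32 are an assumption made for illustration (they are common in hierarchical backbones but are not stated in this text), and `encoder_feature_sizes` is a hypothetical helper, not part of the disclosed method.

```python
from typing import List, Tuple

# ASSUMPTION: cumulative downsampling factors of F1..F4; not given in the text.
ASSUMED_STRIDES: Tuple[int, ...] = (4, 8, 16, 32)


def encoder_feature_sizes(
    height: int, width: int, strides: Tuple[int, ...] = ASSUMED_STRIDES
) -> List[Tuple[int, int]]:
    """Return the (height, width) of each encoded feature map, F1 first."""
    sizes = []
    for stride in strides:
        if height % stride or width % stride:
            raise ValueError(f"image size must be divisible by {stride}")
        sizes.append((height // stride, width // stride))
    return sizes


sizes = encoder_feature_sizes(512, 512)
for index, (h, w) in enumerate(sizes, start=1):
    print(f"F{index}: {h} x {w}")
```

Under these assumed factors, F1 keeps the most spatial detail and F4 the least, matching the ordering of the multi-scale encoded feature set.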
[0021] Optionally, in each decoding stage, the decoding features from the previous stage are input into the PatchExpand module and upsampled by a factor of two to obtain shallow decoding features; the feature map with the lowest spatial resolution is used as the initial decoding input, and decoding proceeds stage by stage from low resolution to high resolution.
[0022] Optionally, the MMFusion module is used to fuse shallow decoding features with corresponding scale coding features. The fusion includes sequentially performing feature alignment, multi-expert feature extraction, and feature enhancement, specifically:
[0023] A 1×1 convolutional projection and normalization are applied to the shallow decoding features $U_i$ to achieve feature alignment, yielding the dimension-aligned decoding features $\hat{U}_i$;
[0024] The encoded features $F_i$ at the scale corresponding to the shallow decoding features are upsampled by bilinear interpolation to the same spatial resolution as $U_i$, yielding the interpolated encoded features $\tilde{F}_i$, and $\tilde{F}_i$ and $\hat{U}_i$ are concatenated along the channel dimension to obtain the concatenated features $Z_i$;
[0025] The concatenated features $Z_i$ are fed into the MoEExpert module within the MMFusion module for multi-expert feature extraction, yielding the multi-scale adaptive fusion features $Y_i$, and $Y_i$ is fed into VSSBlock for feature enhancement to obtain the initial fused decoding features $V_i$;
[0026] The initial fused decoding features $V_i$ and the dimension-aligned decoding features $\hat{U}_i$ are combined by a residual connection to obtain the final fused decoding features $M_i$.
[0027] Optionally, feeding the concatenated features $Z_i$ into the MoEExpert module within the MMFusion module for multi-expert feature extraction specifically comprises:
[0028] The MoEExpert module consists of multiple parallel expert branches, each applying a convolution with a different dilation rate $r_k$. For the concatenated features $Z_i$, the feature extraction of the $k$-th expert branch is expressed as:
[0029] $E_k = \mathrm{GeLU}(\mathrm{BN}(\mathrm{DConv}_{r_k}(Z_i)))$
[0030] where $E_k$ is the feature extraction output of the $k$-th expert, $\mathrm{DConv}_{r_k}$ denotes a convolution operation with dilation rate $r_k$, $\mathrm{BN}$ denotes batch normalization, and $\mathrm{GeLU}$ denotes the GeLU nonlinear activation function;
[0031] The feature extraction outputs of the experts are summed element-wise to obtain the joint features $S$. The joint features $S$ are then fed into a shared attention network, where global context information is extracted using global average pooling. The expression is as follows:
[0032] $S = \sum_{k=1}^{K} E_k$
[0033] $g = \mathrm{GAP}(S)$
[0034] where $K$ is the number of experts, $\mathrm{GAP}$ denotes global average pooling, and $g$ is the global context information;
[0035] Two convolution layers with a GeLU activation function generate the expert weight vector $w$, which is normalized using Softmax. The expression is:
[0036] $w = \mathrm{Softmax}(\mathrm{Conv}(\mathrm{GeLU}(\mathrm{Conv}(g))))$
[0037] where $\mathrm{Conv}$ denotes a convolution and $\mathrm{GeLU}$ denotes the GeLU nonlinear activation function;
[0038] The feature extraction outputs of the experts are weighted and summed according to the expert weight vector to obtain the multi-scale adaptive fusion features of the MoEExpert module, $Y_i = \sum_{k=1}^{K} w_k E_k$.
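The gating steps above (element-wise sum, global average pooling, two convolution layers with GeLU, Softmax, weighted sum) can be sketched in plain Python on tiny feature maps. This is a minimal sketch under stated assumptions, not the patented implementation: the gate convolutions are assumed to be 1×1 (which, on a pooled 1×1 map, reduces to a matrix-vector product), the weight matrices `w1` and `w2` are made-up illustrative values, and `moe_fuse` is a hypothetical name.

```python
import math
from typing import List, Tuple

Feature = List[List[float]]  # indexed as [channel][pixel]


def gelu(x: float) -> float:
    """Exact GeLU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: List[List[float]], x: List[float]) -> List[float]:
    """ASSUMPTION: a 1x1 convolution on a pooled map is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def moe_fuse(
    experts: List[Feature], w1: List[List[float]], w2: List[List[float]]
) -> Tuple[List[float], Feature]:
    channels, pixels = len(experts[0]), len(experts[0][0])
    # Joint features: element-wise sum of all expert outputs.
    joint = [[sum(e[c][p] for e in experts) for p in range(pixels)]
             for c in range(channels)]
    # Global context: global average pooling per channel.
    context = [sum(row) / pixels for row in joint]
    # Gate: conv -> GeLU -> conv -> Softmax, giving one weight per expert.
    hidden = [gelu(v) for v in linear(w1, context)]
    gate = softmax(linear(w2, hidden))
    # Fused features: expert outputs weighted by the gate and summed.
    fused = [[sum(g * e[c][p] for g, e in zip(gate, experts))
              for p in range(pixels)] for c in range(channels)]
    return gate, fused


experts = [
    [[1.0, 2.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
w1 = [[0.2, -0.1], [0.3, 0.4]]                # illustrative: hidden x channels
w2 = [[0.5, -0.5], [0.1, 0.2], [-0.3, 0.6]]   # illustrative: experts x hidden
gate, fused = moe_fuse(experts, w1, w2)
print([round(g, 3) for g in gate])
```

Because the Softmax weights sum to one, the fused output is a convex combination of the expert outputs; if all experts produced the same map, the fusion would return that map unchanged.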
[0039] Optionally, the VSSLayer module performs state-space modeling on the final fused decoding features $M_i$ to obtain the decoding features $D_i$ of the current stage, which are then used as the input features of the next decoding stage. After all decoding stages are completed, the highest-resolution decoding features $D_1$ are obtained and fed into the FinalPatchExpand module for 4× upsampling to obtain the final decoded features $D_{\mathrm{out}}$.
[0040] Optionally, an AdaptiveLocalSupervision module is constructed to train the semantic segmentation model, specifically as follows:
[0041] For the decoding features $D_i$ of each decoding stage, three parallel convolutional branches are employed, each using a 3×3 depthwise separable convolution with a different dilation rate $d_j$, to obtain the multi-scale receptive field features $R_j$. The expression is:
[0042] $R_j = \mathrm{DWConv}_{3\times3}^{d_j}(D_i), \quad j = 1, 2, 3$
[0043] Spatial attention weights are constructed from the decoding features $D_i$ of each decoding stage: along the channel dimension, the mean $\mu_i$, the maximum $m_i$, and the standard deviation $\sigma_i$ of $D_i$ are computed separately, giving the three-channel spatial description features $Q_i$:
[0044] $Q_i = \mathrm{Concat}(\mu_i, m_i, \sigma_i)$
[0045] where $\mathrm{Concat}$ denotes the concatenation operation;
[0046] The spatial description features $Q_i$ are fed into a spatial attention network to generate the spatial weight map $A_s$. The expression is:
[0047] $A_s = \mathrm{sigmoid}(\mathrm{Conv}(Q_i))$
[0048] where sigmoid is the activation function and $\mathrm{Conv}$ denotes a convolution operation;
[0049] Channel attention modeling is performed on the decoding features $D_i$ of each decoding stage: global average pooling is applied, and the channel weight vector $A_c$ is generated through two layers of 1×1 convolutions and a non-linear mapping. The expression is:
[0050] $A_c = \Phi_c(\mathrm{GAP}(D_i))$
[0051] where $\Phi_c$ denotes the two layers of $1 \times 1$ convolution together with the non-linear mapping;
[0052] The outputs $R_j$ of the three dilated convolution branches are summed element-wise, and the spatial attention weights are applied to obtain the spatially weighted multi-scale local features $R_s$, which are concatenated along the channel dimension with the channel-weighted backbone features, as shown in the expression:
[0053] $R_s = A_s \odot \sum_{j=1}^{3} R_j$
[0054] $T_i = \mathrm{Concat}(R_s, A_c \odot D_i)$
[0055] where $\odot$ denotes element-wise multiplication and $\mathrm{Concat}$ denotes concatenation along the channel dimension;
[0056] The concatenated features $T_i$ are fused using a 1×1 convolution to restore the channel dimension and obtain the fused features $G_i$. The expression is:
[0057] $G_i = \mathrm{Conv}_{1\times1}(T_i)$
[0058] The fused features $G_i$ are fed into the prediction head, generating the auxiliary segmentation result at the current scale, which serves as the local auxiliary supervision result. The expression is:
[0059] $P^{\mathrm{aux}}_i = \mathrm{Head}(G_i)$
[0060] where $P^{\mathrm{aux}}_i$ is the local auxiliary supervision result.
[0061] The beneficial effects of this invention are:
[0062] 1. This invention designs a dynamic multi-scale expert network (MoEExpert) and a Mamba-MoE collaborative fusion module (MMFusion), which achieves adaptive fusion of pixel-level multi-granular features through parallel multi-scale convolution and shared attention gating mechanism.
[0063] 2. This invention introduces the Visual State Space Block (VSSBlock) to model long-range dependencies. Working in conjunction with MoEExpert, it forms a global-local feature enhancement mechanism that significantly improves the model's ability to process multi-scale targets and effectively addresses blurred ground-feature boundaries and target occlusion.
[0064] 3. This invention designs an adaptive local supervision module, which generates intermediate supervision signals through multi-scale convolutional branches and a dual attention mechanism, optimizes the gradient propagation path, and enhances the model's feature representation ability and adaptability.
[0065] 4. The selective state space mechanism of Mamba used in this invention can dynamically adjust and model long-range features, further enhancing the semantic consistency and contextual understanding of boundary pixel regions. Combined with the MoE expert system, in which each of multiple expert branches is responsible for perceiving feature information at a specific scale, it realizes full-process feature enhancement from multi-scale feature extraction to long-range dependency modeling, effectively improving segmentation accuracy.
Attached Figure Description
[0066] Figure 1 is a network structure diagram of the present invention.
Detailed Implementation
[0067] The technical solutions of the present invention will now be clearly and completely described with reference to the accompanying drawings in the embodiments of the present invention.
[0068] Example 1: As shown in Figure 1, a remote sensing image semantic segmentation method based on state-space modeling and a hybrid expert mechanism includes the following steps:
[0069] Step 1: Input the image to be segmented into the encoder network for stepwise feature extraction to obtain a multi-scale encoded feature set; wherein, the encoder uses a ResT network as the feature extraction backbone, and the ResT network includes a Stem module and four stages for stepwise extraction;
[0070] It should be understood that this embodiment uses a ResT encoder to extract multi-scale encoded features. First, the Stem module extracts low-level features, and then four stages capture multi-scale feature maps in sequence. Each stage consists of three key parts: the Patch Embedding module, which downsamples the feature map and increases the channel dimension; the Positional Embedding (PE) module, which injects position information; and the Efficient TransformerBlock, which further refines semantic features. Together they achieve efficient fusion of global and local information.
[0071] Specifically, in this embodiment, the Stem module is used to extract low-level features of the image to be segmented, and four stages are used to sequentially downsample and increase the channel dimension of the image to be segmented, thereby obtaining four encoded feature maps at different scales, as expressed in the following expression:
[0072] $F_1 = \mathrm{Stage}_1(\mathrm{Stem}(X))$
[0073] $F_2 = \mathrm{Stage}_2(F_1)$
[0074] $F_3 = \mathrm{Stage}_3(F_2)$
[0075] $F_4 = \mathrm{Stage}_4(F_3)$
[0076] where $X$ is the image to be segmented, $B$ is the number of images, and $H$ and $W$ are the height and width of the image, respectively; $\mathrm{Stage}_1$, $\mathrm{Stage}_2$, $\mathrm{Stage}_3$, $\mathrm{Stage}_4$ are the feature extraction stages of the backbone network; $F_1$, $F_2$, $F_3$, $F_4$ are four encoded feature maps with progressively decreasing spatial resolution and progressively enhanced semantic information. $F_1$ is the encoded feature map with the highest spatial resolution and the shallowest semantic information, and $F_4$ is the encoded feature map with the lowest spatial resolution and the deepest semantic information, thus giving the multi-scale encoded feature set $\{F_1, F_2, F_3, F_4\}$.
[0077] Step 2: Input the multi-scale encoded feature set into the decoder for stage-by-stage decoding to obtain the highest resolution decoded features. Then, upsample the highest resolution decoded features by four times to obtain the final decoded features. The decoder includes the MMFusion module, VSSLayer module, PatchExpand module, and FinalPatchExpand module.
[0078] Optionally, the initial decoding features are obtained first. In this embodiment, the initial decoding features are $D_4 = F_4$. In each decoding stage, the decoding features $D_{i+1}$ from the previous stage are input into the PatchExpand module and upsampled by a factor of two to obtain the shallow decoding features $U_i$, whose spatial resolution is aligned with that of the multi-scale encoded features $F_i$. The feature map with the lowest spatial resolution is used as the initial decoding input, and decoding proceeds stage by stage from low resolution to high resolution, as expressed below:
[0079] $D_4 = F_4$
[0080] $U_i = \mathrm{PatchExpand}(D_{i+1}), \quad i = 3, 2, 1$
[0081] Optionally, the MMFusion module is used to fuse the shallow decoding features $U_i$ with the encoded features $F_i$ of the corresponding scale to obtain the final fused decoding features $M_i$. The expression is:
[0082] $M_i = \mathrm{MMFusion}(U_i, F_i)$
[0083] The fusion process includes sequentially performing feature alignment, multi-expert feature extraction, and feature enhancement, specifically:
[0084] A 1×1 convolutional projection and normalization are applied to the shallow decoding features $U_i$ to achieve feature alignment, yielding the dimension-aligned decoding features $\hat{U}_i$. The expression is:
[0085] $\hat{U}_i = \mathrm{Norm}(\mathrm{Conv}_{1\times1}(U_i))$
[0086] where $\mathrm{Norm}$ denotes the normalization operation;
[0087] The encoded features $F_i$ at the scale corresponding to the shallow decoding features are upsampled by bilinear interpolation to the same spatial resolution as $U_i$, yielding the interpolated encoded features $\tilde{F}_i$, and $\tilde{F}_i$ and $\hat{U}_i$ are concatenated along the channel dimension to obtain the concatenated features $Z_i$. The expression is:
[0088] $\tilde{F}_i = \mathrm{Bilinear}(F_i)$
[0089] $Z_i = \mathrm{Concat}(\tilde{F}_i, \hat{U}_i)$
[0090] where $\mathrm{Bilinear}$ denotes the bilinear interpolation operation;
[0091] The concatenated features $Z_i$ are fed into the MoEExpert module within the MMFusion module for multi-expert feature extraction, yielding the multi-scale adaptive fusion features $Y_i$, and $Y_i$ is fed into VSSBlock for long-range dependency modeling to achieve feature enhancement and obtain the initial fused decoding features $V_i$. The expression is:
[0092] $Y_i = \mathrm{MoEExpert}(Z_i)$
[0093] $V_i = \mathrm{VSSBlock}(Y_i)$
[0094] The initial fused decoding features $V_i$ and the dimension-aligned decoding features $\hat{U}_i$ are combined by a residual connection to obtain the final fused decoding features $M_i$. The resulting $M_i$ is used in the subsequent step, where it is fed into the VSSLayer module for feature enhancement and for updating the decoding state.
[0095] It should be understood that, to fuse features effectively, fully exploit the complementarity of multi-scale features, and improve segmentation performance, this embodiment proposes the MMFusion module, which builds on the MoEExpert module. The MMFusion module combines the MoE expert system with the Mamba-based spatiotemporal feature modeling module to achieve end-to-end feature enhancement from multi-scale feature extraction to long-range dependency modeling, effectively improving segmentation accuracy.
[0096] Optionally, feeding the concatenated features $Z_i$ into the MoEExpert module within the MMFusion module for multi-expert feature extraction specifically comprises:
[0097] The MoEExpert module consists of multiple parallel expert branches, each applying a convolution with a different dilation rate $r_k \in \{1, 3, 6\}$ to extract feature information from different receptive fields at the same spatial resolution. Each expert branch contains a sequence of convolutional and normalization layers, forming a non-linear feature mapping. For the concatenated features $Z_i$, the feature extraction of the $k$-th expert branch is expressed as:
[0098] $E_k = \mathrm{GeLU}(\mathrm{BN}(\mathrm{DConv}_{r_k}(Z_i)))$
[0099] where $E_k$ is the feature extraction output of the $k$-th expert, $\mathrm{DConv}_{r_k}$ denotes a convolution operation with dilation rate $r_k$, $\mathrm{BN}$ denotes batch normalization, and $\mathrm{GeLU}$ denotes the GeLU nonlinear activation function. This yields the multi-scale expert feature extraction outputs $\{E_k\}$, which serve as the input for dynamic weight calculation and weighted fusion;
[0100] It should be understood that different dilation rates are used in this embodiment to match three common scales in remote sensing images: the low dilation rate captures the details of small targets, the medium dilation rate focuses on medium-scale structures such as buildings and waterways, and the high dilation rate helps to extract scene-level contextual information.
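To make the effect of the three dilation rates concrete, the sketch below computes the span of input pixels covered by a single dilated convolution, using the standard relation k_eff = k + (k - 1)(r - 1). The 3×3 kernel size is an assumption for illustration, since the kernel size of the expert branches is not stated here, and `effective_kernel_size` is a hypothetical helper name.

```python
def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Span covered by one dilated convolution: k + (k - 1) * (r - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


# ASSUMPTION: 3x3 kernels; the dilation rates 1, 3, 6 come from the text.
for rate in (1, 3, 6):
    span = effective_kernel_size(3, rate)
    print(f"dilation {rate}: covers {span} x {span} pixels")
```

With these assumed kernels, the three branches see 3×3, 7×7, and 13×13 neighborhoods at the same spatial resolution and with the same number of weights per kernel.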
[0101] To achieve adaptive fusion of expert features at different scales, the feature extraction outputs of the experts are summed element-wise to obtain the joint features $S$. The joint features $S$ are then fed into a shared attention network, where global context information is extracted using global average pooling. The expression is as follows:
[0102] $S = \sum_{k=1}^{K} E_k$
[0103] $g = \mathrm{GAP}(S)$
[0104] where $K$ is the number of experts, $\mathrm{GAP}$ denotes global average pooling, and $g$ is the global context information;
[0105] Two convolution layers with a GeLU activation function generate the expert weight vector $w$, which is normalized using Softmax. The expression is:
[0106] $w = \mathrm{Softmax}(\mathrm{Conv}(\mathrm{GeLU}(\mathrm{Conv}(g))))$
[0107] where $\mathrm{Conv}$ denotes a convolution;
[0108] The feature extraction outputs of the experts are weighted and summed according to the expert weight vector to obtain the multi-scale adaptive fusion features of the MoEExpert module, $Y_i = \sum_{k=1}^{K} w_k E_k$, and $Y_i$ is input into VSSBlock for long-range dependency modeling.
[0109] Optionally, the VSSLayer module performs state-space modeling on the final fused decoding features $M_i$ to obtain the decoding features $D_i$ of the current stage, which are then used as the input features of the next decoding stage. After all decoding stages are completed, the highest-resolution decoding features $D_1$ are obtained and fed into the FinalPatchExpand module for 4× upsampling to obtain the final decoded features $D_{\mathrm{out}}$.
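The stage-by-stage decoding order can be traced with a small sketch that tracks only spatial sizes. It is a sketch under stated assumptions rather than the disclosed decoder: `patch_expand` is a hypothetical stand-in that models only the change in resolution, MMFusion and VSSLayer are assumed to leave the spatial size unchanged, and the encoder sizes in the usage line assume an illustrative 512×512 input with downsampling factors 4, 8, 16, and 32.

```python
from typing import List, Tuple

Size = Tuple[int, int]


def patch_expand(size: Size, factor: int = 2) -> Size:
    """Stand-in for PatchExpand / FinalPatchExpand: models resolution only."""
    return (size[0] * factor, size[1] * factor)


def decode_resolutions(encoder_sizes: List[Size]) -> List[Size]:
    """Trace decoder feature sizes, deepest stage first.

    encoder_sizes lists the encoded feature maps from highest to lowest
    resolution (F1 first, F4 last).
    """
    current = encoder_sizes[-1]  # initial decoding features come from F4
    trace = [current]
    for skip in reversed(encoder_sizes[:-1]):  # F3, then F2, then F1
        current = patch_expand(current, 2)     # shallow decoding features
        if current != skip:
            raise ValueError("upsampled features must match the encoder scale")
        # ASSUMPTION: MMFusion and VSSLayer keep the spatial size unchanged.
        trace.append(current)
    trace.append(patch_expand(current, 4))     # FinalPatchExpand, 4x
    return trace


trace = decode_resolutions([(128, 128), (64, 64), (32, 32), (16, 16)])
print(trace)
```

Under these assumptions, three 2× decoding stages followed by the final 4× upsampling bring the 16×16 map back to the 512×512 input size.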
[0110] It should be understood that, to enhance the model's ability to perceive multi-scale ground features in remote sensing imagery, this embodiment proposes a Dynamic Multi-Scale Expert Module (MoEExpert) to achieve fine-grained modeling and adaptive fusion of ground features at different scales. Because remote sensing imagery contains a rich variety of ground feature types with significant scale differences, traditional convolutional structures struggle to capture both local details and global semantics at a single scale. Therefore, this embodiment designs a module composed of multiple parallel convolutional experts, in which each expert branch is responsible for perceiving feature information at a specific scale and pixel-level dynamic weighted fusion is achieved through a shared attention mechanism.
[0111] Step 3: Input the final decoded features into the trained semantic segmentation model to obtain the semantic segmentation result output by the semantic segmentation model.
[0112] Optionally, an AdaptiveLocalSupervision module is constructed to train the semantic segmentation model. During the training phase, the decoding features $D_i$ of each decoding stage are fed into the AdaptiveLocalSupervision module to generate the local auxiliary supervision result $P^{\mathrm{aux}}_i$, and $P^{\mathrm{aux}}_i$ is upsampled to the same spatial size as the input image to supervise and constrain the current decoding stage, specifically:
[0113] For the decoding features $D_i$ of each decoding stage, three parallel convolutional branches are employed, each using a 3×3 depthwise separable convolution with a different dilation rate $d_j$, to simultaneously capture fine-grained local information, mesoscale context, and large-scale scene information, thereby obtaining the multi-scale receptive field features $R_j$. The expression is:
[0114] $R_j = \mathrm{DWConv}_{3\times3}^{d_j}(D_i), \quad j = 1, 2, 3$
[0115] To enhance the model's ability to focus on spatially significant regions (such as target boundaries and key structures), spatial attention weights are constructed from the decoding features $D_i$ of each decoding stage: along the channel dimension, the mean $\mu_i$, the maximum $m_i$, and the standard deviation $\sigma_i$ of $D_i$ are computed separately, giving the three-channel spatial description features $Q_i$:
[0116] $Q_i = \mathrm{Concat}(\mu_i, m_i, \sigma_i)$
[0117] where $\mathrm{Concat}$ denotes the concatenation operation;
[0118] The spatial description features $Q_i$ are fed into a spatial attention network to generate the spatial weight map $A_s$. The expression is:
[0119] $A_s = \mathrm{sigmoid}(\mathrm{Conv}(Q_i))$
[0120] where sigmoid is the activation function and $\mathrm{Conv}$ denotes a convolution operation;
[0121] To adaptively adjust the importance of features from different channels, channel attention modeling is performed on the decoding features $D_i$ of each decoding stage: global average pooling is applied, and the channel weight vector $A_c$ is generated through two layers of 1×1 convolutions and a non-linear mapping, denoted together as $\Phi_c$. The expression is:
[0122] $A_c = \Phi_c(\mathrm{GAP}(D_i))$
[0123] The outputs $R_j$ of the three dilated convolution branches are summed element-wise, and the spatial attention weights are applied to obtain the spatially weighted multi-scale local features $R_s$, which are concatenated along the channel dimension with the channel-weighted backbone features, as shown in the expression:
[0124] $R_s = A_s \odot \sum_{j=1}^{3} R_j$
[0125] $T_i = \mathrm{Concat}(R_s, A_c \odot D_i)$
[0126] where $\odot$ denotes element-wise multiplication and $\mathrm{Concat}$ denotes concatenation along the channel dimension;
[0127] The concatenated features $T_i$ are fused using a 1×1 convolution to restore the channel dimension and obtain the fused features $G_i$. The expression is:
[0128] $G_i = \mathrm{Conv}_{1\times1}(T_i)$
[0129] The fused features $G_i$ are fed into the prediction head, generating the auxiliary segmentation result at the current scale, which serves as the local auxiliary supervision result. The expression is:
[0130] $P^{\mathrm{aux}}_i = \mathrm{Head}(G_i)$
[0131] where $P^{\mathrm{aux}}_i$ is the local auxiliary supervision result.
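The three-channel spatial description (per-pixel mean, maximum, and standard deviation across channels) and the resulting spatial weights can be illustrated in plain Python. This is a minimal sketch with loudly stated assumptions: the population standard deviation is used because the text does not say which variant applies, and `spatial_weights` replaces the learned spatial attention convolution with a hypothetical fixed per-pixel linear mix followed by sigmoid, purely for illustration.

```python
import math
from typing import List, Sequence

Feature = List[List[float]]  # indexed as [channel][pixel]


def spatial_descriptor(feature: Feature) -> Feature:
    """Return [mean, max, std] rows computed across channels for each pixel."""
    channels, pixels = len(feature), len(feature[0])
    mean, peak, std = [], [], []
    for p in range(pixels):
        column = [feature[c][p] for c in range(channels)]
        mu = sum(column) / channels
        mean.append(mu)
        peak.append(max(column))
        # ASSUMPTION: population standard deviation (divide by channel count).
        std.append(math.sqrt(sum((v - mu) ** 2 for v in column) / channels))
    return [mean, peak, std]


def spatial_weights(
    descriptor: Feature, mix: Sequence[float] = (1.0, 1.0, 1.0), bias: float = 0.0
) -> List[float]:
    """HYPOTHETICAL stand-in for the spatial attention network.

    A real module would apply a learned convolution; here each pixel gets a
    fixed linear mix of its three descriptor values followed by sigmoid.
    """
    pixels = len(descriptor[0])
    logits = [sum(m * row[p] for m, row in zip(mix, descriptor)) + bias
              for p in range(pixels)]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


feature = [[1.0, 0.0], [3.0, 0.0]]  # two channels, two pixels
descriptor = spatial_descriptor(feature)
weights = spatial_weights(descriptor)
print(descriptor, [round(w, 4) for w in weights])
```

In this toy input, the first pixel has large and varied activations while the second is flat, so the first pixel receives the larger spatial weight, which is the intended behavior of attention that emphasizes boundaries and key structures.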
[0132] It should be understood that, because ground features in remote sensing images differ significantly in scale, morphology, and distribution, deep networks often struggle to preserve local details and boundary information while capturing global semantic information. Therefore, this embodiment constructs an AdaptiveLocalSupervision module that provides local auxiliary supervision signals for the decoding features at different scales during decoding, thereby improving the model's ability to characterize boundaries and details. Through multi-scale convolutional branches and a dual attention mechanism, the model can adaptively focus on key regions and important scale information in remote sensing images. This strengthens feature representation and gradient flow, alleviates insufficient gradient propagation in deep decoding networks, and guides each decoding stage to learn discriminative feature representations that match its spatial resolution, thus improving segmentation accuracy.
[0133] Furthermore, the specific implementation process of Step 2 and Step 3 of this embodiment will be described in detail below.
[0134] It should be understood that the MambaMoENet model proposed in this embodiment consists of two parts: a ResT encoder and a decoder based on Mamba and MoE. The encoder extracts multi-scale features, while the decoder further fuses and enhances these features and outputs the final segmentation result. Within the decoder, this embodiment designs three modules: the MoEExpert module for multi-expert feature extraction, the MMFusion module for shallow-deep feature fusion, and the AdaptiveLocalSupervision module for local auxiliary supervision. Together they aim to improve the network's ability to model multi-scale information and to strengthen gradient propagation.
[0135] Specifically, the decoder receives the multi-scale encoded feature set $\{F_1, F_2, F_3, F_4\}$ output by the encoder, together with the dimensions of the image to be segmented.
[0136] In the first stage of decoding, the decoder input is initialized from the deepest features $F_4$ by defining the initial decoding features $D_4$. The expression is:
[0137] $D_4 = F_4$
[0138] First, the PatchExpand module performs 2× upsampling on the decoding features $D_4$, yielding the decoding features $U_3$ with improved spatial resolution. The spatial resolution of $U_3$ is aligned with that of the encoded features $F_3$, and $U_3$ serves as the shallow decoding features of the current decoding stage.
[0139] $U_3 = \mathrm{PatchExpand}(D_4)$
[0140] Next, the obtained decoding features $U_3$ and the encoder features $F_3$ of the corresponding scale are input into the MMFusion module. First, the deep and shallow features are concatenated to obtain the concatenated features $Z_3$. The concatenated features $Z_3$ are fed into the MoEExpert module, where multi-expert parallel feature extraction produces the multi-scale expert output set $\{E_k\}$. A gating mechanism generates expert weights and uses them to weight and fuse the outputs of the experts, producing the multi-expert weighted fusion output $Y_3$. The multi-expert fused features $Y_3$ are fed into the state-space-model-based VSSBlock to perform long-range dependency modeling and obtain the fused decoding features $M_3$.
[0141]
[0142]
[0143]
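As an illustration of the gated fusion just described, the following is a minimal sketch in plain Python rather than the embodiment's implementation. It follows the order given in the claims (element-wise sum of expert outputs, global average pooling, Softmax-normalized expert weights, weighted sum). The function names, the `[channels][pixels]` list layout, and the `gate` callable that stands in for the two-layer gating network are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def global_average_pool(feature):
    """Average each channel (a flat list of pixel values) to one scalar."""
    return [sum(channel) / len(channel) for channel in feature]


def moe_fuse(expert_outputs, gate):
    """Fuse expert outputs that share the layout [channels][pixels].

    `gate` is a hypothetical callable mapping the pooled context vector
    to one logit per expert; it stands in for the gating network.
    """
    # Regroup values as channel -> pixel -> (one value per expert).
    grouped = [list(zip(*channels)) for channels in zip(*expert_outputs)]
    # Joint features: element-wise sum over experts.
    joint = [[sum(values) for values in channel] for channel in grouped]
    context = global_average_pool(joint)
    weights = softmax(gate(context))
    # Weighted sum of the expert outputs using the expert weights.
    fused = [
        [sum(w * v for w, v in zip(weights, values)) for values in channel]
        for channel in grouped
    ]
    return fused, weights
```

With a gate that returns equal logits, the weights are uniform and the fused output reduces to the mean of the expert outputs, which is a convenient sanity check.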
[0144] The fused decoding features are fed into the VSSLayer to further model long-range dependencies, resulting in context-enhanced decoded features. These serve both as the input features for the next decoding stage and as the input to the local auxiliary supervision module.
[0145]
[0146] During the training phase, the context-enhanced decoded features are fed into the AdaptiveLocalSupervision module to generate a local auxiliary supervision output, which is then upsampled to the spatial size of the input image to provide a supervisory constraint for the current decoding stage.
[0147]
[0148] Where C represents the output channel dimension;
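One ingredient of the AdaptiveLocalSupervision module, as the claims describe it, is a three-channel spatial description built from the per-pixel mean, maximum, and standard deviation across channels. The sketch below illustrates only that step in plain Python. The function name and the `[channels][pixels]` list layout are assumptions, and the population standard deviation is used because the source does not say which estimator is intended.

```python
import statistics


def spatial_descriptor(feature):
    """Return [mean, max, std] maps computed across the channel dimension.

    `feature` is a list of channels, each a flat list of pixel values.
    The population standard deviation is an assumption of this sketch.
    """
    per_pixel = list(zip(*feature))  # one tuple of channel values per pixel
    mean_map = [statistics.fmean(values) for values in per_pixel]
    max_map = [max(values) for values in per_pixel]
    std_map = [statistics.pstdev(values) for values in per_pixel]
    return [mean_map, max_map, std_map]
```

The three maps have the same spatial size as the input, so they can be stacked as a three-channel input to a spatial attention network.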
[0149] In the second decoding stage, which is the third layer of the decoder, the output of the first decoding stage is first upsampled by a factor of two using PatchExpand to obtain the shallow features. The shallow features and the corresponding deep features from the encoder are fed together into the MMFusion module for feature fusion. In MMFusion, the shallow and deep features are concatenated along the channel dimension to obtain the concatenated features. The concatenated features are fed into the MoEExpert module, where multiple experts extract features in parallel to produce a set of multi-scale expert outputs. A gating mechanism generates expert weights, which are used to weight and fuse the expert outputs into a multi-expert weighted fusion output. This output is then fed into the VSSBlock, a block based on a state-space model, for long-range dependency modeling, yielding the fused decoding features.
[0150]
[0151]
[0152]
[0153]
[0154] The fused decoding features are fed into the VSSLayer to further model long-range dependencies, resulting in context-enhanced decoded features. These serve as the input features for the next decoding stage.
[0155]
[0156] During the training phase, the context-enhanced decoded features are fed into the AdaptiveLocalSupervision module to generate a local auxiliary supervision output, which is then upsampled to the spatial size of the input image to provide a supervisory constraint for the current decoding stage.
[0157]
[0158] In the third decoding stage, which is the second layer of the decoder, the output of the second decoding stage is first upsampled by a factor of two using PatchExpand to obtain the shallow features. The shallow features and the corresponding deep features from the encoder are fed together into the MMFusion module for feature fusion. In MMFusion, the shallow and deep features are concatenated along the channel dimension to obtain the concatenated features. The concatenated features are fed into the MoEExpert module, where multiple experts extract features in parallel to produce a set of multi-scale expert outputs. A gating mechanism generates expert weights, which are used to weight and fuse the expert outputs into a multi-expert weighted fusion output. This output is then fed into the VSSBlock, a block based on a state-space model, for long-range dependency modeling, yielding the fused decoding features.
[0159]
[0160]
[0161]
[0162]
[0163] The fused decoding features are fed into the VSSLayer to further model long-range dependencies, resulting in context-enhanced decoded features. These serve as the input features for the next decoding stage.
[0164]
[0165] During the training phase, the context-enhanced decoded features are fed into the AdaptiveLocalSupervision module to generate a local auxiliary supervision output, which is then upsampled to the spatial size of the input image to provide a supervisory constraint for the current decoding stage.
[0166]
[0167] In the final decoding stage, that is, the last layer of the decoder, the output of the second layer of the decoder is upsampled by a factor of four using FinalPatchExpand and then passed through the segmentation head to obtain the final segmentation result.
[0168]
[0169]
[0170] where the operator in the expression above denotes the segmentation head operation;
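The upsampling factors can be checked with simple arithmetic: three 2x PatchExpand steps followed by one 4x FinalPatchExpand step give an overall factor of 2 × 2 × 2 × 4 = 32. The sketch below traces the spatial size through the decoder, assuming the deepest encoder features are at 1/32 of the input resolution. That stride is an assumption of this example and is not stated in this section, as are the function name and the stage labels.

```python
def decoder_resolutions(height, width, deepest_stride=32):
    """Trace (label, height, width) through the decoder stages.

    `deepest_stride` is an assumed downsampling factor of the deepest
    encoder features relative to the input image.
    """
    h, w = height // deepest_stride, width // deepest_stride
    trace = [("initial", h, w)]
    for stage in (1, 2, 3):
        h, w = h * 2, w * 2  # PatchExpand: 2x upsampling
        trace.append((f"stage{stage}", h, w))
    h, w = h * 4, w * 4      # FinalPatchExpand: 4x upsampling
    trace.append(("final", h, w))
    return trace
```

For a 512 × 512 input under this assumption, the trace runs 16, 32, 64, 128, and finally 512 pixels per side, so the segmentation head operates at the input resolution.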
[0171] Optionally, in this embodiment, decoding includes a training mode and an inference mode. In the training mode, the output is the final segmentation result together with the outputs of all AdaptiveLocalSupervision modules, combined by weighted summation; in the inference mode, only the final segmentation result is output.
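A minimal sketch of the two modes follows, in plain Python. It assumes that the weighted summation is applied to per-output loss values, which is one common reading and is not confirmed by the source. The function names and the auxiliary weights are hypothetical.

```python
def decoder_outputs(final_result, aux_results, training):
    """Return what the decoder emits in each mode."""
    if training:
        # Training: final result plus every auxiliary supervision output.
        return final_result, list(aux_results)
    # Inference: only the final segmentation result.
    return final_result


def combined_loss(main_loss, aux_losses, aux_weights):
    """Weighted sum of the main loss and the auxiliary losses.

    Assumes one scalar loss per output; the weights are hypothetical.
    """
    if len(aux_losses) != len(aux_weights):
        raise ValueError("need exactly one weight per auxiliary loss")
    return main_loss + sum(w * l for w, l in zip(aux_weights, aux_losses))
```

Because the auxiliary branches are skipped at inference time, they add training-time supervision without changing what the deployed model outputs.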
[0172] The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A method for remote sensing image semantic segmentation based on state space modeling and hybrid expert mechanism, characterized in that the method includes the following steps: Step 1: input the image to be segmented into the encoder network for stepwise feature extraction to obtain a multi-scale encoded feature set, wherein the encoder uses a ResT network as the feature extraction backbone, and the ResT network includes a Stem module and four stages for stepwise extraction; Step 2: input the multi-scale encoded feature set into the decoder for stage-by-stage decoding to obtain the highest-resolution decoded features, then upsample the highest-resolution decoded features by a factor of four to obtain the final decoded features, wherein the decoder includes the MMFusion module, the VSSLayer module, the PatchExpand module, and the FinalPatchExpand module; Step 3: input the final decoded features into the trained semantic segmentation model to obtain the semantic segmentation result output by the semantic segmentation model.
2. The method of claim 1, wherein Step 1 specifically comprises: the Stem module extracts low-level features from the image to be segmented; the four stages then progressively downsample the features and increase their channel dimension, resulting in four encoded feature maps at different scales, expressed as: ; ; ; ; wherein the symbols denote, in order: the image to be segmented; the number of images; the height and width of the image; the feature extraction backbone network of each layer; and the four encoded feature maps, whose spatial resolution gradually decreases and whose semantic information gradually strengthens, the first having the highest spatial resolution and the shallowest semantic information and the last having the lowest spatial resolution and the deepest semantic information; the multi-scale encoded feature set is thereby obtained.
3. The method of claim 2, wherein in each decoding stage, the decoding features from the previous stage are input into the PatchExpand module for twofold upsampling to obtain shallow decoding features; the feature map with the lowest spatial resolution is used as the initial decoding input, and decoding proceeds stage by stage from low resolution to high resolution.
4. The method of claim 3, wherein the MMFusion module is used to fuse the shallow decoding features with the encoded features of the corresponding scale, the fusion comprising, in sequence, feature alignment, multi-expert feature extraction, and feature enhancement, specifically: a 1x1 convolutional projection and normalization are applied to the shallow decoding features to achieve feature alignment, yielding dimension-aligned decoding features; the encoded features of the scale corresponding to the shallow decoding features are upsampled by bilinear interpolation to the same spatial resolution as the decoding features, yielding interpolated encoded features; the dimension-aligned decoding features and the interpolated encoded features are concatenated along the channel dimension to obtain the concatenated features; the concatenated features are input into the MoEExpert module in the MMFusion module for multi-expert feature extraction to obtain multi-scale adaptive fusion features, and the multi-scale adaptive fusion features are input into a VSSBlock for feature enhancement to obtain initial fused decoding features; a residual connection is applied between the initial fused decoding features and the dimension-aligned decoding features to obtain the final fused decoding features.
5. The remote sensing image semantic segmentation method based on state-space modeling and hybrid expert mechanism according to claim 4, characterized in that inputting the concatenated features into the MoEExpert module in the MMFusion module for multi-expert feature extraction specifically comprises: the MoEExpert module is composed of multiple parallel expert branches, each of which applies a convolution with a different dilation rate to the concatenated features; the expression of the feature extraction performed by each expert branch is: ; where the terms denote, in order, the expert feature extraction output, the convolution operation with the given dilation rate, the batch normalization operation, and the GeLU nonlinear activation function; the feature extraction outputs of the experts are summed element-wise to obtain the joint features; the joint features are then fed into a shared attention network, where global contextual information is extracted using global average pooling, expressed as: ; ; where GAP denotes global average pooling and its output is the global context information; expert weight vectors are generated through two convolution layers with a GeLU activation function and normalized using Softmax, expressed as: ; where the terms denote the convolution and the GeLU nonlinear activation function; the feature extraction outputs of the experts are weighted and summed according to the expert weight vector to obtain the multi-scale adaptive fusion features of the MoEExpert module.
6. The remote sensing image semantic segmentation method based on state-space modeling and hybrid expert mechanism according to claim 5, characterized in that the VSSLayer module performs state-space modeling on the final fused decoding features to obtain the decoding features of the current stage, which are then used as the input features for the next decoding stage; after all decoding stages are completed, the highest-resolution decoding features are obtained and fed into the FinalPatchExpand module for fourfold upsampling to obtain the final decoded features.
7. The remote sensing image semantic segmentation method based on state-space modeling and hybrid expert mechanism according to claim 6, characterized in that the AdaptiveLocalSupervision module is constructed to train the semantic segmentation model, specifically as follows: for the decoding features of each decoding stage, three parallel convolutional branches are employed, each using a 3×3 depthwise separable convolution with a different dilation rate, to obtain multi-scale receptive field features, expressed as: ; spatial attention weights are constructed from the decoding features of each decoding stage: the mean, the maximum, and the standard deviation are computed separately along the channel dimension and concatenated to obtain three-channel spatial description features: ; where the operator denotes the concatenation operation; the spatial description features are fed into a spatial attention network to generate a spatial weight map, expressed as: ; where sigmoid is the activation function and the remaining operator denotes a convolution operation; channel attention modeling is performed on the decoding features of each decoding stage: global average pooling is applied, and channel weight vectors are generated through two layers of 1×1 convolution and a non-linear mapping, expressed as: ; where the operator denotes a convolution operation; the outputs of the three dilated convolution branches are summed element-wise and weighted by the spatial attention to obtain spatially weighted multi-scale local features, which are concatenated along the channel dimension with the channel-weighted backbone features, expressed as: ; ; where the operators denote element-wise multiplication and concatenation along the channel dimension; the concatenated features are fused using a 1×1 convolution to restore the channel dimension and obtain the fused features, expressed as: ; the fused features are fed into the prediction head to generate the auxiliary segmentation result at the current scale, which serves as the local auxiliary supervision result, expressed as: ; where the output is the local auxiliary supervision result.