A method and apparatus for segmenting remote sensing images

By fusing multi-scale features from remote sensing images and single-channel images of digital surface models in the encoder, and utilizing an adaptive gating mechanism and a state-space mixer, the problem of insufficient capture of long-distance dependencies and global contextual information in remote sensing image segmentation is solved, achieving higher-precision segmentation results.

CN122244446A · Pending · Publication Date: 2026-06-19 · CHANGSHA UNIVERSITY OF SCIENCE AND TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHANGSHA UNIVERSITY OF SCIENCE AND TECHNOLOGY
Filing Date
2026-03-26
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing remote sensing image segmentation techniques, convolutional neural networks struggle to effectively capture long-range dependencies and low-frequency global contextual information of multimodal features, resulting in low segmentation accuracy.

Method used

By fusing multi-scale features from remote sensing images and single-channel maps of digital surface models in the encoder, and utilizing an adaptive gating mechanism and a state-space mixer, high-frequency details and mid-frequency structures are captured. Furthermore, location dependencies are generated through adaptive weights and soft attention weights, thereby achieving adaptive fusion and recombination of multi-scale features.

Benefits of technology

It significantly improves the segmentation accuracy and robustness of remote sensing images, effectively captures multi-scale features and long-range dependencies, enhances global contextual understanding, and overcomes the locality limitations of traditional convolutional neural networks.



Abstract

This application relates to a method and apparatus for segmenting remote sensing images. The method first enhances gradient information using a DSM data augmentation module, then fuses multi-scale features from a first and second image. Next, 3x3 and 5x5 separable convolutions are treated as a set of efficient filters to capture high-frequency details and mid-frequency structures in the multi-scale features of the image. Based on a state-space model of tensor decomposition, feature components of the state space are extracted from the feature map, thereby generating position-dependent soft attention weights and a state transition matrix to capture long-distance dependencies in the feature map. Channel attention is then used for feature enhancement. Subsequently, an adaptive gating mechanism generates weighting factors to achieve adaptive feature fusion between the encoder and decoder, finally yielding the image segmentation result. Through the above design, this application can improve the segmentation accuracy of remote sensing images.

Description

Technical Field

[0001] This application relates to the field of remote sensing image segmentation technology, and in particular to a method and apparatus for segmenting remote sensing images.

Background Technology

[0002] With the continuous advancement of remote sensing technology, multi-source heterogeneous data such as optical images, multispectral data, and digital terrain models are playing an increasingly important role in tasks such as urban planning, disaster assessment, road extraction, and ground feature classification. While these datasets provide multi-view information for Earth observation, they also introduce significant differences in modality and scale: the spectral, texture, and elevation information acquired by different sensors is heterogeneous, and objects in images often vary greatly in scale. Fully integrating multimodal features and capturing global contextual information in complex scenes to achieve high-precision semantic segmentation remains a major challenge.

[0003] To address these challenges, deep learning technology has become the mainstream method for semantic segmentation of remote sensing images. Convolutional neural networks, as the cornerstone of remote sensing image analysis, are adept at extracting texture and spatial features through local convolution operations. Essentially, they act as a set of learnable high-pass filters, effectively capturing high-frequency details of ground objects. However, the inherent locality of convolution operations limits their receptive field, making it difficult to model long-distance dependencies between pixels. This results in insufficient capture of low-frequency global contextual information and a deficiency in segmentation accuracy.

Summary of the Invention

[0004] A first aspect of this application provides a method for segmenting remote sensing images, the method comprising:

determining a first image and a second image, wherein the first image is a remote sensing image and the second image is a single-channel image of a digital surface model corresponding to the first image;

inputting the first image and the second image into a segmentation model to obtain the segmentation result of the first image output by the segmentation model, wherein the segmentation model includes an encoder and a decoder, and the process by which the segmentation model outputs the segmentation result of the first image includes:

extracting, by the encoder, a multi-scale first feature map from the first image and a multi-scale second feature map from the second image, and fusing the first and second feature maps of corresponding scales to obtain a multi-scale third feature map;

extracting, by the encoder, a multi-scale fourth feature map from the multi-scale third feature map, wherein the process of extracting the fourth feature map at any scale includes: extracting a first intermediate feature map from the third feature map based on depthwise convolution and adding the first intermediate feature map to the third feature map pixel-by-pixel to obtain a second intermediate feature map; extracting a third intermediate feature map from the second intermediate feature map based on a 1x1 convolution, extracting a fourth intermediate feature map from the third intermediate feature map based on a 3x3 depthwise convolution, and extracting a fifth intermediate feature map from the third intermediate feature map based on a 5x5 depthwise convolution; generating adaptive weights for the fourth and fifth intermediate feature maps based on an adaptive gating mechanism, and weighting and fusing the fourth and fifth intermediate feature maps according to the adaptive weights to obtain a sixth intermediate feature map; decomposing the tensor of the sixth intermediate feature map into a first feature component of the state space, generating position-dependent soft attention weights based on the first feature component, and generating a state transition matrix based on the soft attention weights; generating a seventh intermediate feature map based on the state transition matrix, the first feature component, and the first intermediate feature map; enhancing the seventh intermediate feature map based on a channel attention mechanism to obtain an eighth intermediate feature map; extracting a ninth intermediate feature map from the eighth intermediate feature map based on a gating mechanism; interacting the ninth intermediate feature map with the first feature component based on a projection mechanism to obtain a tenth intermediate feature map; and generating the fourth feature map based on the tenth intermediate feature map;

fusing the multi-scale fourth feature maps through the decoder to obtain the segmentation result of the first image.

[0005] The remote sensing image segmentation method provided in this application has at least the following beneficial effects: This application effectively integrates multimodal information by fusing multi-scale features of the first and second images early in the encoder stage to generate a multi-scale third feature map. In acquiring the fourth feature map output by the encoder, a second intermediate feature map is first extracted from the third feature map. Then, a multi-scale state-space mixer is introduced. This mixer treats 3x3 and 5x5 separable convolutions as a set of efficient high-pass and band-pass filters specifically for capturing high-frequency details and mid-frequency structures in the image. An adaptive gating mechanism is then used to generate weighting factors that dynamically adjust the importance of features at different scales, achieving adaptive feature fusion and yielding a sixth intermediate feature map. Next, a state-space model based on tensor decomposition is introduced to further integrate multimodal information from the first image. The first feature component of the state space is extracted from the sixth intermediate feature map, and position-dependent soft attention weights and a state transition matrix are generated accordingly. The state transition matrix is used together with the first feature component and the first intermediate feature map to generate the seventh intermediate feature map. This captures long-distance dependencies in the image during the feature extraction stage, which overcomes the shortcomings of traditional convolutional neural networks in capturing low-frequency global context information and avoids the low segmentation accuracy caused by locality limitations. Then, channel attention is used to enhance important features, and the features are further processed through projection and gating mechanisms to obtain the fourth feature map output by the encoder. Finally, the multi-scale fourth feature maps are fused by the decoder to obtain the segmentation result of the first image.

[0006] In summary, this application first extracts features from remote sensing images and DSM images, and then performs multimodal fusion through an encoder. In particular, when obtaining the output features of the encoder, a multi-scale state-space mixer is introduced, treating 3x3 and 5x5 separable convolutions as a set of efficient high-pass and band-pass filters specifically for capturing high-frequency details and mid-frequency structures in the image. At the same time, the state-space model within the mixer, with its global receptive field and linear recursive characteristics, acts as an intelligent low-pass filter, modeling long-range dependencies and macroscopic semantics with linear complexity. Through learnable gating weights, adaptive fusion and recombination of the entire spectrum of features from high to low frequencies are achieved. This design enables the multi-scale state-space mixer to cover the spectral information from local details to global layout in remote sensing images, capturing multi-scale features and long-range dependencies while fully mining the complementary information of the different modalities, significantly improving the segmentation of remote sensing images.

[0007] A second aspect of this application provides a segmentation apparatus for remote sensing images, the apparatus comprising: An image acquisition module is used to determine a first image and a second image; wherein the first image is a remote sensing image, and the second image is a single-channel image of a digital surface model corresponding to the first image; An image segmentation module is used to input the first image and the second image into a segmentation model to obtain a segmentation result of the first image output by the segmentation model; wherein, the segmentation model includes an encoder and a decoder, and the process of the segmentation model outputting the segmentation result of the first image includes: The encoder extracts a multi-scale first feature map from the first image and a multi-scale second feature map from the second image, and fuses the first and second feature maps of corresponding scales to obtain a multi-scale third feature map. The encoder extracts a multi-scale fourth feature map from the multi-scale third feature map. The process of extracting the fourth feature map at any scale includes: extracting a first intermediate feature map from the third feature map based on depthwise convolution and adding the first intermediate feature map to the third feature map pixel-by-pixel to obtain a second intermediate feature map; extracting a third intermediate feature map from the second intermediate feature map based on a 1x1 convolution, extracting a fourth intermediate feature map from the third intermediate feature map based on a 3x3 depthwise convolution, and extracting a fifth intermediate feature map from the third intermediate feature map based on a 5x5 depthwise convolution; generating adaptive weights for the fourth and fifth intermediate feature maps based on an adaptive gating mechanism, and weighting and fusing the fourth and fifth intermediate feature maps according to the adaptive weights. 
This weighted fusion yields the sixth intermediate feature map. The tensor of the sixth intermediate feature map is decomposed into the first feature component of the state space. Position-dependent soft attention weights are generated based on the first feature component, and a state transition matrix is generated based on the soft attention weights. The seventh intermediate feature map is generated based on the state transition matrix, the first feature component, and the first intermediate feature map. The seventh intermediate feature map is enhanced based on the channel attention mechanism to obtain the eighth intermediate feature map. The ninth intermediate feature map is extracted from the eighth intermediate feature map based on the gating mechanism. The ninth intermediate feature map and the first feature component are interacted based on the projection mechanism to obtain the tenth intermediate feature map. The fourth feature map is generated based on the tenth intermediate feature map. The segmentation result of the first image is obtained by fusing the multi-scale fourth feature maps through the decoder.

[0008] A third aspect of this application provides an electronic device including at least one controller and a memory for communicatively connecting to the controller; the memory stores instructions executable by the at least one controller to cause the at least one controller to perform a remote sensing image segmentation method as described in the first aspect of this application.

[0009] A fourth aspect of this application provides a computer-readable storage medium storing computer-executable instructions for causing a computer to perform a remote sensing image segmentation method as described in the first aspect of this application.

[0010] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Attached Figure Description

[0011] Figure 1 is a schematic diagram of a remote sensing image segmentation method provided in an embodiment of this application; Figure 2 is a schematic diagram of the segmentation model provided in the embodiments of this application; Figure 3 is a schematic diagram of the VFE module in Figure 2; Figure 4 is a schematic diagram of the ASAF module in Figure 2; Figure 5 is a schematic diagram illustrating a qualitative performance comparison on the Vaihingen test set provided in an embodiment of this application; Figure 6 is a schematic diagram illustrating a qualitative performance comparison on the Potsdam test set provided in an embodiment of this application; Figure 7 is a visual schematic diagram of the ablation experiment on the Vaihingen test set provided in the embodiments of this application; Figure 8 is a schematic diagram of a remote sensing image segmentation device provided in an embodiment of this application; Figure 9 is a schematic diagram of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0012] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0013] In the description of this application, the use of terms such as "first," "second," etc., is for the purpose of distinguishing technical features only and should not be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated or the order of the technical features indicated.

[0014] As shown in Figure 1, one embodiment of this application provides a method for segmenting remote sensing images, the method comprising:

Step S110: Determine the first image and the second image, wherein the first image is a remote sensing image and the second image is a single-channel image of the digital surface model corresponding to the first image;

Step S120: Input the first image and the second image into the segmentation model to obtain the segmentation result of the first image output by the segmentation model, wherein the segmentation model includes an encoder and a decoder, and the process by which the segmentation model outputs the segmentation result of the first image includes:

Step S1210: Extract a multi-scale first feature map from the first image and a multi-scale second feature map from the second image using the encoder, and fuse the first and second feature maps of the corresponding scales to obtain a multi-scale third feature map.

Step S1220: Extract a multi-scale fourth feature map from the multi-scale third feature map using the encoder, wherein the process of extracting the fourth feature map at any scale includes: extracting a first intermediate feature map from the third feature map based on depthwise convolution and adding the first intermediate feature map to the third feature map pixel by pixel to obtain a second intermediate feature map; extracting a third intermediate feature map from the second intermediate feature map based on 1x1 convolution, extracting a fourth intermediate feature map from the third intermediate feature map based on 3x3 depthwise convolution, and extracting a fifth intermediate feature map from the third intermediate feature map based on 5x5 depthwise convolution; generating adaptive weights for the fourth and fifth intermediate feature maps based on an adaptive gating mechanism, and weighting and fusing the fourth and fifth intermediate feature maps according to the adaptive weights to obtain the sixth intermediate feature map; decomposing the tensor of the sixth intermediate feature map into the first feature component of the state space, generating position-dependent soft attention weights based on the first feature component, and generating a state transition matrix based on the soft attention weights; generating the seventh intermediate feature map based on the state transition matrix, the first feature component, and the first intermediate feature map; enhancing the seventh intermediate feature map based on the channel attention mechanism to obtain the eighth intermediate feature map; extracting the ninth intermediate feature map from the eighth intermediate feature map based on the gating mechanism; interacting the ninth intermediate feature map with the first feature component based on the projection mechanism to obtain the tenth intermediate feature map; and generating the fourth feature map based on the tenth intermediate feature map.

Step S1230: Fuse the multi-scale fourth feature maps through the decoder to obtain the segmentation result of the first image.

[0015] In this embodiment, remote sensing images generally refer to Earth surface image data acquired from aerial or space platforms through remote sensing technology. These images may contain information in multiple bands such as visible light, infrared, multispectral, or hyperspectral, and are used for applications such as ground feature classification and target recognition.

[0016] A single-channel image of a digital surface model (DSM) is a single-channel grayscale image that represents surface elevation. It records the height of surface objects and can provide important three-dimensional structural context for semantic segmentation of remote sensing images.

[0017] A segmentation model is a deep learning model consisting of an encoder and a decoder. The encoder is responsible for extracting multi-level, multi-scale feature representations from the input image. It gradually reduces the spatial resolution of the feature map through a series of convolutional and pooling layers while increasing the number of feature channels to capture the semantic information of the image. The decoder is responsible for restoring the abstract feature map extracted by the encoder to the resolution of the original image and generating the final pixel-level classification result. It typically combines high-level semantic information with low-level spatial details through upsampling operations and skip connections.

[0018] In this embodiment, the first image in step S110 is a remote sensing image, and the second image is a single-channel digital surface model (DSM) image corresponding to the first image. These images can be acquired, for example, by manually selecting an image file, retrieving it from a preset image database, or by directly capturing it using a sensor. As one implementation, the operator can manually specify a remote sensing image and a corresponding DSM single-channel image through a graphical user interface.

[0019] Subsequently, the segmentation model in step S120 includes an encoder and a decoder. Before inputting the image, the image data can be preprocessed, such as normalization and resizing, to make it meet the input requirements of the model. For example, the remote sensing image and the DSM single-channel image can be adjusted to the fixed input size required by the segmentation model and the pixel values can be normalized before being sent to the input interface of the segmentation model.
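The preprocessing described above can be illustrated with a minimal NumPy sketch. The nearest-neighbour resize, the min-max normalization, the 256x256 target size, and the function names are all assumptions chosen for illustration; the application does not specify which resizing or normalization scheme is used.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array (illustrative choice)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows][:, cols]

def min_max_normalize(img, eps=1e-8):
    """Scale values to roughly [0, 1]; one of several possible normalizations."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + eps)

# Synthetic stand-ins for a remote sensing image and its single-channel DSM.
rng = np.random.default_rng(0)
rgb = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
dsm = rng.uniform(240.0, 310.0, size=(300, 400, 1))

# Hypothetical fixed input size of 256x256.
rgb_in = min_max_normalize(resize_nearest(rgb, 256, 256))
dsm_in = min_max_normalize(resize_nearest(dsm, 256, 256))
print(rgb_in.shape, dsm_in.shape)
```

Normalizing the DSM separately from the optical image keeps elevation values, which have a very different range from pixel intensities, on a comparable scale before fusion.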

[0020] The process by which the segmentation model outputs the segmentation result of the first image includes: The encoder can use a series of standard convolutional and pooling layers to extract feature maps of different scales. Feature fusion can be achieved by element-wise addition, concatenation, or element-wise multiplication. For example, the encoder can contain multiple downsampling modules, each consisting of a convolutional layer and a pooling layer, thereby generating feature maps of different scales at different levels. For fusion, the first and second feature maps of the corresponding scale can be concatenated along the channel dimension to form a third feature map.
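As a rough illustration of the concatenation option, the sketch below fuses per-scale feature maps along the channel dimension. The channel counts, spatial sizes, and the name `fuse_concat` are hypothetical; the feature maps are random arrays standing in for encoder outputs.

```python
import numpy as np

def fuse_concat(first_maps, second_maps):
    """Concatenate same-scale (C, H, W) feature maps along the channel axis."""
    fused = []
    for f1, f2 in zip(first_maps, second_maps):
        # Spatial sizes must match at each scale for concatenation to be valid.
        assert f1.shape[1:] == f2.shape[1:]
        fused.append(np.concatenate([f1, f2], axis=0))
    return fused

rng = np.random.default_rng(0)
# Three hypothetical scales for the remote sensing branch and the DSM branch.
first_maps = [rng.standard_normal((c, s, s)) for c, s in [(16, 64), (32, 32), (64, 16)]]
second_maps = [rng.standard_normal((c, s, s)) for c, s in [(8, 64), (16, 32), (32, 16)]]

third_maps = fuse_concat(first_maps, second_maps)
print([f.shape for f in third_maps])
```

Element-wise addition or multiplication would instead require the two branches to have the same number of channels at each scale.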

[0021] The encoder extracts a multi-scale fourth feature map from the multi-scale third feature map, and the process of extracting the fourth feature map at any scale includes: The first intermediate feature map is extracted from the third feature map using depthwise convolution, and then the first intermediate feature map is added pixel-by-pixel to the third feature map to obtain the second intermediate feature map. Depthwise convolution can be performed using a single depthwise convolutional layer, with the kernel size and stride set according to requirements. Pixel-by-pixel addition can be directly implemented using tensor addition operations. For example, a 3x3 depthwise convolutional kernel can be used to convolve the third feature map to obtain the first intermediate feature map. Then, the first intermediate feature map and the third feature map are added pixel-by-pixel through a residual connection to preserve the original information and enhance feature representation.
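The depthwise convolution and residual addition can be sketched as follows. This is a minimal NumPy illustration assuming stride 1, zero padding, and random kernels standing in for learned weights; `depthwise_conv` is a hypothetical helper, and, as in most deep learning frameworks, the operation is implemented as cross-correlation.

```python
import numpy as np

def depthwise_conv(x, kernels):
    """Depthwise convolution: each channel of x (C, H, W) is filtered by its
    own k x k kernel from kernels (C, k, k). Stride 1, zero padding, odd k."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))  # zero-pad the spatial dims only
    out = np.zeros((c, h, w), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            # Accumulate the contribution of kernel tap (i, j) for every channel.
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

rng = np.random.default_rng(0)
third = rng.standard_normal((4, 8, 8))          # third feature map (toy size)
kernels = rng.standard_normal((4, 3, 3)) * 0.1  # stand-in for learned 3x3 kernels

first_mid = depthwise_conv(third, kernels)  # first intermediate feature map
second_mid = third + first_mid              # residual, pixel-by-pixel addition
print(second_mid.shape)
```

Because the output has the same shape as the input, the residual addition needs no projection.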

[0022] In this embodiment, 1x1 convolutions can be used to adjust the number of channels or perform feature compression, while 3x3 and 5x5 depthwise convolutions are applied in parallel to capture information from different receptive fields. For example, a 1x1 convolutional layer is first used to reduce or increase the dimensionality of the second intermediate feature map to obtain a third intermediate feature map. Subsequently, a 3x3 depthwise convolutional layer and a 5x5 depthwise convolutional layer are applied in parallel to the third intermediate feature map to generate a fourth and a fifth intermediate feature map, respectively.
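The 1x1 convolution followed by the two parallel depthwise branches can be sketched as below. The weights are random stand-ins for learned parameters, the channel counts are hypothetical, and the helper names are illustrative only.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv(x, kernels):
    """Per-channel k x k filtering with zero padding and stride 1."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c, h, w), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            out += kernels[:, i, j][:, None, None] * xp[:, i:i + h, j:j + w]
    return out

rng = np.random.default_rng(0)
second_mid = rng.standard_normal((4, 16, 16))  # second intermediate feature map

w_pw = rng.standard_normal((8, 4)) * 0.1       # 1x1 conv expanding 4 -> 8 channels
third_mid = pointwise_conv(second_mid, w_pw)   # third intermediate feature map

k3 = rng.standard_normal((8, 3, 3)) * 0.1
k5 = rng.standard_normal((8, 5, 5)) * 0.1
fourth_mid = depthwise_conv(third_mid, k3)     # 3x3 branch, smaller receptive field
fifth_mid = depthwise_conv(third_mid, k5)      # 5x5 branch, larger receptive field
print(fourth_mid.shape, fifth_mid.shape)
```

Both branches read the same third intermediate feature map, so their outputs are aligned pixel by pixel and can be fused directly.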

[0023] The adaptive gating mechanism generates adaptive weights for the fourth and fifth intermediate feature maps, and then the fourth and fifth intermediate feature maps are weighted and fused according to the adaptive weights to obtain the sixth intermediate feature map. The adaptive gating mechanism can be a fully connected layer or a convolutional layer, whose input is the combined features of the fourth and fifth intermediate feature maps, and whose output is the weights of the two feature maps.
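One possible reading of this gating step is sketched below: a single fully connected layer maps a pooled descriptor of the two branches to two scalar weights. The global average pooling and the softmax normalization are assumptions made for this sketch, as are the names `adaptive_gate_fuse`, `w_gate`, and `b_gate`; the application only states that the gate takes the combined features as input and outputs weights for the two feature maps.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def adaptive_gate_fuse(f4, f5, w_gate, b_gate):
    """Fuse two (C, H, W) branches with input-dependent scalar weights."""
    # Combined descriptor: global average pooling of the concatenated branches.
    desc = np.concatenate([f4, f5], axis=0).mean(axis=(1, 2))  # shape (2C,)
    weights = softmax(w_gate @ desc + b_gate)                   # shape (2,)
    fused = weights[0] * f4 + weights[1] * f5
    return fused, weights

rng = np.random.default_rng(0)
c = 8
fourth_mid = rng.standard_normal((c, 16, 16))
fifth_mid = rng.standard_normal((c, 16, 16))
w_gate = rng.standard_normal((2, 2 * c)) * 0.1  # stand-in for learned FC weights
b_gate = np.zeros(2)

sixth_mid, weights = adaptive_gate_fuse(fourth_mid, fifth_mid, w_gate, b_gate)
print(sixth_mid.shape, weights)
```

A convolutional gate producing per-pixel weights would follow the same pattern, with the softmax taken over the two branches at each position.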

[0024] The tensor of the sixth intermediate feature map is decomposed into first feature components in the state space. Position-dependent soft attention weights are generated based on these first feature components, and a state transition matrix is generated from these soft attention weights. Tensor decomposition can employ methods such as nonnegative tensor decomposition or principal component analysis to decompose the sixth intermediate feature map into multiple first feature components. The position-dependent soft attention weights can be generated from these first feature components by a network whose output is a weight map with the same spatial size as the feature map. The state transition matrix is generated based on the soft attention weights, for example, through matrix multiplication or convolution operations.

[0025] A seventh intermediate feature map is generated based on the state transition matrix, the first feature component, and the first intermediate feature map. The generation of the seventh intermediate feature map can be achieved through a combination of matrix multiplication, convolution, or attention mechanisms. For example, the state transition matrix can be multiplied by the first feature component to obtain an updated feature representation. This updated representation can then be element-wise added to or concatenated with the first intermediate feature map, and finally passed through a convolutional layer to obtain the seventh intermediate feature map.

[0026] The seventh intermediate feature map is enhanced using a channel attention mechanism to obtain the eighth intermediate feature map. A gating mechanism is then used to extract the ninth intermediate feature map from the eighth intermediate feature map. Finally, a projection mechanism is used to interact the ninth intermediate feature map with the first feature component to obtain the tenth intermediate feature map. The channel attention mechanism can use global average pooling and two fully connected layers to generate channel weights. The gating mechanism can be a sigmoid activation function to control the information flow. The projection mechanism can be a linear layer or a convolutional layer. For example, the channel attention mechanism can first perform global average pooling on the seventh intermediate feature map, then generate weights for each channel through two fully connected layers, and multiply these weights with the seventh intermediate feature map channel by channel to obtain the eighth intermediate feature map. Subsequently, a gating unit is applied to the eighth intermediate feature map to extract the ninth intermediate feature map. Finally, a linear projection layer interacts the ninth intermediate feature map with the first feature component, for example, by concatenating the two and then performing convolution to obtain the tenth intermediate feature map.
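The channel attention example (global average pooling, two fully connected layers, channel-wise multiplication) can be sketched in NumPy as follows. The ReLU between the two layers, the sigmoid on the output, and the reduction ratio of 4 are common choices assumed here; the weights are random stand-ins for learned parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, b1, w2, b2):
    """Reweight the channels of x (C, H, W) using pooled channel statistics."""
    squeezed = x.mean(axis=(1, 2))                 # global average pooling -> (C,)
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)   # first FC layer + ReLU
    attn = sigmoid(w2 @ hidden + b2)               # second FC layer -> (C,) in (0, 1)
    return x * attn[:, None, None], attn           # channel-by-channel multiply

rng = np.random.default_rng(0)
c, r = 8, 4                                        # channels, assumed reduction ratio
seventh_mid = rng.standard_normal((c, 16, 16))
w1, b1 = rng.standard_normal((c // r, c)) * 0.1, np.zeros(c // r)
w2, b2 = rng.standard_normal((c, c // r)) * 0.1, np.zeros(c)

eighth_mid, attn = channel_attention(seventh_mid, w1, b1, w2, b2)
print(eighth_mid.shape, attn.round(3))
```

The subsequent gating and projection steps are not shown, since the text leaves their exact form open.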

[0027] The fourth feature map can be generated directly from the tenth intermediate feature map using one or more convolutional layers. For example, a 3x3 convolutional layer can be used to process the tenth intermediate feature map to generate the final fourth feature map.

[0028] Finally, the decoder fuses the multi-scale fourth feature maps to obtain the segmentation result of the first image. The decoder can employ a series of upsampling layers and convolutional layers, combined with skip connections, to fuse feature maps of different scales. The fusion method can be a simple concatenation followed by convolution. For example, the decoder can start from the smallest scale fourth feature map, progressively upsample, and in each upsampling step, concatenate the current scale fourth feature map with the upsampled feature map, and then fuse them through convolutional layers to finally generate a segmentation result with the same size as the original image.
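The decoder example above (start at the smallest scale, upsample, concatenate with the current-scale feature map, fuse by convolution) can be sketched as follows. Nearest-neighbour 2x upsampling, 1x1 convolutions with random weights as the fusion layers, six output classes, and a final 2x upsampling to a 64x64 image are all assumptions for this toy example.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def conv1x1(x, weight):
    return np.einsum("oc,chw->ohw", weight, x)

def decode(fourth_maps, num_classes, rng):
    """fourth_maps: list of (C, H, W) arrays ordered from largest to smallest scale."""
    x = fourth_maps[-1]                                # start at the smallest scale
    for skip in reversed(fourth_maps[:-1]):
        x = upsample2x(x)                              # match the next scale
        x = np.concatenate([skip, x], axis=0)          # skip connection by concatenation
        w = rng.standard_normal((skip.shape[0], x.shape[0])) * 0.1  # stand-in weights
        x = np.maximum(conv1x1(x, w), 0.0)             # fuse, then ReLU
    w_head = rng.standard_normal((num_classes, x.shape[0])) * 0.1
    logits = upsample2x(conv1x1(x, w_head))            # back to the input resolution
    return logits.argmax(axis=0)                       # per-pixel class index

rng = np.random.default_rng(0)
fourth_maps = [rng.standard_normal(s) for s in [(16, 32, 32), (32, 16, 16), (64, 8, 8)]]
labels = decode(fourth_maps, num_classes=6, rng=rng)
print(labels.shape)
```

In a trained model the fusion weights would be learned and the upsampling might be bilinear or transposed convolution; only the data flow is illustrated here.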

[0029] The method in this embodiment effectively integrates multimodal information by fusing multi-scale features from the first image (remote sensing image) and the second image (single-channel image of digital surface model) early in the encoder stage to generate the third feature map. Furthermore, during the extraction of the fourth feature map, a state-space model based on tensor decomposition is introduced to extract the first feature component of the state space from the sixth intermediate feature map. Based on this, position-dependent soft attention weights and a state transition matrix are generated. The state transition matrix is used together with the first feature component and the first intermediate feature map to generate the seventh intermediate feature map. This effectively captures long-range dependencies in the image during the feature extraction stage, significantly overcoming the shortcomings of traditional convolutional neural networks in capturing low-frequency global contextual information and avoiding the problem of low segmentation accuracy due to locality limitations. Furthermore, this embodiment employs an adaptive gating mechanism, which generates adaptive weights for the fourth and fifth intermediate feature maps and performs weighted fusion based on these weights to obtain the sixth intermediate feature map. This adaptive fusion method enables the model to dynamically adjust the contribution of features at different scales according to the actual content of the input features, thereby better adapting to complex scenarios in remote sensing images where the scale of objects varies greatly and the features are highly heterogeneous.

[0030] In summary, this embodiment, through innovative multimodal feature fusion strategies, the introduction of a state-space model to capture long-range dependencies, and the adoption of an adaptive gating mechanism to achieve dynamic multi-scale feature fusion, constitutes an efficient and robust remote sensing image segmentation scheme. This embodiment demonstrates superior performance compared to existing technologies in integrating multi-source heterogeneous data, enhancing global contextual understanding, and adapting to complex scene feature changes, thereby improving the accuracy and generalization ability of remote sensing image semantic segmentation.

[0031] In some embodiments of this application, in step S1220, generating the fourth feature map based on the tenth intermediate feature map includes:

Step S1221: Apply a residual connection between the tenth intermediate feature map and the second intermediate feature map to obtain the eleventh intermediate feature map;

Step S1222: Extract the twelfth intermediate feature map from the eleventh intermediate feature map based on 3x3 convolution, and apply a residual connection between the twelfth intermediate feature map and the eleventh intermediate feature map to obtain the thirteenth intermediate feature map;

Step S1223: Extract the fourteenth intermediate feature map from the thirteenth intermediate feature map based on the feedforward neural network, and enhance the fourteenth intermediate feature map based on the SE attention mechanism to obtain the fifteenth intermediate feature map;

Step S1224: Apply a residual connection between the fifteenth intermediate feature map and the thirteenth intermediate feature map to obtain the fourth feature map.
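The data flow of steps S1221 to S1224 can be summarized in a short sketch. The 3x3 convolution, feedforward network, and SE attention are passed in as callables; the simple functions used in the demonstration are placeholders chosen only to make the example runnable, not the layers of the application, and the name `refine` is hypothetical.

```python
import numpy as np

def refine(tenth, second, conv3x3, ffn, se):
    """Residual chain from the tenth intermediate feature map to the fourth feature map."""
    eleventh = tenth + second            # S1221: residual connection
    twelfth = conv3x3(eleventh)          # S1222: 3x3 convolution ...
    thirteenth = twelfth + eleventh      #        ... with residual connection
    fourteenth = ffn(thirteenth)         # S1223: feedforward network ...
    fifteenth = se(fourteenth)           #        ... then SE attention
    return fifteenth + thirteenth        # S1224: residual connection

rng = np.random.default_rng(0)
tenth = rng.standard_normal((8, 16, 16))
second = rng.standard_normal((8, 16, 16))

# Placeholder layers (assumptions): any shape-preserving callables would do.
def toy_conv(x):
    return 0.5 * x

def toy_ffn(x):
    return np.maximum(x, 0.0)

def toy_se(x):
    scale = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2))))  # per-channel sigmoid gate
    return x * scale[:, None, None]

fourth = refine(tenth, second, toy_conv, toy_ffn, toy_se)
print(fourth.shape)
```

All three residual additions require shape-preserving layers, which is why each placeholder returns an array of the same shape as its input.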

[0032] By performing a residual connection between the tenth intermediate feature map (features processed through multiple complex transformations and attention mechanisms) and the second intermediate feature map (features from an earlier stage that contain more original spatial information), the detailed information of the original features can be preserved, while introducing features processed by advanced semantics, thereby enriching the expression of the features.

[0033] More refined texture and edge information can be captured through 3x3 convolution. Subsequently, the extracted twelfth intermediate feature map is re-connected with the eleventh intermediate feature map. This structure helps to maintain the flow of information during feature extraction and allows the network to review and utilize features from the previous stage while learning new features, thereby enhancing the robustness and expressiveness of the features.

[0034] The feedforward neural network can be used to recalibrate or transform the thirteenth intermediate feature map along the channel dimension. SE attention is a channel attention mechanism that adaptively adjusts the weights of different channels, thereby enhancing useful feature channels and suppressing unimportant ones. By enhancing the fourteenth intermediate feature map using SE attention, the network can focus more on feature channels beneficial to the segmentation task, improving feature discrimination.

[0035] The fifteenth intermediate feature map, enhanced by the feedforward neural network and the SE attention mechanism, is fused with the thirteenth intermediate feature map. This fusion combines the refined and channel-enhanced features with the original features. This multi-residual connection design aims to ensure that feature information at different stages is preserved and utilized during deep feature processing, avoiding information loss and further improving the quality and expressive power of the final fourth feature map, so that it can better serve the subsequent decoder for image segmentation.
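The last two steps (S1223 and S1224) can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions: the ReLU activation, the 4x channel expansion in the feedforward network, the SE reduction ratio of 2, and all function and variable names are hypothetical and are not specified by this application.

```python
import numpy as np

def ffn(x, w_in, w_out):
    """Two fully connected layers applied at every pixel along the channel axis."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w)              # (C, H*W): one column per pixel
    hidden = np.maximum(w_in @ tokens, 0.0)   # expand channels, ReLU (assumed activation)
    return (w_out @ hidden).reshape(c, h, w)  # project back to C channels

def se_attention(x, w_down, w_up):
    """Squeeze-and-excitation: reweight channels using global statistics."""
    squeeze = x.mean(axis=(1, 2))                   # global average pooling -> (C,)
    hidden = np.maximum(w_down @ squeeze, 0.0)      # channel compression
    scale = 1.0 / (1.0 + np.exp(-(w_up @ hidden)))  # channel expansion + sigmoid
    return x * scale[:, None, None]                 # one weight per channel

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
f13 = rng.standard_normal((C, H, W))  # stands in for the thirteenth intermediate feature map
f14 = ffn(f13, 0.1 * rng.standard_normal((4 * C, C)), 0.1 * rng.standard_normal((C, 4 * C)))
f15 = se_attention(f14, rng.standard_normal((C // 2, C)), rng.standard_normal((C, C // 2)))
f4 = f15 + f13                        # residual connection of step S1224
```

Because the SE weights lie between 0 and 1, the attention can only attenuate channels of the feedforward output; the residual path is what keeps the unmodified thirteenth feature map available to the decoder.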

[0036] This embodiment integrates multimodal information by fusing multi-scale features of the first and second images early in the encoder stage to generate a multi-scale third feature map. To obtain the fourth feature map output by the encoder, a second intermediate feature map is first extracted from the third feature map. Then a multi-scale state-space mixer is introduced. This mixer treats 3x3 and 5x5 separable convolutions as a set of efficient high-pass and band-pass filters for capturing high-frequency details and mid-frequency structures in the image. An adaptive gating mechanism then generates weighting factors that dynamically adjust the importance of features at different scales, achieving adaptive feature fusion and yielding the sixth intermediate feature map. Next, a state-space model based on tensor decomposition is applied: the first feature component of the state space is extracted from the sixth intermediate feature map, and position-dependent soft attention weights and a state transition matrix are generated from it. The state transition matrix is used together with the first feature component and the first intermediate feature map to generate the seventh intermediate feature map. This captures long-range dependencies in the image during the feature extraction stage, overcoming the shortcomings of traditional convolutional neural networks in capturing low-frequency global context information and avoiding the low segmentation accuracy caused by locality limitations. Channel attention is then used to enhance important features, and the features are further processed through projection and gating mechanisms to obtain the fourth feature map output by the encoder. Finally, the decoder fuses the multi-scale fourth feature maps to obtain the segmentation result of the first image.

[0037] In summary, this embodiment first extracts features from remote sensing images and DSM images, and then performs multimodal fusion through an encoder. Specifically, when obtaining the encoder's output features, a multi-scale state-space mixer is introduced, treating 3x3 and 5x5 separable convolutions as a set of efficient high-pass and band-pass filters for capturing high-frequency details and mid-frequency structures in the image. At the same time, the state-space model, with its global receptive field and linear recursive characteristics, acts as an intelligent low-pass filter that models long-term dependencies and macroscopic semantics with linear complexity. Learnable gating weights then achieve adaptive fusion and recombination of the entire spectrum of features, from high to low frequencies. This design enables the multi-scale state-space mixer to cover the spectral information in remote sensing images from local details to global layout. It captures multi-scale features and long-range dependencies while fully mining complementary information from the different modalities, significantly improving the segmentation of remote sensing images.

[0038] In some embodiments of this application, step S1210, extracting a multi-scale second feature map from the second image, includes: Step S1211: Extract the horizontal and vertical gradients of the second image based on the Sobel operator; Step S1212: Normalize the horizontal and vertical gradients to obtain the gradient magnitude; Step S1213: The gradient magnitude is fused with the second image to obtain the enhanced second image; Step S1214: Extract the second feature map at the first scale from the enhanced second image based on the CBR block; the CBR block includes cascaded convolutional layers, BN layers and ReLU activation functions; Step S1215: Based on the feature extraction block, extract a second feature map with a higher scale than the first scale from the second feature map at the first scale; Step S1210, which involves extracting a multi-scale first feature map from the first image, includes: Step S1216: Extract the first feature map of the first scale from the first image based on the CBR block; Step S1217: Based on the feature extraction block, extract a first feature map of a higher scale than the first scale from the first feature map of the first scale.

[0039] The gradient magnitude is obtained by calculating the square root of the sum of the squares of the horizontal and vertical gradients. Then, the gradient magnitude is fused with the second image to combine the original height information of the digital surface model image with its edge strength information, thereby generating an enhanced second image with richer information. This fusion can be performed by concatenating the gradient magnitude as an additional channel with the original second image, or by weighted summation pixel by pixel. The enhanced second image is a second image that incorporates the gradient magnitude information. The image contains the height information of the original DSM and the edge strength information extracted by the Sobel operator, providing richer and more discriminative input for subsequent feature extraction.
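The enhancement described above can be sketched in NumPy as follows. This is a minimal sketch under assumptions: the edge padding, the channel-concatenation variant of the fusion, the omission of any normalization step, and the names `filter3x3` and `enhance_dsm` are illustrative choices, not details given by this application.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
SOBEL_Y = SOBEL_X.T                                                    # vertical gradient

def filter3x3(img, kernel):
    """Cross-correlate a 2D image with a 3x3 kernel, using edge padding."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def enhance_dsm(dsm):
    """Return a 2-channel image: original heights plus Sobel gradient magnitude."""
    gx = filter3x3(dsm, SOBEL_X)
    gy = filter3x3(dsm, SOBEL_Y)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)    # square root of the sum of squares
    return np.stack([dsm, magnitude], axis=0)

# Toy DSM: flat ground on the left, a 10 m high roof on the right.
dsm = np.zeros((6, 6))
dsm[:, 3:] = 10.0
enhanced = enhance_dsm(dsm)
print(enhanced[1][0])  # nonzero only in the two columns next to the height step
```

In this toy case the second channel responds only along the building edge, which is the structural cue the enhanced second image is meant to carry.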

[0040] CBR consists of cascaded convolutional layers, batch normalization (BN) layers, and ReLU activation functions. The convolutional layers are used to extract local features of the image; the BN layers are used to accelerate network training and improve the model's generalization ability, preventing overfitting; the ReLU activation function introduces non-linearity and enhances the model's expressive power. The CBR block can effectively extract preliminary, low-level feature representations from the input image.

[0041] The second feature map at the first scale represents the feature information of the enhanced second image after preliminary convolution processing, and it forms the basis for subsequent multi-scale feature extraction. Feature extraction blocks can progressively increase the receptive field of the feature map and extract feature information at different scales through a series of operations such as convolution, pooling, or residual connections. Besides residual blocks, feature extraction blocks can also be combinations of stacked convolutional layers and pooling layers.

[0042] The second feature map, which is at a higher scale than the first scale, refers to the feature map that is further extracted from the second feature map at the first scale through feature extraction blocks. It has a larger receptive field and higher semantic information. These feature maps at different scales are crucial for capturing targets of different sizes and contextual information in the image.

[0043] This embodiment significantly improves the accuracy and robustness of remote sensing image segmentation. By applying the Sobel operator to the digital surface model image and fusing its gradient information with the single-channel DSM image, the segmentation model can more effectively capture key structural features such as terrain changes and building edges when processing the second image, overcoming the limitations of single-channel information representation in the DSM image. It also introduces a strong inductive bias, limiting the hypothesis space used by the model to learn object boundaries, reducing the complexity of boundary modeling and suppressing instability caused by appearance noise. Then, CBR blocks are used for initial feature extraction on both the first and enhanced second images, followed by multi-scale feature extraction through feature extraction blocks. This ensures that the encoder obtains rich and hierarchical feature representations. These refined and multi-scale extracted feature maps provide high-quality input for feature fusion in the subsequent encoder, enabling the segmentation model to more accurately identify and distinguish different land cover categories in the image, thus effectively solving the problem of low segmentation accuracy caused by insufficient feature extraction in traditional methods.

[0044] In some embodiments of this application, the feature extraction block includes a ResBlock block; the ResBlock block includes cascaded 1x1 convolutional layers, 3x3 convolutional layers, and 1x1 convolutional layers; Step S1215, which involves extracting a second feature map at a higher scale than the first scale from the second feature map based on the feature extraction block, includes: Step S1215-1: Extract the second feature map at the second scale from the second feature map at the first scale based on the three stacked ResBlock blocks; Step S1215-2: Extract the second feature map of the third scale from the second feature map of the second scale based on the four stacked ResBlock blocks; Step S1215-3: Extract the second feature map of the fourth scale from the second feature map of the third scale based on the six stacked ResBlock blocks; the scale gradually increases from the first scale to the fourth scale. Step S1217, based on the feature extraction block, extracts a first feature map at a higher scale than the first scale from the first feature map at the first scale, including: Step S1217-1: Extract the first feature map of the second scale from the first feature map of the first scale based on the three stacked ResBlock blocks; Step S1217-2: Extract the first feature map of the third scale from the first feature map of the second scale based on the four stacked ResBlock blocks. Step S1217-3: Extract the first feature map of the fourth scale from the first feature map of the third scale based on the six stacked ResBlock blocks.

[0045] The core idea of ResBlock is to introduce skip connections that allow information to pass directly from the previous layer to the next, thereby alleviating the problems of vanishing or exploding gradients during the training of deep neural networks and enabling the network to be trained to greater depths. ResBlock blocks can effectively learn the residual mapping between input and output features, thus improving the network's feature extraction capabilities.

[0046] The ResBlock consists of cascaded 1x1, 3x3, and 1x1 convolutional layers. This structure is often referred to as a bottleneck residual block. The first 1x1 convolutional layer is used to reduce the number of channels in the feature map to reduce computation; the 3x3 convolutional layer is used for the main feature extraction; and the second 1x1 convolutional layer is used to restore the number of channels in the feature map and can further integrate features.
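The computational saving of the bottleneck design can be illustrated by counting convolution weights. The sketch below assumes a channel reduction factor of 4 (the usual ResNet50 choice, not stated in this application), ignores biases and normalization parameters, and uses hypothetical function names.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a k x k convolution without bias."""
    return c_in * c_out * k * k

def bottleneck_params(channels, reduction=4):
    """1x1 reduce -> 3x3 at reduced width -> 1x1 restore."""
    mid = channels // reduction
    return (conv_params(channels, mid, 1)
            + conv_params(mid, mid, 3)
            + conv_params(mid, channels, 1))

def plain_params(channels):
    """Two stacked 3x3 convolutions at full width, for comparison."""
    return 2 * conv_params(channels, channels, 3)

print(bottleneck_params(256))  # 69632
print(plain_params(256))       # 1179648
```

For 256 channels the bottleneck block needs roughly one seventeenth of the weights of two full-width 3x3 convolutions, which is why many such blocks can be stacked per scale.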

[0047] From the second feature map at the first scale, a second feature map at the second scale can be extracted using three stacked ResBlock blocks. Stacking multiple ResBlock blocks deepens the network layers, allowing the model to learn more abstract, higher-level feature representations. Each ResBlock block, through its internal convolutional operations and residual connections, progressively refines and enriches the feature information when processing the feature map. From the second feature map at the second scale, a second feature map at the third scale can be extracted using four stacked ResBlock blocks. Similar to extracting the second-scale feature map, by increasing the number of ResBlock blocks, the network can further deepen its understanding and abstraction of features. This design of progressively increasing the number of stacked blocks helps capture information of different granularities at different scales, thus providing richer context for subsequent feature fusion. From the second feature map at the third scale, a second feature map at the fourth scale can be extracted using six stacked ResBlock blocks. By stacking more ResBlock blocks, the highest-scale feature maps can be obtained. These feature maps typically contain the most abstract and semantic information, which is crucial for understanding high-level semantic concepts in images. The scales from the first to the fourth scale can be progressively increased. The scale here usually refers to the gradual decrease in the spatial resolution of the feature map and the gradual increase in the receptive field, thereby enabling the capture of a wider range of contextual information.

[0048] In this embodiment, each ResBlock consists of cascaded 1x1, 3x3, and 1x1 convolutional layers. This bottleneck structure design allows the feature extraction process to effectively learn deep representations of the input features while maintaining computational efficiency. When extracting multi-scale features from the first and second images, the method generates feature maps of different scales by stacking different numbers of ResBlock blocks. This design of progressively increasing the number of stacked ResBlock blocks allows the network to gradually deepen as the scale increases (i.e., the spatial resolution decreases and the receptive field increases), thereby extracting and abstracting features at a deeper level.

[0049] Through this hierarchical and progressive feature extraction strategy, the encoder can effectively capture multi-scale features from fine-grained to coarse-grained from remote sensing images and single-channel images of digital surface models, providing a comprehensive and robust feature representation for subsequent feature fusion and segmentation tasks. This structured multi-scale feature extraction method enables the model to better understand the image content, thereby improving the accuracy and robustness of remote sensing image segmentation.

[0050] In some embodiments of this application, the decoder includes four DecoderBlock blocks based on the CMTFNet network; Step S1230, which fuses the multi-scale fourth feature map through the decoder to obtain the segmentation result of the first image, includes: Step S1231: Input the fourth feature map of the fourth scale into the first DecoderBlock block to obtain the output feature map of the first DecoderBlock block; Step S1232: The fourth feature map of the third scale is fused with the output feature map of the first DecoderBlock block to obtain the first fused feature map, and the first fused feature map is input into the second DecoderBlock block to obtain the output feature map of the second DecoderBlock block; Step S1233: The fourth feature map of the second scale is fused with the output feature map of the second DecoderBlock block to obtain the second fused feature map, and the second fused feature map is input into the third DecoderBlock block to obtain the output feature map of the third DecoderBlock block; Step S1234: The fourth feature map of the first scale is fused with the output feature map of the third DecoderBlock block to obtain the third fused feature map, and the third fused feature map is input into the fourth DecoderBlock block to obtain the output feature map of the fourth DecoderBlock block. Step S1235: Based on the detection head, extract the segmentation result of the first image from the output feature map of the fourth DecoderBlock block.

[0051] The DecoderBlock uses a structure or processing logic specific to the CMTFNet (CNN and Multiscale Transformer Fusion Network).

[0052] The DecoderBlock is the basic processing unit in the decoder. Its function is to process and transform the input feature map and typically fuse it with corresponding scale features from the encoder. The DecoderBlock can contain convolutional layers, normalization layers, activation functions, and upsampling operations, aiming to gradually restore the spatial resolution of the feature map and enhance its semantic information. For example, the DecoderBlock can include an upsampling layer to double the resolution of the feature map, followed by one or more convolutional layers for feature extraction and refinement. Alternatively, it can contain a feature fusion module to merge the upsampled features with skip connection features from the encoder.

[0053] The fourth feature map at the fourth scale is the highest-level feature map output by the encoder, with the lowest resolution but the richest semantic information. It is input into the first DecoderBlock, which is the starting point of the decoding process. The DecoderBlock will perform preliminary processing on the highest-level features, such as upsampling and feature refinement, in preparation for fusion with the features of the next scale.

[0054] The fourth feature map at the third scale, the fourth feature map at the second scale, and the fourth feature map at the first scale are feature maps extracted by the encoder at different downsampling stages. Their resolution increases sequentially, semantic information gradually weakens, but spatial detail information gradually becomes richer. Fusing these fourth feature maps at different scales with the output feature map of the previous DecoderBlock is a key step for the decoder to achieve multi-scale information integration. This fusion typically involves aligning the low-resolution feature maps with the high-resolution feature maps through upsampling operations, and then merging them through methods such as concatenation or element-wise addition, thereby combining high-level semantic information with low-level spatial details.
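The top-down order of steps S1231 to S1234 can be sketched as follows. This is a deliberately reduced sketch: each DecoderBlock is replaced by nearest-neighbour 2x upsampling (the fourth by an identity), fusion is plain element-wise addition, and all scales share the same channel count. These simplifications and the function names are assumptions, not the CMTFNet DecoderBlock itself.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decode(f_s1, f_s2, f_s3, f_s4):
    """f_s1..f_s4: fourth feature maps from the first (finest) to fourth (coarsest) scale."""
    out = upsample2x(f_s4)         # S1231: first DecoderBlock (stand-in)
    out = upsample2x(f_s3 + out)   # S1232: fuse with the third scale, second DecoderBlock
    out = upsample2x(f_s2 + out)   # S1233: fuse with the second scale, third DecoderBlock
    return f_s1 + out              # S1234: fuse with the first scale (fourth block as identity)

C = 4
maps = [np.ones((C, 16 // 2 ** i, 16 // 2 ** i)) for i in range(4)]  # 16, 8, 4, 2 pixels wide
result = decode(*maps)
print(result.shape)  # (4, 16, 16)
```

Each fusion requires the decoder output to match the resolution of the next encoder map, which is why every stage except the last restores a factor of two in spatial size.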

[0055] The detection head is the final output layer of the segmentation model. The detection head usually consists of one or more convolutional layers, and the number of its output channels corresponds to the number of categories in the segmentation task. For example, the detection head can be a 1x1 convolutional layer that maps the number of channels in the feature map to the number of categories, and then generates a probability map of each pixel belonging to each category through the Softmax or Sigmoid activation function.
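A head of the 1x1-convolution-plus-Softmax kind mentioned above can be sketched in NumPy as follows. The shapes, the class count of 6, and the function name are illustrative assumptions.

```python
import numpy as np

def segmentation_head(features, weight, bias):
    """1x1 convolution followed by a per-pixel softmax.

    features: (C, H, W), weight: (K, C), bias: (K,) for K classes.
    """
    logits = np.einsum("kc,chw->khw", weight, features) + bias[:, None, None]
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=0, keepdims=True)         # class probabilities per pixel
    return probs, probs.argmax(axis=0)                   # probability map and label map

rng = np.random.default_rng(0)
features = rng.standard_normal((16, 5, 5))
probs, labels = segmentation_head(features, rng.standard_normal((6, 16)), np.zeros(6))
```

A 1x1 convolution is a per-pixel linear map over channels, so it changes only the channel count (here 16 to 6) and leaves the spatial layout untouched.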

[0056] The decoder in this embodiment effectively integrates the multi-scale fourth feature maps extracted by the encoder, solving the problems of detail loss and boundary blurring that may exist in traditional decoders when processing multi-scale features. This top-down, hierarchical fusion strategy allows high-level semantic information to gradually guide the recovery of low-level spatial details, thereby accurately locating target boundaries while maintaining global contextual understanding. Specifically, by fusing fourth feature maps of different scales with upsampled feature maps, the model can fully utilize feature information from coarse to fine, avoiding the limitations of single-scale features. This enables the segmentation model to perform more refined and accurate segmentation of complex features in remote sensing images, especially when dealing with small targets and irregular boundaries, significantly improving segmentation accuracy and visual quality. Ultimately, the embodiment can generate high-quality remote sensing image segmentation results, providing reliable data support for subsequent geographic information analysis and applications.

[0057] In some embodiments of this application, fusing the fourth feature map at any given scale with the output feature map of the corresponding DecoderBlock block to obtain the corresponding fused feature map in the above steps includes: Step S211: Perform average pooling and max pooling on the fourth feature map to obtain the average pooling feature map and the max pooling feature map; Step S212: Extract a spatial attention feature map from the average pooling feature map and the max pooling feature map based on the spatial attention mechanism; Step S213: Multiply the spatial attention feature map and the fourth feature map pixel by pixel to obtain the sixteenth feature map; Step S214: Perform weighted fusion of the sixteenth feature map and the output feature map to obtain the seventeenth feature map; Step S215: Extract the fused feature map from the seventeenth feature map based on depthwise separable convolution.

[0058] In this embodiment, after the decoder fuses the multi-scale fourth feature map, it first performs average pooling and max pooling operations on the input fourth feature map (i.e., the encoder's output feature map) to capture contextual and saliency information from different dimensions. Based on these pooled features, a spatial attention feature map is generated through a spatial attention mechanism. This feature map can adaptively identify and highlight key spatial regions in the fourth feature map. Then, this spatial attention feature map is multiplied pixel-by-pixel with the original fourth feature map to spatially weight the fourth feature map. This allows the model to focus more on regions beneficial to the segmentation task and suppress interference from irrelevant regions. The spatially weighted sixteenth feature map is then fused with the output feature map of the DecoderBlock block. This fusion method allows the model to dynamically adjust the contributions of features according to their importance, ensuring effective information integration. The fused seventeenth feature map is processed using depthwise separable convolution to further extract high-level semantic information from the fused features. Through this series of refined fusion steps, this embodiment can more effectively integrate multi-scale features, enhance the expressive power of features, and optimize the information flow of the decoder, thereby overcoming the problems of information redundancy and loss of details that may be caused by simple fusion, and providing a more solid foundation for accurate segmentation of remote sensing images.
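Steps S211 to S214 can be sketched in NumPy as follows. Several points are assumptions made for brevity: the pooling is taken along the channel dimension, the spatial attention mechanism is reduced to a sigmoid over a weighted sum of the two pooled maps (a convolution over the stacked maps would be the more common choice), the depthwise separable convolution of step S215 is omitted, and all names and weight values are hypothetical.

```python
import numpy as np

def asaf_fuse(enc, dec, w_avg, w_max, alpha, beta):
    """Spatially reweight encoder features, then blend them with decoder features."""
    avg_map = enc.mean(axis=0)                       # S211: average over channels -> (H, W)
    max_map = enc.max(axis=0)                        # S211: maximum over channels -> (H, W)
    attn = 1.0 / (1.0 + np.exp(-(w_avg * avg_map + w_max * max_map)))  # S212: weights in (0, 1)
    f16 = enc * attn[None, :, :]                     # S213: pixel-by-pixel multiplication
    f17 = alpha * f16 + beta * dec                   # S214: weighted fusion
    return f17, attn

rng = np.random.default_rng(0)
enc = rng.standard_normal((8, 4, 4))   # fourth feature map from the encoder
dec = rng.standard_normal((8, 4, 4))   # output feature map of the previous DecoderBlock
fused, attn = asaf_fuse(enc, dec, w_avg=1.0, w_max=1.0, alpha=0.6, beta=0.4)
```

The attention map has one value per pixel and is shared by all channels, so it selects where the encoder features contribute rather than which channels contribute.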

[0059] For ease of understanding, as shown in Figures 2 to 7, some embodiments of this application provide a method for segmenting remote sensing images, the method comprising: Step S910: Construct a segmentation model. As shown in Figure 2, the segmentation model is a multi-scale state-space hybrid network with a dual-branch encoder. The network uses two parallel ResNet50 networks as its backbone architecture. It has two branches, the RGB branch and the DSM branch, which extract remote sensing feature maps and DSM feature maps from remote sensing images and DSM single-channel images, respectively.

[0060] The DSM branch includes a feature enhancement module (represented by DFE in the figure), which explicitly reduces the conditional entropy at the object boundary by including high-fidelity elevation gradient information. Then there is a fusion module based on SE attention (represented by F in the figure), which performs weighted fusion of remote sensing image features and DSM features.

[0061] The segmentation model also incorporates a visual feature enhancement module (represented by VFE in the figure). The VFE module integrates multi-scale high-pass detail modeling and low-pass global modeling into a unified full-frequency view, and further increases the proportion of effective features through attention. The VFE module contains the MSSM module and enhances multi-scale feature representation and long-range dependency modeling in remote sensing images through flattening and normalization, global context modeling via the MSSM module, and residual connections, combined with depthwise convolution and learnable scaling factors.

[0062] The segmentation model also incorporates an adaptive spatial attention fusion (ASAF) module. The ASAF module dynamically calibrates the feature alignment between the encoder's and decoder's output feature maps to achieve efficient cross-level feature integration. During decoding, the pyramid decoder structure, together with the ASAF module, is used for progressive feature fusion and spatial resolution recovery. The ASAF module generates spatial saliency weights through channel statistical compression, enabling adaptive location-level calibration of cross-scale and cross-modal features. This suppresses noise and modal conflicts while enhancing fusion robustness and boundary details.

[0063] Step S920: Train the segmentation model based on the model samples. The VFE module is introduced first. As shown in Figure 3, the VFE module first flattens the 2D feature map and normalizes it through a one-dimensional normalization layer. The processed feature map is then fed into the MSSM module for global context modeling, and the result is reshaped into a 2D feature map, which achieves an effective transition from local features to global semantics. The VFE module also incorporates an efficient channel attention mechanism, which learns the global dependencies between channels through global average pooling and a channel compression-expansion structure, further enhancing the expressive power of important feature channels. Residual connections run through the entire process to ensure that the original feature information is not lost.

[0064] Specifically, the VFE module first extracts spatial features through 3x3 depthwise convolutions and introduces learnable layer scaling factors to dynamically adjust the feature weights in the processing stage. This process can be represented as formula (1), where the layer scaling factors are learnable parameters.

[0065] Next, long-range dependencies are captured using the MSSM module, spatial features are extracted using 3x3 convolution, and finally, feature enhancement is achieved using a feedforward neural network (FFN) and SE attention.

[0066] Specifically, the FFN enhances feature representation through nonlinear transformation and dimensionality expansion using two fully connected layers, enabling the model to learn more complex feature representations. SE attention extracts global features through global average pooling and uses this global information to generate channel attention weights. This adaptively enhances feature responses in important channels while suppressing unimportant channels, achieving channel-level feature selection and enhancement. This process can be represented as formulas (2) to (5). Applied after multimodal feature fusion, the VFE module in this embodiment can capture multi-scale representations and model long-range dependencies, which significantly improves the accuracy of semantic segmentation of remote sensing images.

[0067] The MSSM module is a core sub-component of the VFE module (i.e., the MSSM operation in formula (3)). Based on the state-space duality principle, the MSSM module provides a multi-scale feature processing mechanism. Its core lies in treating 3x3 and 5x5 separable convolutions as a set of efficient high-pass and band-pass filters designed to capture high-frequency details and mid-frequency structures in images. Leveraging the global receptive field and linear recursive properties of the state-space model, it also functions as an intelligent low-pass filter, modeling long-term dependencies and macroscopic semantics with linear complexity. Through learnable gating weights, it achieves adaptive fusion and recombination of the entire spectrum of features, from high to low frequencies. This design enables the MSSM module to cover the spectral information in remote sensing images from local details to global layout.

[0068] The MSSM module establishes global connections between arbitrary pixel locations through matrix multiplication, enabling each pixel location to access information from the entire feature map. The state parameter A is normalized using the softmax function to ensure the effectiveness and numerical stability of the attention weights, thereby achieving global context modeling with linear computational complexity.

[0069] To control computational complexity, the MSSM module employs grouped convolutions and depthwise separable convolutions, reducing the parameter complexity from O(N²) to O(N), where N represents the sequence length. Further improvements in computational efficiency are achieved by limiting the state dimension to 16-32, keeping the computational complexity of global feature interactions at a linear level of O(N×d), where d represents the state dimension.
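The linear cost claimed above follows from the order in which the matrix products are evaluated. The sketch below is a generic illustration, not a reproduction of formulas (11) to (13): the names `X`, `B`, and `C`, their shapes, and the choice to apply the softmax over positions are all assumptions.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, c = 64, 16, 8                  # sequence length, state dimension, channels
X = rng.standard_normal((N, c))      # flattened feature map, one row per pixel
B = softmax(rng.standard_normal((N, d)), axis=0)  # soft weights over all positions
C = rng.standard_normal((N, d))      # how each position reads from the state

quadratic = (C @ B.T) @ X            # builds an N x N matrix: O(N^2) time and memory
state = B.T @ X                      # (d, c) summary of the whole map: O(N * d * c)
linear = C @ state                   # every position reads the state: O(N * d * c)

print(np.allclose(quadratic, linear))  # True
```

Both orderings give the same result, but the second never materializes the N x N interaction matrix, so with a state dimension fixed at 16 to 32 the cost grows linearly with the number of pixels while each pixel still receives information from the whole feature map.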

[0070] The MSSM workflow includes the following key steps: First, the input features are projected through a 1x1 convolution, and then multi-scale features are extracted in parallel using 3x3 and 5x5 depthwise separable convolution kernels. Subsequently, an adaptive gating mechanism generates weighting factors to dynamically adjust the importance of features at different scales, achieving adaptive feature fusion, as shown in formulas (6) to (10), where the two weighting factors sum to 1.
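A gate whose two weights sum to 1 can be obtained with a softmax over two logits. The sketch below is one possible realization under assumptions: the logits are computed from globally pooled descriptors of both branches by a single linear layer, and the names `adaptive_gate` and `w_gate` are hypothetical.

```python
import numpy as np

def adaptive_gate(f3, f5, w_gate):
    """Blend the 3x3-branch and 5x5-branch features with input-dependent weights.

    f3, f5: (C, H, W) branch outputs; w_gate: (2, 2C) gating layer (assumed form).
    """
    desc = np.concatenate([f3.mean(axis=(1, 2)), f5.mean(axis=(1, 2))])  # (2C,) descriptor
    logits = w_gate @ desc                                               # one logit per branch
    logits = logits - logits.max()
    weights = np.exp(logits) / np.exp(logits).sum()                      # softmax: sums to 1
    fused = weights[0] * f3 + weights[1] * f5
    return fused, weights

rng = np.random.default_rng(0)
C = 8
f3 = rng.standard_normal((C, 4, 4))
f5 = rng.standard_normal((C, 4, 4))
fused, weights = adaptive_gate(f3, f5, rng.standard_normal((2, 2 * C)))
```

Because the weights are a convex combination, the fused map always lies between the two branch outputs, and the gate shifts toward whichever scale the input content favors.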

[0071] The fused features are then decomposed into three key components of the state space (i.e., three first feature components). Position-dependent soft attention weights are generated using the Softmax function and used to dynamically adjust the state transition matrix. This adaptive mechanism allows the model to flexibly adjust state transitions based on the input. Subsequent feature interactions are achieved through matrix multiplication, and the process can be described by formulas (11) to (13). Channel attention is then applied to enhance important features, as in formula (14). The features are then further processed through projection and gating mechanisms, as in formulas (15) and (16), whose projection and gating weights are learnable parameters.

[0072] Finally, these features are processed through the output projection layer and interact with the third component to generate the final output features of the MSSM module, as in formula (17). The output of formula (17) is then used as the output of the MSSM module in formula (3).

[0073] The ASAF module is described below. As a lightweight filter, the ASAF module first performs statistical compression on the channel information to form a compact spatial description. Then, it learns to generate spatial saliency weights for location-level recalibration of features. This prioritizes preserving locations with greater information content and consistency during the fusion process, while suppressing noise and modal conflict regions. This transforms cross-scale, cross-modal feature interaction from indiscriminate fusion to saliency-based adaptive guided fusion, enhancing the robustness of the fusion and the representation of boundary details.

[0074] Specifically, for the input encoder features and decoder features, channel statistics of the encoder features are first used to generate a spatial attention map for feature enhancement, as in formulas (18) and (19), where the two statistics are the average and maximum values along the channel dimension, respectively.

[0075] Then, learnable weight parameters are used for weighted fusion, dynamically adjusting the relative importance of the different features. Finally, depthwise separable convolution is used for post-processing to further fuse the features and output the result, as in formulas (20) and (21), where the fusion weights are learnable weight parameters.
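The ASAF steps described above (channel statistics, spatial attention, recalibration, weighted fusion) can be sketched in plain Python. This is an illustrative sketch only: a per-pixel linear map followed by a sigmoid stands in for the learned layer that produces the attention map, the parameter names `w_mean`, `w_max`, `alpha` and `beta` are hypothetical, and the final depthwise separable convolution is omitted.

```python
import math

def channel_stats(feat):
    """Per-pixel mean and max across channels; feat is indexed [C][H][W]."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    mean = [[sum(feat[k][i][j] for k in range(c)) / c for j in range(w)]
            for i in range(h)]
    mx = [[max(feat[k][i][j] for k in range(c)) for j in range(w)]
          for i in range(h)]
    return mean, mx

def asaf_fuse(enc, dec, w_mean=1.0, w_max=1.0, alpha=0.5, beta=0.5):
    """Recalibrate encoder features spatially, then fuse with decoder features.

    enc, dec: same-shaped [C][H][W] nested lists.
    Assumption: sigmoid(w_mean * mean + w_max * max) replaces the learned
    attention layer; alpha and beta play the role of the learnable fusion weights.
    """
    mean, mx = channel_stats(enc)
    h, w = len(mean), len(mean[0])
    attn = [[1.0 / (1.0 + math.exp(-(w_mean * mean[i][j] + w_max * mx[i][j])))
             for j in range(w)] for i in range(h)]
    fused = [[[alpha * attn[i][j] * enc[k][i][j] + beta * dec[k][i][j]
               for j in range(w)] for i in range(h)]
             for k in range(len(enc))]
    return attn, fused

# Tiny example: 2 channels, 1 x 2 spatial grid.
enc = [[[0.0, 2.0]], [[0.0, -2.0]]]
dec = [[[1.0, 1.0]], [[1.0, 1.0]]]
attn, fused = asaf_fuse(enc, dec)
```

The second pixel has a larger channel maximum than the first, so it receives a larger attention weight and its encoder response is preserved more strongly in the fusion.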

[0076] The ASAF module achieves spatial feature recalibration and adaptive fusion by statistically compressing channel information and learning spatial saliency weights. The strategy enhances the fusion process by emphasizing responses from regions with higher information content and stronger modal consistency, while suppressing interference from noise and conflicting features. Therefore, it improves the robustness of cross-scale feature fusion and enhances the representation of boundary details.

[0077] The DFE module is described below. The DFE module preprocesses the DSM image before feature extraction, calculating the Sobel gradient magnitude of the DSM image and injecting it as a structural cue to emphasize geometric discontinuities. This spatially focuses the input on potential object boundaries and shape contours. Since the cues are derived from fixed operators rather than additional learnable parameters, the DFE module not only enhances edges but also introduces a strong inductive bias, limiting the hypothesis space the model uses to learn object boundaries. This approach reduces the complexity of boundary modeling and suppresses instabilities caused by appearance noise. Specifically, the DFE module first calculates the horizontal and vertical gradients, as in formulas (22) and (23), where the operator symbol denotes the Sobel operator operation, and the two results are the partial derivatives along the horizontal and vertical directions, respectively.

[0078] The calculated gradient magnitudes are then normalized to the range [0,1] and finally fused with the DSM image using a weighted method, as in formulas (24) to (26), where the weighting coefficient controls the degree of edge enhancement.
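The DFE preprocessing can be sketched as follows. The sketch assumes min-max normalization and an additive fusion of the form DSM + alpha * gradient; the exact forms of formulas (24) to (26) are not reproduced here, and the names `dfe_enhance`, `alpha` and `eps` as well as the edge-replication border handling are hypothetical choices.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def _filter3x3(img, kernel, i, j):
    """Apply a 3x3 kernel centred at (i, j), replicating border pixels."""
    h, w = len(img), len(img[0])
    total = 0.0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            ii = min(max(i + di, 0), h - 1)
            jj = min(max(j + dj, 0), w - 1)
            total += kernel[di + 1][dj + 1] * img[ii][jj]
    return total

def dfe_enhance(dsm, alpha=0.5, eps=1e-8):
    """Return (normalized gradient magnitude, edge-enhanced DSM)."""
    h, w = len(dsm), len(dsm[0])
    mag = [[math.hypot(_filter3x3(dsm, SOBEL_X, i, j),
                       _filter3x3(dsm, SOBEL_Y, i, j))
            for j in range(w)] for i in range(h)]
    lo = min(min(row) for row in mag)
    hi = max(max(row) for row in mag)
    # Min-max normalization to [0, 1]; eps avoids division by zero on flat input.
    norm = [[(m - lo) / (hi - lo + eps) for m in row] for row in mag]
    # Assumed additive fusion: alpha controls the degree of edge enhancement.
    enhanced = [[dsm[i][j] + alpha * norm[i][j] for j in range(w)]
                for i in range(h)]
    return norm, enhanced

# A vertical step edge (e.g. a building wall) in a 4 x 4 height map.
dsm = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
norm, enhanced = dfe_enhance(dsm)
```

In this example only the two columns adjacent to the height step receive a non-zero gradient response, so the enhancement is concentrated on the boundary and leaves flat regions unchanged.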

[0079] By injecting structural cues, the DFE module explicitly enhances discriminative cues at geometric discontinuities in DSM images, shifting the input distribution from intensity-driven to structure-driven. This not only improves the separability and stability of boundary-related information in early representations but also provides a higher signal-to-noise ratio and task-related prior guidance for subsequent feature extraction and cross-modal fusion.

[0080] The loss function is introduced below. To better address class imbalance and boundary blurring in remote sensing images, the network (i.e., the segmentation model) employs a composite loss function, with the total loss expressed as formula (27), whose terms are the main output loss, the auxiliary output loss, and the auxiliary loss weight.

[0081] The main output loss is given by formula (28), where the cross-entropy loss serves as the basic loss function for optimizing the two-dimensional semantic segmentation task; the class imbalance loss is used to address the class imbalance problem, mitigating the influence of common classes and strengthening the learning of rare classes; and the boundary-aware loss focuses on optimizing the boundary region to improve the accuracy of boundary segmentation. The two weighting coefficients are the weights of the corresponding loss terms.
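A minimal sketch of how such a composite loss could be assembled is given here, assuming a plain weighted sum in which the cross-entropy term is unweighted. The individual loss terms are passed in as already computed scalars, and the function names and default weights (`w_imb`, `w_bd`, `aux_weight`) are hypothetical values chosen for illustration, not values from the application.

```python
def main_loss(ce, imbalance, boundary, w_imb=1.0, w_bd=1.0):
    """Main-output loss: cross-entropy plus weighted class-imbalance and
    boundary-aware terms (assumed additive form)."""
    return ce + w_imb * imbalance + w_bd * boundary

def total_loss(main, aux, aux_weight=0.4):
    """Total loss: main-output loss plus the weighted auxiliary-output loss."""
    return main + aux_weight * aux

# Example with made-up scalar loss values.
l_main = main_loss(ce=1.0, imbalance=0.5, boundary=0.25, w_imb=2.0, w_bd=4.0)
l_total = total_loss(l_main, aux=2.0, aux_weight=0.5)
```

Keeping the auxiliary weight below 1 lets the auxiliary head regularize training without dominating the gradient of the main output.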

[0082] The auxiliary output uses a simplified version of the composite loss, as in formula (29). Step S930: The remote sensing image is segmented based on the trained segmentation model.

[0083] Step S9310: Acquire a remote sensing image and the corresponding DSM image. Step S9320: The remote sensing image is input into the RGB branch of the encoder. The first feature map of the first scale is extracted through the CBR block; it is then input into three stacked ResBlock blocks to obtain the first feature map of the second scale, then into four stacked ResBlock blocks to obtain the first feature map of the third scale, and then into six stacked ResBlock blocks to obtain the first feature map of the fourth scale. Step S9330: The DSM image is input into the DSM branch of the encoder. The DSM image is first enhanced by the DFE module, and the second feature map of the first scale is extracted by the CBR block; it is then input into three stacked ResBlock blocks to obtain the second feature map of the second scale, then into four stacked ResBlock blocks to obtain the second feature map of the third scale, and finally into six stacked ResBlock blocks to obtain the second feature map of the fourth scale. Step S9340: The first feature map at the first scale and the second feature map at the first scale are fused to obtain the third feature map at the first scale; similarly, the third feature maps at the second to fourth scales are obtained.

[0084] In step S9350, the third feature map of the first scale is input into the VFE module (including the MSSM module) to obtain the output fourth feature map of the first scale; similarly, the fourth feature map of the second scale to the fourth feature map of the fourth scale are obtained.

[0085] In step S9360, in the decoder, the fourth feature map of the fourth scale is input into the DecoderBlock1 module to obtain the corresponding output feature map. This output feature map and the fourth feature map of the third scale are fed together into the ASAF module to obtain the corresponding fused feature map, which is then fed into the DecoderBlock2 module, and so on, until the fused feature map output by the last ASAF block is obtained. Step S9370: The fused feature map is input into the detection head to obtain the segmented remote sensing image output by the detection head.
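The decoding order can be traced with a small wiring sketch. It follows the ordering given in claim 5 (four DecoderBlock blocks, with the last fused feature map passing through the fourth block before the detection head); the callables `decoder_blocks`, `asaf` and `head` are hypothetical stand-ins, and strings replace real tensors so that the data flow is visible.

```python
def decode(f4, decoder_blocks, asaf, head):
    """Wire the decoder.

    f4: the four encoder outputs (fourth feature maps), ordered from the
        first scale to the fourth scale.
    decoder_blocks: four callables standing in for DecoderBlock1..4.
    asaf: callable(encoder_feature, decoder_output) -> fused feature.
    head: callable producing the segmentation result.
    """
    out = decoder_blocks[0](f4[3])            # deepest scale goes in first
    for block, skip in zip(decoder_blocks[1:], reversed(f4[:3])):
        fused = asaf(skip, out)               # fuse encoder skip with decoder output
        out = block(fused)
    return head(out)

# String stand-ins make the call order explicit.
blocks = [lambda x, i=i: f"D{i}({x})" for i in range(1, 5)]
trace = decode(["F1", "F2", "F3", "F4"], blocks,
               lambda e, d: f"ASAF({e},{d})",
               lambda x: f"head({x})")
```

Reading `trace` from the inside out reproduces the sequence of steps: the fourth-scale map enters DecoderBlock1, and each shallower scale is fused through ASAF before the next block.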

[0086] This embodiment has at least the following beneficial effects. The network provided in this embodiment has a dual-stream architecture and a multi-level feature fusion strategy. It introduces an efficient VFE module that integrates the MSSM module: parallel multi-scale convolution acts as a high-pass filter to extract local texture details, and a state space model acts as an efficient low-pass filter to model global semantic dependencies. The architecture achieves full-frequency feature fusion with linear computational complexity, which significantly reduces the representation challenges caused by extreme scale changes in remote sensing images. The network also includes an ASAF module, which uses channel statistics (mean and maximum value) as indicators of feature saliency. The module dynamically calibrates the feature alignment between the encoder and decoder to achieve efficient fusion of cross-level features. In addition, a DFE module is introduced, which explicitly reduces the conditional entropy at object boundaries by incorporating high-fidelity elevation gradient information, thus addressing the semantic ambiguity of shadowed regions and spectrally similar objects in remote sensing images.

[0087] With reference to Figures 5 to 7, a set of experimental examples is described below. Datasets: 1) ISPRS Vaihingen: The dataset consists of 16 orthophotos (2000×2500 pixels) and a corresponding normalized digital surface model, both at a ground sampling distance of 9 cm.

[0088] 2) ISPRS Potsdam: The dataset consists of 24 orthophotos (6000×6000 pixels) and a corresponding normalized DSM, at a finer ground sampling distance of 5 cm.

[0089] During training, 10,000 256×256 pixel patches were randomly extracted. For inference, a sliding window method was used, with a step size of 32 for Vaihingen and 64 for Potsdam, to balance segmentation accuracy and testing efficiency.
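The sliding-window inference can be illustrated by computing the window start offsets along one image axis. This is a sketch under an assumption the text does not state: when the stride does not divide the image evenly, one extra window is placed flush with the border so that every pixel is covered. The function name `window_starts` is hypothetical.

```python
def window_starts(length, patch=256, stride=32):
    """Start offsets of sliding windows covering [0, length) along one axis."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        # Assumed border handling: add a final window flush with the edge.
        starts.append(length - patch)
    return starts

# Potsdam tile side of 6000 pixels with the stride of 64 used for that dataset.
potsdam = window_starts(6000, patch=256, stride=64)
# Vaihingen tile side of 2000 pixels with the stride of 32 used for that dataset.
vaihingen = window_starts(2000, patch=256, stride=32)
```

A smaller stride produces more overlapping windows per tile, which is the trade-off between segmentation accuracy and testing efficiency mentioned above.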

[0090] Evaluation metrics: To evaluate the proposed model, three standard metrics were used: Overall Accuracy (OA), mean Intersection over Union (mIoU), and mean F1 score (mF1). These metrics were calculated from the cumulative confusion matrix and can be directly compared with popular methods. The calculation formulas are given in formulas (30) to (35), where the symbols denote the number of categories and, for each class, the corresponding true positives, false positives, and false negatives.
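The three metrics can be computed from a confusion matrix as sketched here, using the standard definitions of OA, per-class IoU and per-class F1 in terms of true positives, false positives and false negatives. The function name and the row/column convention (rows are true classes, columns are predicted classes) are assumptions of this sketch.

```python
def segmentation_metrics(cm):
    """Return (OA, mIoU, mF1) from a square confusion matrix.

    cm[i][j] is the number of pixels whose true class is i and whose
    predicted class is j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    oa = sum(cm[k][k] for k in range(n)) / total
    ious, f1s = [], []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp   # predicted k, but not k
        fn = sum(cm[k]) - tp                        # true k, predicted otherwise
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if denom else 0.0)
    return oa, sum(ious) / n, sum(f1s) / n

# Two-class example with 10 pixels in total.
oa, miou, mf1 = segmentation_metrics([[3, 1], [1, 5]])
```

Because the matrix is accumulated over all test tiles before the ratios are taken, large tiles contribute in proportion to their pixel count.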

[0091] Experimental setup: The model was trained on a single NVIDIA GeForce RTX 3080 GPU within the PyTorch framework, with the training process consisting of 50 iterations optimized by SGD (batch size = 10, learning rate = 0.01, momentum = 0.9, weight decay = 0.0005). After extracting samples through a sliding window, data augmentation was performed through random rotations and flips. The encoder used a ResNet50 backbone network pre-trained on ImageNet, while the decoder weights were randomly initialized.

[0092] Performance comparison: In Figure 5, panels (a) to (o) are respectively: (a) NIRRG image, (b) DSM, (c) ground truth label, (d) ABCNet, (e) TransUNet, (f) UNetFormer, (g) MAResU-Net, (h) CMTFNet, (i) vFuseNet, (j) SA-GATE, (k) ESANet, (l) CMGFNet, (m) CMFNet, (n) SGFNet, and (o) the network of this embodiment. In Figure 6, panels (a) to (o) are respectively: (a) RGB image, (b) DSM, (c) ground truth label, (d) ABCNet, (e) TransUNet, (f) UNetFormer, (g) MAResU-Net, (h) CMTFNet, (i) vFuseNet, (j) SA-GATE, (k) ESANet, (l) CMGFNet, (m) CMFNet, (n) SGFNet, and (o) the network of this embodiment. In Figure 7, panels (a) to (k) are respectively: (a) NIRRG image, (b) DSM, (c) ground truth label, (d) baseline model, (e) DFE module, (f) VFE module, (g) ASAF module, (h) DFE + VFE, (i) DFE + ASAF, (j) VFE + ASAF, and (k) the network of this embodiment.

[0093] A comprehensive comparative study was conducted, comparing the proposed segmentation model with 11 currently popular methods. The selected comparison methods included RGB semantic segmentation methods, namely ABCNet, TransUNet, UNetFormer, MAResU-Net, and CMTFNet; and RGB-DSM multimodal semantic segmentation methods, namely vFuseNet, SA-GATE, ESANet, CMGFNet, CMFNet, and SGFNet.

[0094] 1) The performance of the network (segmentation model) on the ISPRS Vaihingen dataset is quantitatively compared with existing methods in Table 1. This embodiment achieves the leading results in mF1, mIoU, and OA. Specifically, the mIoU is 84.14%, which is 0.86 percentage points higher than that of the second-best method.

[0095] Table 1

[0096] 2) Evaluation on the larger ISPRS Potsdam dataset further demonstrates the effectiveness of the method. As shown in Table 2, the network (segmentation model) outperforms other methods in terms of mF1, mIoU, and OA metrics. Notably, it achieves the highest mIoU of 86.21%, which is 0.78% higher than the second-best method.

[0097] Table 2

[0098] Ablation experiment; A series of ablation experiments were conducted to systematically evaluate the contribution and overall impact of each module in the network (segmentation model) on the Vaihingen dataset. The experiments used CMTFNet with ResNet50 as the baseline model, and the SE attention fusion module was used to fuse RGB and DSM features. Test modules included DFE, VFE, and ASAF; the experimental results are summarized in Table 3. The baseline model had an mIoU of 83.23%. Adding DFE, VFE, and ASAF modules respectively increased mIoU by 0.29%, 0.36%, and 0.25%. In module combination validation, compared to the baseline, the combinations of DFE+VFE, DFE+ASAF, and VFE+ASAF improved performance by 0.54%, 0.50%, and 0.60%, respectively. When all three were integrated, the performance improved by 0.91%.

[0099] Table 3

[0100] Baseline models have limited ability to capture multi-scale contextual information and establish long-range dependencies, leading to inaccurate identification of small-scale objects. In contrast, the complete model, which includes all modules, effectively utilizes multi-scale feature representations and complementary information. This ensemble approach significantly enhances the model's ability to interpret complex remote sensing scenes and improves the boundary accuracy of object segmentation.

[0101] Model complexity analysis: To evaluate the computational efficiency of the network (segmentation model), three key metrics were used: floating-point operations (FLOPs), the number of model parameters, and frames per second (FPS). FLOPs represent computational complexity, while the number of model parameters reflects memory requirements; lower values generally indicate higher efficiency. FPS directly measures inference speed; higher values indicate faster processing speed. As shown in Table 4, compared with existing methods, the network (segmentation model) achieved significant segmentation results while ensuring lower parameters and computational complexity, demonstrating its efficiency in multimodal semantic segmentation tasks.

[0102] Table 4

[0103] As shown in Figure 8, one embodiment of this application provides a segmentation apparatus for remote sensing images, the apparatus comprising: an image acquisition module 1100, used to determine a first image and a second image, wherein the first image is a remote sensing image and the second image is a single-channel image of a digital surface model corresponding to the first image; and an image segmentation module 1200, used to input the first image and the second image into a segmentation model to obtain a segmentation result of the first image output by the segmentation model; wherein the segmentation model includes an encoder and a decoder, and the process by which the segmentation model outputs the segmentation result of the first image includes the following. The encoder extracts a multi-scale first feature map from the first image and a multi-scale second feature map from the second image, and fuses the first and second feature maps of corresponding scales to obtain a multi-scale third feature map. The encoder extracts a multi-scale fourth feature map from the multi-scale third feature map. The process of extracting the fourth feature map at any scale includes: extracting a first intermediate feature map from the third feature map using depthwise convolution, and adding the first intermediate feature map to the third feature map pixel by pixel to obtain a second intermediate feature map; extracting a third intermediate feature map from the second intermediate feature map using a 1x1 convolution, extracting a fourth intermediate feature map from the third intermediate feature map using a 3x3 depthwise convolution, and extracting a fifth intermediate feature map from the third intermediate feature map using a 5x5 depthwise convolution; and generating adaptive weights for the fourth and fifth intermediate feature maps using an adaptive gating mechanism, and fusing the fourth and fifth intermediate feature maps by weighting according to the adaptive weights to obtain the sixth intermediate feature map. The tensor of the sixth intermediate feature map is decomposed into the first feature component of the state space. Position-dependent soft attention weights are generated based on the first feature component, and a state transition matrix is generated based on the soft attention weights. The seventh intermediate feature map is generated based on the state transition matrix, the first feature component, and the first intermediate feature map. The seventh intermediate feature map is enhanced based on the channel attention mechanism to obtain the eighth intermediate feature map. The ninth intermediate feature map is extracted from the eighth intermediate feature map based on the gating mechanism. The ninth intermediate feature map and the first feature component are interacted based on the projection mechanism to obtain the tenth intermediate feature map. The fourth feature map is generated based on the tenth intermediate feature map. The segmentation result of the first image is obtained by fusing the multi-scale fourth feature map through the decoder.

[0104] It should be noted that the remote sensing image segmentation device provided in this embodiment is based on the same inventive concept as the remote sensing image segmentation method described above. Therefore, the content of the remote sensing image segmentation method in the above embodiment is also applicable to the content of the remote sensing image segmentation device in this embodiment, and will not be repeated here.

[0105] As shown in Figure 9, one embodiment of this application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the aforementioned remote sensing image segmentation method. The electronic device includes: at least one battery; at least one memory; at least one processor; and at least one program. The program is stored in the memory, and the processor executes the at least one program to implement the remote sensing image segmentation method described above in this disclosure.

[0106] Electronic devices can be any smart terminal, including mobile phones, tablets, personal digital assistants (PDAs), and in-vehicle computers.

[0107] The electronic devices according to embodiments of this application will now be described in detail.

[0108] The processor 1600 can be implemented using a general-purpose central processing unit (CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this disclosure. The memory 1700 can be implemented as a read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory 1700 can store the operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1700 and is called and executed by the processor 1600 to perform a remote sensing image segmentation method according to an embodiment of this disclosure.

[0109] The input / output interface 1800 is used to implement information input and output. The communication interface 1900 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.). Bus 2000 transmits information between various components of the device (e.g., processor 1600, memory 1700, input / output interface 1800, and communication interface 1900); The processor 1600, memory 1700, input / output interface 1800 and communication interface 1900 are connected to each other within the device via bus 2000.

[0110] This disclosure also provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the remote sensing image segmentation method described above.

[0111] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0112] The above is a detailed description of the preferred embodiments of this application. However, the embodiments of this application are not limited to the above-described implementation methods. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the embodiments of this application. All such equivalent modifications or substitutions are included within the scope defined by the claims of the embodiments of this application.

Claims

1. A method for segmenting remote sensing images, characterized in that the method includes: determining a first image and a second image, wherein the first image is a remote sensing image and the second image is a single-channel image of a digital surface model corresponding to the first image; and inputting the first image and the second image into a segmentation model to obtain the segmentation result of the first image output by the segmentation model; wherein the segmentation model includes an encoder and a decoder, and the process by which the segmentation model outputs the segmentation result of the first image includes the following. The encoder extracts a multi-scale first feature map from the first image and a multi-scale second feature map from the second image, and fuses the first and second feature maps of corresponding scales to obtain a multi-scale third feature map. The encoder extracts a multi-scale fourth feature map from the multi-scale third feature map. The process of extracting the fourth feature map at any scale includes: extracting a first intermediate feature map from the third feature map based on depthwise convolution, and adding the first intermediate feature map to the third feature map pixel by pixel to obtain a second intermediate feature map; extracting a third intermediate feature map from the second intermediate feature map based on a 1x1 convolution, extracting a fourth intermediate feature map from the third intermediate feature map based on a 3x3 depthwise convolution, and extracting a fifth intermediate feature map from the third intermediate feature map based on a 5x5 depthwise convolution; and generating adaptive weights for the fourth and fifth intermediate feature maps based on an adaptive gating mechanism, and fusing the fourth and fifth intermediate feature maps by weighting according to the adaptive weights to obtain the sixth intermediate feature map.
The tensor of the sixth intermediate feature map is decomposed into the first feature component of the state space. Position-dependent soft attention weights are generated based on the first feature component, and a state transition matrix is generated based on the soft attention weights. The seventh intermediate feature map is generated based on the state transition matrix, the first feature component, and the first intermediate feature map. The seventh intermediate feature map is enhanced based on the channel attention mechanism to obtain the eighth intermediate feature map. The ninth intermediate feature map is extracted from the eighth intermediate feature map based on the gating mechanism. The ninth intermediate feature map and the first feature component are interacted based on the projection mechanism to obtain the tenth intermediate feature map. The fourth feature map is generated based on the tenth intermediate feature map. The segmentation result of the first image is obtained by fusing the multi-scale fourth feature map through the decoder.

2. The method for segmenting remote sensing images according to claim 1, characterized in that the step of generating the fourth feature map based on the tenth intermediate feature map includes: combining the tenth intermediate feature map and the second intermediate feature map through a residual connection to obtain the eleventh intermediate feature map; extracting the twelfth intermediate feature map from the eleventh intermediate feature map using a 3x3 convolution, and combining the twelfth intermediate feature map and the eleventh intermediate feature map through a residual connection to obtain the thirteenth intermediate feature map; extracting the fourteenth intermediate feature map from the thirteenth intermediate feature map using a feedforward neural network, and enhancing the fourteenth intermediate feature map using the SE attention mechanism to obtain the fifteenth intermediate feature map; and combining the fifteenth intermediate feature map and the thirteenth intermediate feature map through a residual connection to obtain the fourth feature map.

3. The method for segmenting remote sensing images according to claim 1, characterized in that, The step of extracting a multi-scale second feature map from the second image includes: The horizontal and vertical gradients of the second image are extracted based on the Sobel operator; The horizontal and vertical gradients are normalized to obtain the gradient magnitude; The enhanced second image is obtained by fusing the gradient magnitude with the second image. A second feature map at a first scale is extracted from the enhanced second image based on a CBR block; the CBR block includes cascaded convolutional layers, BN layers, and ReLU activation functions; Based on the feature extraction blocks, a second feature map of a higher scale than the first scale is extracted from the second feature map of the first scale; The step of extracting a multi-scale first feature map from the first image includes: Based on the CBR block, a first feature map of the first scale is extracted from the first image; Based on the feature extraction block, a first feature map of a higher scale than the first scale is extracted from the first feature map of the first scale.

4. The method for segmenting remote sensing images according to claim 3, characterized in that, The feature extraction block includes a ResBlock block; the ResBlock block includes a cascaded 1x1 convolutional layer, a 3x3 convolutional layer, and a 1x1 convolutional layer; The step of extracting a second feature map of a higher scale than the first scale from the second feature map at the first scale based on feature extraction blocks includes: The second feature map at the second scale is extracted from the second feature map at the first scale based on the three stacked ResBlock blocks; The second feature map at the third scale is extracted from the second feature map at the second scale based on the four stacked ResBlock blocks; The second feature map at the fourth scale is extracted from the second feature map at the third scale based on the six stacked ResBlock blocks; the scales from the first scale to the fourth scale gradually increase. The step of extracting a first feature map at a higher scale than the first scale from the first feature map at the first scale based on feature extraction blocks includes: The first feature map at the second scale is extracted from the first feature map at the first scale based on the three stacked ResBlock blocks; The first feature map at the third scale is extracted from the first feature map at the second scale based on the four stacked ResBlock blocks; The first feature map of the fourth scale is extracted from the first feature map of the third scale based on the six stacked ResBlock blocks.

5. The method for segmenting remote sensing images according to claim 4, characterized in that, The decoder includes four DecoderBlock blocks based on the CMTFNet network; The step of fusing the multi-scale fourth feature map through the decoder to obtain the segmentation result of the first image includes: The fourth feature map of the fourth scale is input into the first DecoderBlock block to obtain the output feature map of the first DecoderBlock block; The fourth feature map at the third scale is fused with the output feature map of the first DecoderBlock to obtain the first fused feature map, and the first fused feature map is input into the second DecoderBlock to obtain the output feature map of the second DecoderBlock. The fourth feature map of the second scale is fused with the output feature map of the second DecoderBlock to obtain the second fused feature map, and the second fused feature map is input into the third DecoderBlock to obtain the output feature map of the third DecoderBlock. The fourth feature map of the first scale is fused with the output feature map of the third DecoderBlock to obtain the third fused feature map, and the third fused feature map is input into the fourth DecoderBlock to obtain the output feature map of the fourth DecoderBlock. Based on the detection head, the segmentation result of the first image is extracted from the output feature map of the fourth DecoderBlock block.

6. The method for segmenting remote sensing images according to claim 5, characterized in that the process of fusing a fourth feature map with the output feature map of the corresponding DecoderBlock block to obtain the corresponding fused feature map includes: subjecting the fourth feature map to average pooling and max pooling respectively to obtain the average pooling feature map and the max pooling feature map; extracting a spatial attention feature map from the average pooling feature map and the max pooling feature map based on the spatial attention mechanism; multiplying the spatial attention feature map and the fourth feature map pixel by pixel to obtain the sixteenth feature map; weighting and fusing the sixteenth feature map and the output feature map to obtain the seventeenth feature map; and extracting the fused feature map from the seventeenth feature map based on depthwise separable convolution.

7. The method for segmenting remote sensing images according to claim 6, characterized in that the loss function of the segmentation model during training includes three formulas, in which the symbols denote, respectively: the loss function of the segmentation model, the main output loss, the auxiliary output loss, the auxiliary loss weight, the cross-entropy loss, the class imbalance loss, the boundary-aware loss, and the weights of the corresponding losses.

8. A segmentation device for remote sensing images, characterized in that, The device includes: An image acquisition module is used to determine a first image and a second image; wherein the first image is a remote sensing image, and the second image is a single-channel image of a digital surface model corresponding to the first image; An image segmentation module is used to input the first image and the second image into a segmentation model to obtain a segmentation result of the first image output by the segmentation model; wherein, the segmentation model includes an encoder and a decoder, and the process of the segmentation model outputting the segmentation result of the first image includes: The encoder extracts a multi-scale first feature map from the first image and a multi-scale second feature map from the second image, and fuses the first and second feature maps of corresponding scales to obtain a multi-scale third feature map. The encoder extracts a multi-scale fourth feature map from the multi-scale third feature map. The process of extracting the fourth feature map at any scale includes: extracting a first intermediate feature map from the third feature map based on depthwise convolution and adding the first intermediate feature map to the third feature map pixel-by-pixel to obtain a second intermediate feature map; extracting a third intermediate feature map from the second intermediate feature map based on a 1x1 convolution, extracting a fourth intermediate feature map from the third intermediate feature map based on a 3x3 depthwise convolution, and extracting a fifth intermediate feature map from the third intermediate feature map based on a 5x5 depthwise convolution; generating adaptive weights for the fourth and fifth intermediate feature maps based on an adaptive gating mechanism, and weighting and fusing the fourth and fifth intermediate feature maps according to the adaptive weights. 
The weighted fusion of the fourth and fifth intermediate feature maps yields the sixth intermediate feature map. The tensor of the sixth intermediate feature map is decomposed into the first feature component of the state space. Position-dependent soft attention weights are generated based on the first feature component, and a state transition matrix is generated based on the soft attention weights. The seventh intermediate feature map is generated based on the state transition matrix, the first feature component, and the first intermediate feature map. The seventh intermediate feature map is enhanced based on the channel attention mechanism to obtain the eighth intermediate feature map. The ninth intermediate feature map is extracted from the eighth intermediate feature map based on the gating mechanism. The ninth intermediate feature map and the first feature component are interacted based on the projection mechanism to obtain the tenth intermediate feature map. The fourth feature map is generated based on the tenth intermediate feature map. The segmentation result of the first image is obtained by fusing the multi-scale fourth feature map through the decoder.

9. An electronic device, characterized in that, It includes at least one controller and a memory for communicatively connecting with the controller; the memory stores instructions executable by the at least one controller, the instructions being executed by the at least one controller to cause the at least one controller to perform a remote sensing image segmentation method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores computer-executable instructions for causing a computer to perform a remote sensing image segmentation method as described in any one of claims 1 to 7.