A remote sensing image segmentation method and related device
By designing an encoder-decoder architecture that combines a multi-scale convolutional attention network with a frequency-sensitive attention module, the method addresses long-distance spatial dependence and multi-scale feature fusion in remote sensing images and achieves high-precision remote sensing image segmentation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV OF TECH
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-23
AI Technical Summary
Traditional CNN models struggle to effectively model the long-distance spatial dependencies and complex contextual information that are prevalent in remote sensing images, and their multi-scale feature fusion capabilities are insufficient, resulting in inadequate segmentation accuracy and difficulty in identifying small targets in remote sensing images.
An encoder-decoder architecture is adopted, which combines a multi-scale convolutional attention network, a frequency-sensitive attention module, a context-aware Transformer block, and a multilayer perceptron. Multi-scale feature extraction and fusion are achieved through skip connections and cascaded processing, which enhances high-frequency detail information and improves segmentation accuracy.
It significantly improves the accuracy and robustness of remote sensing image segmentation, especially the ability to characterize the boundaries of ground features in complex scenes and the accuracy of small target recognition, while maintaining low computational cost.
Abstract
Description
Technical Field
[0001] This application belongs to the field of computer vision and remote sensing image processing technology, and in particular relates to a remote sensing image segmentation method and related equipment.
Background Technology
[0002] Semantic segmentation of remote sensing images is a key technology that assigns specific land cover category labels (such as buildings, roads, vegetation, water bodies, etc.) to each pixel in high-resolution remote sensing images. It has broad application prospects in fields such as urban planning, environmental monitoring, disaster assessment, and land cover classification. In recent years, with the rapid development of deep learning technology, deep learning-based methods have made significant progress in the task of semantic segmentation of remote sensing images.
[0003] However, traditional CNN models primarily rely on local receptive fields for feature extraction, making it difficult to effectively model the long-range spatial dependencies and complex contextual information widely present in remote sensing images. For example, in high-resolution remote sensing images, the same type of land cover often spans a large spatial area, and their semantic consistency requires global contextual understanding for accurate identification. Furthermore, remote sensing images typically contain multi-scale targets, placing higher demands on the model's multi-scale feature fusion capabilities.
[0004] Therefore, there is an urgent need for a remote sensing image segmentation method and related equipment that can efficiently integrate local details and global context and adapt to the multi-scale characteristics of remote sensing images.
Summary of the Invention
[0005] In view of the shortcomings of the prior art, the purpose of this application is to provide a remote sensing image segmentation method and related equipment, the method comprising the following steps: The remote sensing image to be segmented is acquired, and the trained deep learning network model is used to obtain the remote sensing image semantic segmentation result of the remote sensing image to be segmented. The deep learning network model includes an encoder-decoder architecture. The encoder is a feature extraction encoder based on a multi-scale convolutional attention network, used to extract multi-scale features. The decoder includes a frequency-sensitive attention module, a context-aware Transformer block, and a decoder head based on a multilayer perceptron. The output feature maps of different levels of the encoder are connected by skip connections and input into the corresponding frequency-sensitive attention module for frequency decomposition to obtain output information after enhancing high-frequency components. The output information of the frequency-sensitive attention module is input into the stacked context-aware Transformer blocks for cascaded context enhancement processing to form multi-level decoded features. The decoder head is used to fuse the multi-level decoded features through a multilayer perceptron to obtain the semantic segmentation result of the remote sensing image.
[0006] Preferably, the training process of the deep learning network model includes: A training dataset is constructed by acquiring remote sensing image data containing target features and preprocessing it. Pixel-level semantic annotation is performed on the preprocessed remote sensing image data to generate label data. The remote sensing image data and the corresponding label data are divided into a training set and a validation set. Data augmentation is performed on the data in the training set, including one or more of random scaling, random horizontal flipping, random vertical flipping, and random rotation. The deep learning network model is iteratively trained based on the data-augmented training dataset, and the model performance during the training process is verified using the validation set.
[0007] Preferably, the frequency-sensitive attention module is used to explicitly decompose the input feature map into high-frequency components and low-frequency components, assign weights to the high-frequency components and low-frequency components according to the local energy response and perform weighted summation, and fuse the weighted high-frequency components and low-frequency components to selectively enhance key frequency information.
[0008] Preferably, the context-aware Transformer block includes a global-local hybrid attention layer and a gated convolutional feedforward network layer; the global-local hybrid attention layer is used to capture global contextual dependencies and local detailed features; the gated convolutional feedforward network layer is used to selectively propagate and enhance features.
[0009] Preferably, the global-local hybrid attention layer includes a global attention branch and a local attention branch; the global attention branch calculates global attention weights based on a self-attention mechanism; the local attention branch extracts local features and generates local attention weights through pointwise convolution and depthwise convolution operations; the global-local hybrid attention layer is also used to add the global attention weights and the local attention weights to obtain a hybrid attention output, which is used as the input feature of the gated convolutional feedforward network layer.
[0010] Preferably, the gated convolutional feedforward network layer includes a first processing branch and a second processing branch; the first processing branch generates intermediate features through cascaded pointwise convolutional layers and depthwise convolutional layers; the second processing branch generates gate weights through cascaded pointwise convolutional layers, depthwise convolutional layers, and nonlinear activation function layers; the gate weights are multiplied element-wise with the intermediate features to obtain the enhanced output features.
[0011] Preferably, the decoding head is used to fuse the multi-level decoding features through a multi-layer perceptron to obtain the semantic segmentation result of the remote sensing image, including: The multi-level decoding features are channel aligned to obtain a multi-level feature map with the same number of channels; The multi-level feature maps are upsampled using bilinear interpolation to obtain multi-level feature maps of the same size; The multi-level feature maps of the same size are added and fused together, and then convolutional mapping is performed to obtain the semantic segmentation result of the remote sensing image.
[0012] A second aspect of this application provides an electronic device including one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method described in the first aspect.
[0013] A third aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by at least one processor, implements the steps of the method described in any of the preceding claims.
[0014] A fourth aspect of this application provides a computer program product comprising a computer program that, when executed by at least one processor, implements the steps of the method described in any one of the claims.
[0015] This application provides a remote sensing image segmentation method and related equipment. By designing a novel encoder-decoder architecture, it effectively integrates the local feature extraction capabilities of CNNs and the global context modeling capabilities of Transformers. By reintroducing local inductive bias through a gated convolutional feedforward network, it significantly improves the ability to delineate and recognize the boundaries of fine-grained features (such as roads, rivers, and buildings) in remote sensing images. Furthermore, it introduces a frequency-sensitive attention mechanism to explicitly enhance high-frequency detail information that is crucial for the segmentation task.
Attached Figure Description
[0016] The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Throughout the drawings, the same reference numerals denote the same components. Obviously, the drawings described below are merely some embodiments described in this application, and those skilled in the art can obtain other drawings based on these drawings.
[0017] Figure 1 is a schematic diagram of a frequency-sensitive context-aware network model structure provided in an embodiment of this application.
[0018] Figure 2 is a schematic diagram of a frequency-sensitive attention module structure provided in an embodiment of this application.
[0019] Figure 3 is a schematic diagram of a context-aware Transformer block structure provided in an embodiment of this application.
[0020] Figure 4 is a schematic diagram of a global-local hybrid attention and gated convolutional feedforward network structure provided in an embodiment of this application.
[0021] Figure 5 is a schematic diagram illustrating the semantic segmentation effect of a remote sensing image using the method described in the embodiments of this application.
Detailed Implementation
[0022] To enable those skilled in the art to better understand the technical solutions in the embodiments of this application, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. It should be understood that these descriptions are exemplary only and are not intended to limit the scope of this application. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this application. Furthermore, in the following description, descriptions of well-known structures and technologies are omitted to avoid unnecessarily obscuring the concepts disclosed in this application.
[0023] Semantic segmentation of remote sensing images faces many challenges: 1) Extreme variations in the scale of ground features, especially some key but small targets (such as vehicles, small buildings, specific facilities, etc.) are difficult to identify effectively; 2) Dense layout and blurred boundaries of ground features in complex scenes lead to inaccurate segmentation results; 3) High-resolution image processing places stringent demands on computational efficiency.
[0024] Related technologies apply CNN models to semantic segmentation tasks of remote sensing images. Traditional CNN models mainly rely on local receptive fields for feature extraction, which makes it difficult to effectively model the long-range spatial dependencies and complex contextual information widely present in remote sensing images. To overcome this limitation, the Transformer architecture can be introduced into remote sensing image segmentation tasks. Although the Transformer, with its self-attention mechanism, can theoretically capture the global dependencies between any two pixels in an image, directly applying the standard Vision Transformer to remote sensing image segmentation still faces challenges: on the one hand, remote sensing images have high resolution and large data volumes, leading to a sharp increase in computational complexity; on the other hand, the original Transformer lacks the ability to finely model local details and edge structures, easily resulting in blurred segmentation boundaries or missed detection of small targets.
[0025] In related technologies, some methods attempt to balance local and global features by introducing pyramid structures, dilated convolutions, or hybrid CNN-Transformer architectures. However, when dealing with complex scenes unique to remote sensing images (such as dense building clusters, irregular farmland, etc.), there are still problems such as insufficient utilization of contextual information, redundant feature representations, or low computational efficiency.
[0026] This application proposes a remote sensing image segmentation method and related equipment, which can achieve efficient semantic segmentation that integrates local details and global context and adapts to the multi-scale characteristics of remote sensing images, thereby improving segmentation accuracy and robustness.
[0027] The technical solutions of the embodiments of this application and how the technical solutions of the embodiments of this application solve the above-mentioned technical problems will be described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the various embodiments or technical features described below can be arbitrarily combined to form new embodiments. The order of description of the embodiments below is not intended to limit the preferred order of embodiments. The same or similar concepts or processes may not be described again in some embodiments. Obviously, the described embodiments are some embodiments of the embodiments of this application, but not all embodiments.
[0028] Embodiment 1.
[0029] This embodiment provides a remote sensing image segmentation method, which includes the following steps: The remote sensing image to be segmented is acquired, and the trained deep learning network model is used to obtain the remote sensing image semantic segmentation result of the remote sensing image to be segmented. The deep learning network model includes an encoder-decoder architecture. The encoder is a feature extraction encoder based on a multi-scale convolutional attention network, used to extract multi-scale features. The decoder includes a frequency-sensitive attention module, a context-aware Transformer block (context-aware module), and a decoder head based on a multilayer perceptron. The output feature maps of different levels of the encoder are connected by skip connections and input into the corresponding frequency-sensitive attention module for frequency decomposition to obtain output information after enhancing high-frequency components. The output information of the frequency-sensitive attention module is input into the stacked context-aware Transformer blocks for cascaded context enhancement processing to form multi-level decoded features. The decoder head is used to fuse the multi-level decoded features through a multilayer perceptron to obtain the semantic segmentation result of the remote sensing image.
[0030] It can be understood that the remote sensing image segmentation method provided in this embodiment is based on a deep learning network model (i.e., a frequency-sensitive context-aware network model), and aims to solve the problem of insufficient segmentation accuracy caused by large scale variations, blurred boundaries, and difficulty in identifying small targets when processing remote sensing images.
[0031] Through the frequency-sensitive attention module, the model can selectively enhance high-frequency information that is crucial to segmentation boundaries, effectively alleviating the problem of blurred ground feature boundaries in complex scenes and significantly improving segmentation integrity. The context-aware Transformer block can combine global attention and local convolution operations to achieve powerful context modeling capabilities while maintaining low computational cost.
[0032] To address the challenge of modeling long-range spatial dependencies and complex contextual information, this method progressively expands the receptive field by stacking multiple Transformer blocks, effectively modeling the semantic consistency of similar land features spanning a large spatial area in remote sensing images. Simultaneously, context enhancement processing refines and enhances global contextual information, avoiding the limitations of traditional CNNs that cannot understand complex contextual relationships due to their local receptive field.
[0033] To address the issue of insufficient multi-scale feature fusion capability, a multi-scale convolutional attention network is used as the backbone of the encoder, enabling multi-scale capture capability during the feature extraction stage. The output feature maps from different levels of the encoder are introduced into the decoder simultaneously through skip connections, combining shallow details with deep semantics. The decoder head fuses these multi-level decoded features using a multi-layer perceptron, achieving effective integration of features at different scales. This end-to-end multi-scale processing approach, from extraction to fusion, overcomes the limitations of single-scale processing.
[0034] In an exemplary embodiment, the training process of the deep learning network model includes: S101, construct a training dataset, acquire remote sensing image data containing target objects and preprocess it, perform pixel-level semantic annotation on the preprocessed remote sensing image data to generate label data, and divide the remote sensing image data and the corresponding label data into a training set and a validation set. As an example, the process of constructing a training dataset may include: acquiring and filtering remote sensing image data containing the features to be segmented, and pre-processing it into a data format suitable for computer processing and subsequent steps. For the pre-processed remote sensing image data, label data is obtained through manual annotation. The remote sensing image data and label data are then cropped to a uniform size of 512×512 and divided into training and validation sets according to a certain ratio.
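The dataset construction in S101 can be sketched as follows. This is a minimal illustration, assuming NumPy arrays as the in-memory format, non-overlapping tiles, and a 20% validation fraction; the source only states a 512×512 crop size and "a certain ratio", so the function names `tile_pairs` and `split_train_val`, the tiling strategy, and the ratio are all hypothetical.

```python
import numpy as np

def tile_pairs(image, label, tile=512):
    """Cut an image (H, W, C) and its label map (H, W) into aligned, non-overlapping tiles."""
    h, w = label.shape
    tiles = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            tiles.append((image[top:top + tile, left:left + tile],
                          label[top:top + tile, left:left + tile]))
    return tiles

def split_train_val(samples, val_fraction=0.2, seed=0):
    """Shuffle samples reproducibly and split them into training and validation lists.

    val_fraction is an assumed value; the source does not give the ratio.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))
    n_val = int(round(len(samples) * val_fraction))
    val = [samples[i] for i in order[:n_val]]
    train = [samples[i] for i in order[n_val:]]
    return train, val
```

Cropping the image and the label with the same indices keeps every pixel aligned with its annotation, which is the property the later training steps rely on.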
[0035] S102, perform data augmentation on the data in the training set, the data augmentation including one or more of random scaling, random horizontal flipping, random vertical flipping, and random rotation. As an example, the data augmentation pipeline applied to the dataset during training includes random scaling, random flipping, and random rotation. Specifically, each image is randomly scaled by a factor between 0.5 and 2; horizontal flipping, vertical flipping, and random rotation based on a 90° reference angle are each applied with a probability of 0.5; and the result is finally cropped to a size of 512×512.
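One possible reading of the augmentation in S102 is sketched below. Several details are assumptions not stated in the source: rotation is taken to mean a random multiple of 90°, nearest-neighbour resizing is used for both image and label for simplicity, and samples smaller than the crop size are padded with a hypothetical ignore label of 255. The helper names are illustrative only.

```python
import numpy as np

def resize_nearest(arr, scale):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = arr.shape[:2]
    new_h, new_w = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return arr[rows][:, cols]

def augment(image, label, rng, crop=512, ignore_index=255):
    """Apply the same random geometric transform to an image (H, W, C) and label (H, W)."""
    scale = rng.uniform(0.5, 2.0)                 # random scaling factor in [0.5, 2]
    image, label = resize_nearest(image, scale), resize_nearest(label, scale)
    if rng.random() < 0.5:                        # horizontal flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                        # vertical flip
        image, label = image[::-1], label[::-1]
    if rng.random() < 0.5:                        # assumed: rotate by 90, 180, or 270 degrees
        k = int(rng.integers(1, 4))
        image, label = np.rot90(image, k), np.rot90(label, k)
    # Pad (assumption) so that a crop x crop window always fits.
    h, w = label.shape
    pad_h, pad_w = max(0, crop - h), max(0, crop - w)
    image = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), constant_values=0)
    label = np.pad(label, ((0, pad_h), (0, pad_w)), constant_values=ignore_index)
    h, w = label.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    return (np.ascontiguousarray(image[top:top + crop, left:left + crop]),
            np.ascontiguousarray(label[top:top + crop, left:left + crop]))
```

A call such as `augment(img, lab, np.random.default_rng(0))` returns a 512×512 image and label pair. In practice the image would usually be resized with bilinear interpolation while the label keeps nearest-neighbour sampling so that class indices are never blended.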
[0036] S103, the deep learning network model is iteratively trained based on the data-augmented training dataset, and the model performance during the training process is verified using the validation set.
[0037] Based on the constructed training dataset, the deep learning network model (the frequency-sensitive context-aware network model) is trained and validated. The structure of the deep learning network model is shown in Figure 1. The model adopts an encoder-decoder architecture.
[0038] The encoder (High-Efficiency Hierarchical Convolutional Attention Encoder) is a feature extraction encoder based on a Multi-Scale Convolutional Attention Network (MSCAN). The decoder (Multi-Scale Context Enhancement Encoder with Frequency-Sensitive Optimization) consists of three parts: a Frequency-Sensitive Attention (FSA) module, a Context-Aware Transformer Block (CATB), and a multilayer perceptron-based decoding head. Feature maps from different levels of the encoder are input to the decoder via skip connections.
[0039] In an exemplary embodiment, the frequency-sensitive attention module is used to explicitly decompose the input feature map into high-frequency components and low-frequency components, assign weights to the high-frequency components and low-frequency components according to the local energy response and perform weighted summation, and fuse the weighted high-frequency components and low-frequency components to selectively enhance key frequency information.
[0040] Specifically, the structure of the frequency-sensitive attention module in this embodiment is shown in Figure 2. This module explicitly decomposes the feature map into high-frequency and low-frequency components and adaptively assigns different weights to the high-frequency and low-frequency components based on the local energy response, thereby selectively enhancing key information during feature fusion. The weights of the different frequency components can be expressed as:
$$w_c(p) = \sigma\big(\alpha_c\, r_c(p) + \beta_c\, s_c(p) + b\big), \quad c \in \{h, l\}$$
[0041] In the formula, $r_c(p)$ and $s_c(p)$ represent the local energy response and the linear intensity measurement, respectively, within a k×k neighborhood around position p; $\alpha_c$, $\beta_c$, and $b$ are the weighting coefficients; and $\sigma$ represents the sigmoid function.
[0042] The calculation process is as follows: within the k×k neighborhood of position p, the energy response $r_c(p)$ and the linear intensity $s_c(p)$ are calculated for each frequency component $c \in \{h, l\}$. The two statistics are linearly weighted by the learnable coefficients $\alpha_c$ and $\beta_c$, a bias $b$ is added, and the Sigmoid function $\sigma$ maps the result to a probabilistic weight in the (0,1) interval. Here h and l are frequency component type identifiers, corresponding to the high-frequency and low-frequency components of the image, respectively. The weight calculation is therefore performed independently twice, which physically means that the feature map is explicitly decoupled in the frequency domain. The attention weight $w_h(p)$ for the high-frequency component at position p determines the strength of detail and edge information preservation at that position. In remote sensing imagery, high-frequency information typically corresponds to abrupt spatial changes such as the edges, textures, details, and small objects of ground features (road lines, building outlines, vehicles, etc.), so this weight determines how strongly the model preserves fine geometric structures at the current position. By explicitly calculating this weight, this application can specifically enhance the feature response of small ground features and mitigate the loss of boundary information caused by downsampling. The attention weight $w_l(p)$ for the low-frequency component at position p determines the strength of smoothing and semantic information preservation at that position. In remote sensing imagery, low-frequency information typically corresponds to smooth regions, large-scale semantic background (such as lakes, farmland, and sky), slowly changing illumination, and ground feature background, so this weight determines how strongly the model preserves global semantic consistency at the current position. Finally, the calculated weights $w_h(p)$ and $w_l(p)$ are used to weight and fuse the original high-frequency and low-frequency components, respectively. Within the same framework, the network can thus adaptively preserve the sharp boundaries of complex features such as buildings (high-frequency enhancement) and ensure the semantic integrity of large-area features such as water bodies and vegetation (low-frequency preservation), thereby improving the segmentation accuracy of remote sensing images.
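The decompose, weight, and fuse procedure described above can be illustrated with a small NumPy sketch. The source does not specify how the frequency decomposition is performed or how the two statistics are computed, so the following are assumptions: the low-frequency component is a k×k local mean, the high-frequency component is the residual, the energy response is the local mean of squared values, and the linear intensity is the local mean of absolute values, both averaged over channels. The learnable coefficients are passed in as plain numbers under the hypothetical `params` dictionary.

```python
import numpy as np

def box_filter(x, k=3):
    """Mean over a k x k neighbourhood for each channel of x (C, H, W), with edge padding."""
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    h, w = x.shape[1:]
    out = np.zeros(x.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frequency_sensitive_attention(x, params, k=3):
    """x: (C, H, W). params maps 'h' and 'l' to (coef_energy, coef_intensity, bias)."""
    low = box_filter(x, k)            # assumed low-pass: local mean
    high = x - low                    # assumed high-pass: residual
    out = np.zeros(x.shape, dtype=float)
    weights = {}
    for name, comp in (("h", high), ("l", low)):
        coef_r, coef_s, bias = params[name]
        r = box_filter(comp ** 2, k).mean(axis=0, keepdims=True)    # local energy response
        s = box_filter(np.abs(comp), k).mean(axis=0, keepdims=True) # linear intensity
        w = sigmoid(coef_r * r + coef_s * s + bias)                 # weight in (0, 1)
        weights[name] = w
        out += w * comp               # weighted fusion of the two components
    return out, weights
```

Because the two components sum back to the input in this sketch, driving both weights towards 1 recovers the original feature map, while unequal weights emphasize edges or smooth regions position by position.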
[0043] The energy ratio refers to the proportion of the energy of the high-frequency components to that of the low-frequency components in the feature map. It measures the significance of different frequency components in a local region and is used for adaptive weight allocation. For position p in the feature map, the frequency-energy ratio $R(p)$ takes the form:
$$R(p) = \frac{\lVert F_h(p) \rVert_2}{\lVert F_l(p) \rVert_2 + \epsilon}$$
[0044] Here, $\lVert F_h(p) \rVert_2$ is the L2 norm of the high-frequency feature at position p, $\lVert F_l(p) \rVert_2$ is the L2 norm of the low-frequency feature at position p, and $\epsilon$ is a very small positive number used to avoid a zero denominator. This ratio reflects the strength of high-frequency information relative to low-frequency information at position p.
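As a worked illustration of the energy ratio, the following sketch computes the per-position ratio of L2 norms, taking the norm over the channel axis. The function name, the (C, H, W) layout, and the default value of the small constant are assumptions made for the example.

```python
import numpy as np

def frequency_energy_ratio(high, low, eps=1e-6):
    """Ratio of high- to low-frequency L2 norms at each position.

    high, low: arrays of shape (C, H, W); the norm is taken over the channel axis.
    eps: small positive constant that keeps the denominator away from zero.
    """
    high_norm = np.linalg.norm(high, axis=0)
    low_norm = np.linalg.norm(low, axis=0)
    return high_norm / (low_norm + eps)
```

For example, with four channels where every high-frequency value is 3 and every low-frequency value is 1, the norms are 6 and 2, so the ratio is approximately 3 at each position.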
[0045] In an exemplary embodiment, the context-aware Transformer block includes a global-local hybrid attention layer and a gated convolutional feedforward network layer; the global-local hybrid attention layer is used to capture global contextual dependencies and local detailed features; the gated convolutional feedforward network layer is used to selectively propagate and enhance features. As shown in Figure 3, the block includes a Norm (normalization) layer, a GLHA (global-local hybrid attention) layer, and a GCFN (gated convolutional feedforward network) layer.
[0046] The global-local hybrid attention layer can provide efficient global-local information fusion, enabling the model to understand the global semantics of the image and accurately capture local details, thereby achieving better segmentation accuracy on remote sensing data of various sensors and scales.
[0047] In an exemplary embodiment, the global-local hybrid attention layer includes a global attention branch and a local attention branch. The global attention branch calculates global attention weights based on a self-attention mechanism. The local attention branch extracts local features and generates local attention weights through pointwise convolution and depthwise convolution operations. The global-local hybrid attention layer is further used to add the global attention weights and the local attention weights to obtain a hybrid attention output, which serves as the input feature for a gated convolutional feedforward network layer. Thus, by simultaneously capturing global dependencies and local detailed features through a dual-branch parallel architecture, the shortcomings of the standard Transformer in neglecting local structures are addressed.
[0048] Specifically, as shown in Figure 4, global-local hybrid attention includes global attention and local attention. Global attention generates Q, K, and V feature vectors from the feature vector of each spatial location and calculates the attention weights:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V$$
[0049] In the formula, $d$ represents the dimension of each feature vector of the matrices Q, K, and V, and $T$ represents the transpose operation on matrix K.
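The global attention branch can be illustrated with standard scaled dot-product self-attention over the flattened spatial locations. This is a generic single-head sketch, not the model's actual layer: the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical stand-ins for learned parameters, and multi-head splitting is omitted.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def global_attention(x, w_q, w_k, w_v):
    """x: (N, C) with one row per spatial location; w_q, w_k, w_v: (C, d) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]                                   # feature dimension used for scaling
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)     # (N, N) attention weights
    return attn @ v, attn
```

Each row of the attention matrix sums to 1, so every location's output is a convex combination of the value vectors of all locations. That is what gives the branch its global receptive field, and the N×N matrix is also the source of the computational cost noted for high-resolution images.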
[0050] The input features are passed through a pointwise convolution to obtain features, and the feature vector V is passed through a depthwise convolution to obtain features, which yields the local attention weights. Subsequently, the global attention and the local attention are summed to obtain the output.
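The local branch and the final summation might be sketched as below. The source does not fully specify how the two convolutions are wired together, so this example makes an explicit assumption: the pointwise (1×1) convolution projects the input to V, a depthwise convolution over V produces the local term, and that term is added to the output of the global branch. All function and parameter names are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv(x, kernels):
    """Per-channel k x k filtering with zero padding. x: (C, H, W); kernels: (C, k, k)."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx][:, None, None] * xp[:, dy:dy + h, dx:dx + w]
    return out

def hybrid_attention_output(global_out, x, w_v, dw_kernels):
    """Add a local term, computed from V by depthwise convolution, to the global output."""
    v = pointwise_conv(x, w_v)             # assumed: V comes from a 1x1 projection
    local = depthwise_conv(v, dw_kernels)  # local neighbourhood aggregation per channel
    return global_out + local
```

Here `global_out` is expected in the same (C, H, W) layout as the local term, for instance the attention output reshaped back to the spatial grid. The depthwise convolution only looks at a k×k window per channel, which is how the local inductive bias is reintroduced alongside the global attention.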
[0051] In an exemplary embodiment, the gated convolutional feedforward network layer includes a first processing branch and a second processing branch; the first processing branch generates intermediate features through cascaded pointwise convolutional layers and depthwise convolutional layers; the second processing branch generates gate weights through cascaded pointwise convolutional layers, depthwise convolutional layers, and nonlinear activation function layers; the gate weights are multiplied element-wise with the intermediate features to obtain the enhanced output features.
[0052] Specifically, the gated convolutional feedforward network layer consists of pointwise convolutional layers, depthwise convolutional layers, and nonlinear activation function layers. First, the input features are passed through a pointwise convolutional layer and a depthwise convolutional layer to obtain intermediate features. Simultaneously, the input features are passed through a pointwise convolutional layer, a depthwise convolutional layer, and a nonlinear activation function layer to obtain gate weights. Finally, the gate weights are multiplied element-wise with the intermediate features to obtain the output features.
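The two-branch gating described above can be sketched in NumPy as follows. The activation function in the gate branch is not named in the source; a sigmoid is assumed here purely for illustration, and the parameter dictionary keys (`pw1`, `dw1`, `pw2`, `dw2`) are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def depthwise_conv(x, kernels):
    """Per-channel k x k filtering with zero padding. x: (C, H, W); kernels: (C, k, k)."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w), dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx][:, None, None] * xp[:, dy:dy + h, dx:dx + w]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv_ffn(x, p):
    """First branch gives intermediate features; second branch gives gate weights."""
    inter = depthwise_conv(pointwise_conv(x, p["pw1"]), p["dw1"])
    gate = sigmoid(depthwise_conv(pointwise_conv(x, p["pw2"]), p["dw2"]))  # assumed activation
    return gate * inter                                                    # element-wise gating
```

The gate acts as a per-position, per-channel switch: values near 0 suppress the corresponding intermediate feature and values near 1 pass it through, which is the selective propagation the layer is intended to provide.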
[0053] In an exemplary embodiment, the decoding head is used to fuse the multi-level decoding features through a multi-layer perceptron to obtain a semantic segmentation result of the remote sensing image, including: The multi-level decoding features are channel aligned to obtain a multi-level feature map with the same number of channels; The multi-level feature maps are upsampled using bilinear interpolation to obtain multi-level feature maps of the same size; The multi-level feature maps of the same size are added and fused together, and then convolutional mapping is performed to obtain the semantic segmentation result of the remote sensing image.
[0054] As an example, first, the feature maps of different levels generated by the decoder are each passed through a 1×1 convolution to obtain feature maps with the same number of channels.
[0055] Subsequently, the feature maps are upsampled to the same size using bilinear interpolation.
[0056] Finally, the feature maps are summed and then passed through a 1×1 convolution to obtain the final output feature map.
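The three decoder head steps (channel alignment, bilinear upsampling, summation followed by a 1×1 mapping) can be sketched as below. The sketch assumes that the first feature map in the list has the highest resolution and is used as the target size, and that the class prediction is the per-pixel argmax of the output map. Both are illustrative choices, as are the function names.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution, i.e. a per-pixel linear layer. x: (C_in, H, W); weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation of x (C, H, W) to (C, out_h, out_w), aligning pixel centres."""
    c, h, w = x.shape
    ys = np.clip((np.arange(out_h) + 0.5) * h / out_h - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * w / out_w - 0.5, 0, w - 1)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[None, :, None], (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def mlp_decoder_head(features, align_weights, cls_weight):
    """features: list of (C_i, H_i, W_i) maps; features[0] is assumed to be the largest."""
    out_h, out_w = features[0].shape[1:]
    fused = 0.0
    for feat, weight in zip(features, align_weights):
        aligned = pointwise_conv(feat, weight)                  # same channel count
        fused = fused + bilinear_resize(aligned, out_h, out_w)  # same spatial size, then sum
    logits = pointwise_conv(fused, cls_weight)                  # map to class scores
    return logits.argmax(axis=0), logits
```

Because every learned operation here is a 1×1 convolution, the head is a per-pixel linear mapping applied to the fused multi-level features, which is consistent with describing it as based on a multilayer perceptron.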
[0057] As an example, a specific implementation of the technical solution of Embodiment 1 includes the following steps: Step 1: Construct the training dataset. Collect and select the required remote sensing image data, and use manual visual interpretation to semantically label each pixel in the image.
[0058] Step 2: Build a data augmentation process to augment the dataset during training.
[0059] Step 3: Train the deep learning network model based on the constructed remote sensing image dataset.
[0060] Step 4: Apply the trained deep learning network model to perform remote sensing image segmentation. Input the remote sensing image to be segmented into the trained model to obtain the final semantic segmentation result of the remote sensing image.
[0061] The deep learning network model employs an encoder-decoder architecture. A multi-scale convolutional attention network is chosen as the backbone encoder for efficient feature extraction. The decoder consists of three parts, as detailed below: The frequency-sensitive attention module enhances features of different frequencies in the output feature maps of different levels of the encoder. Context-aware Transformer blocks are stacked in the decoder. Each context-aware Transformer block contains a global-local hybrid attention layer and a gated convolutional feedforward network layer. The decoding head based on a multilayer perceptron fuses features from different levels of the decoder using the multilayer perceptron to obtain the final segmentation result.
[0062] Figure 5 illustrates the semantic segmentation effect of remote sensing images using the method described in this application, showing the image and its corresponding segmentation result. Compared with related remote sensing image segmentation techniques, the method has the following advantages: Through the frequency-sensitive attention module, the model can selectively enhance high-frequency information that is crucial to the segmentation boundary, effectively alleviating the problem of blurred ground object boundaries in complex scenes and significantly improving segmentation integrity. The context-aware Transformer block combines global attention and local convolution operations, achieving powerful context modeling capabilities while maintaining low computational cost. Efficient global-local information fusion enables the model both to understand the global semantics of the image and to accurately capture local details, thereby achieving better segmentation accuracy on remote sensing data from various sensors and scales.
[0063] This application proposes a novel encoder-decoder architecture that effectively integrates the local feature extraction capabilities of CNNs and the global context modeling capabilities of Transformers. By reintroducing local inductive biases through a gated convolutional feedforward network, it significantly improves the ability to delineate and recognize the boundaries of fine-grained features (such as roads, rivers, and buildings) in remote sensing images. Furthermore, it introduces a frequency-sensitive attention mechanism to explicitly enhance high-frequency detail information that is crucial for segmentation tasks.
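The gated convolutional feedforward network mentioned above can be illustrated with a short NumPy sketch. It assumes two parallel branches, each a pointwise (1x1) convolution followed by a depthwise convolution, with a sigmoid applied to the second branch to produce gate weights; the choice of sigmoid, the function names, and the parameter keys are assumptions made for illustration only.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution. x: (C_in, H, W), w: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def depthwise_conv(x, kernels):
    """Per-channel k x k filtering with zero padding. kernels: (C, k, k)."""
    c, h, w = x.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv_ffn(x, p):
    # First branch: intermediate features.
    feat = depthwise_conv(pointwise_conv(x, p["w_point_a"]), p["w_depth_a"])
    # Second branch: gate weights in (0, 1) (sigmoid is an assumed activation).
    gate = sigmoid(depthwise_conv(pointwise_conv(x, p["w_point_b"]), p["w_depth_b"]))
    # Element-wise gating selects which features are propagated.
    return gate * feat

rng = np.random.default_rng(0)
C = 4
x = rng.standard_normal((C, 8, 8))
params = {
    "w_point_a": rng.standard_normal((C, C)), "w_depth_a": rng.standard_normal((C, 3, 3)),
    "w_point_b": rng.standard_normal((C, C)), "w_depth_b": rng.standard_normal((C, 3, 3)),
}
out = gated_conv_ffn(x, params)
print(out.shape)
```

The depthwise convolutions are what reintroduce a local inductive bias: each output value depends only on a small spatial neighborhood of the same channel.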
[0064] Example 2.
[0065] This embodiment provides an electronic device whose specific implementation and technical effects are consistent with those described in the above method embodiments; some details are not repeated here. The electronic device includes one or more processors and a memory; one or more programs are stored in the memory and are configured to be executed by the one or more processors to perform any of the methods described above.
[0066] Example 3.
[0067] This embodiment provides a computer-readable storage medium whose specific implementation and technical effects are consistent with those described in the above method embodiments; some details are not repeated here. A computer program is stored on the storage medium, and when the computer program is executed by at least one processor, it implements the steps of any of the methods described above.
[0068] Example 4.
[0069] This embodiment provides a computer program product whose specific implementation and technical effects are consistent with those described in the above method embodiments; some details are not repeated here. The computer program product includes a computer program which, when executed by at least one processor, implements the steps of any of the methods described above.
[0070] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the embodiments of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the protection scope of this application.
Claims
1. A remote sensing image segmentation method, characterized in that the method includes the following steps: a remote sensing image to be segmented is acquired, and a trained deep learning network model is used to obtain the semantic segmentation result of the remote sensing image to be segmented. The deep learning network model includes an encoder-decoder architecture. The encoder is a feature extraction encoder based on a multi-scale convolutional attention network, used to extract multi-scale features. The decoder includes a frequency-sensitive attention module, a context-aware Transformer block, and a decoder head based on a multilayer perceptron. The output feature maps of different levels of the encoder are passed through skip connections and input into the corresponding frequency-sensitive attention module for frequency decomposition, to obtain output information in which high-frequency components are enhanced. The output information of the frequency-sensitive attention module is input into the stacked context-aware Transformer blocks for cascaded context enhancement processing to form multi-level decoded features. The decoder head is used to fuse the multi-level decoded features through a multilayer perceptron to obtain the semantic segmentation result of the remote sensing image.
2. The remote sensing image segmentation method according to claim 1, characterized in that, The training process of the deep learning network model includes: A training dataset is constructed by acquiring remote sensing image data containing target features and preprocessing it. Pixel-level semantic annotation is performed on the preprocessed remote sensing image data to generate label data. The remote sensing image data and the corresponding label data are divided into a training set and a validation set. Data augmentation is performed on the data in the training set, including one or more of random scaling, random horizontal flipping, random vertical flipping, and random rotation. The deep learning network model is iteratively trained based on the data-augmented training dataset, and the model performance during the training process is verified using the validation set.
3. The remote sensing image segmentation method according to claim 1, characterized in that, The frequency-sensitive attention module is used to explicitly decompose the input feature map into high-frequency components and low-frequency components, assign weights to the high-frequency components and low-frequency components according to the local energy response and perform weighted summation, and fuse the weighted high-frequency components and low-frequency components to selectively enhance key frequency information.
4. The remote sensing image segmentation method according to claim 1, characterized in that, The context-aware Transformer block includes a global-local hybrid attention layer and a gated convolutional feedforward network layer; the global-local hybrid attention layer is used to capture global contextual dependencies and local detailed features; the gated convolutional feedforward network layer is used to selectively propagate and enhance features.
5. The remote sensing image segmentation method according to claim 4, characterized in that the global-local hybrid attention layer includes a global attention branch and a local attention branch; the global attention branch calculates global attention weights based on a self-attention mechanism; the local attention branch extracts local features and generates local attention weights through pointwise convolution and depthwise convolution operations; the global-local hybrid attention layer is also used to add the global attention weights and the local attention weights to obtain a hybrid attention output, which is used as the input feature of the gated convolutional feedforward network layer.
6. The remote sensing image segmentation method according to claim 5, characterized in that the gated convolutional feedforward network layer includes a first processing branch and a second processing branch; the first processing branch generates intermediate features through cascaded pointwise convolutional layers and depthwise convolutional layers; the second processing branch generates gate weights through cascaded pointwise convolutional layers, depthwise convolutional layers, and nonlinear activation function layers; the gate weights are multiplied element-wise with the intermediate features to obtain the enhanced output features.
7. The remote sensing image segmentation method according to claim 1, characterized in that the decoder head fusing the multi-level decoded features through a multilayer perceptron to obtain the semantic segmentation result of the remote sensing image includes: the multi-level decoded features are channel-aligned to obtain multi-level feature maps with the same number of channels; the multi-level feature maps are upsampled using bilinear interpolation to obtain multi-level feature maps of the same size; the multi-level feature maps of the same size are fused by element-wise addition, and convolutional mapping is then performed to obtain the semantic segmentation result of the remote sensing image.
8. An electronic device, characterized in that it includes one or more processors and a memory; one or more programs are stored in the memory and are configured to be executed by the one or more processors to perform the method according to any one of claims 1 to 7.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by at least one processor, it implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by at least one processor, implements the steps of the method of any one of claims 1 to 7.
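For illustration only, the fusion steps recited in claim 7 (channel alignment, bilinear upsampling to a common size, element-wise addition, and a final mapping to class scores) can be sketched in NumPy as follows. The sketch assumes per-pixel linear layers (1x1 convolutions) for channel alignment and for the final mapping, and corner-aligned bilinear sampling; the names `bilinear_resize`, `mlp_decode_head`, `align_weights`, and `cls_weight` are hypothetical and do not form part of the claims.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation of x (C, H, W), sampling with aligned corners."""
    c, h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def mlp_decode_head(features, align_weights, cls_weight):
    """features: list of (C_i, H_i, W_i) maps, highest resolution first.
    align_weights: list of (D, C_i) matrices; cls_weight: (num_classes, D)."""
    target_h, target_w = features[0].shape[1:]
    fused = 0.0
    for f, w in zip(features, align_weights):
        aligned = np.einsum("oc,chw->ohw", w, f)        # same channel count D
        fused = fused + bilinear_resize(aligned, target_h, target_w)
    logits = np.einsum("kc,chw->khw", cls_weight, fused)  # per-pixel class scores
    return logits.argmax(axis=0)                          # (H, W) label map

rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 16, 16)), rng.standard_normal((16, 8, 8))]
aligns = [rng.standard_normal((6, 8)), rng.standard_normal((6, 16))]
labels = mlp_decode_head(feats, aligns, rng.standard_normal((3, 6)))
print(labels.shape)
```

Here the label map takes the spatial size of the first (highest-resolution) feature map; upsampling further to the input image size would be a separate step.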