Cross-level attention and dual-stream boundary enhancement semantic segmentation method for remote sensing images

By combining a cross-level attention fusion module and a dual-stream boundary enhancement module with a lightweight network, the problems of multi-scale feature fusion and boundary blurring in remote sensing images are solved, achieving high-precision and low-complexity semantic segmentation of remote sensing images.

CN122244448A · Pending · Publication Date: 2026-06-19 · CHINA UNIV OF MINING & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA UNIV OF MINING & TECH
Filing Date
2026-04-03
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Remote sensing images suffer from low efficiency in multi-scale feature fusion, blurred boundaries of complex features, and high computational complexity of existing models, making it difficult to meet the needs of rapid processing of large-scale remote sensing data.

Method used

By employing a cross-level attention fusion module (CLAF) and a two-stream boundary enhancement module (DSBE) in conjunction with a lightweight backbone network (EfficientNetV2-T), cross-scale feature fusion and boundary information enhancement are achieved through asymmetric attention mechanism and boundary flow construction, while maintaining low computational complexity.

Benefits of technology

It improves the segmentation accuracy of multi-scale targets, significantly improves the edge quality of complex ground features, and reduces the computational complexity of the model, making it suitable for large-scale remote sensing data processing.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images, belonging to the fields of remote sensing image processing and deep learning technology. The method involves collecting remote sensing images and their corresponding pixel-level labeled images to construct a remote sensing dataset; preprocessing the remote sensing images and pixel-level labeled images to construct a training dataset; constructing a semantic segmentation model for remote sensing images and training it using the training dataset; inputting the remote sensing image to be segmented into the trained semantic segmentation model, determining the land cover category for each pixel in the remote sensing image, and outputting a pixel-level category label matrix with the same spatial resolution as the input remote sensing image to obtain a semantic segmentation map. This method has simple steps and can simultaneously achieve high computational efficiency, high segmentation accuracy, and excellent boundary quality when processing remote sensing images.

Description

Technical Field

[0001] This invention relates to a cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images, belonging to the fields of remote sensing image processing and deep learning technology.

Background Technology

[0002] Semantic segmentation of remote sensing images is one of the core tasks in the field of Earth observation. It aims to classify ground features for each pixel in a remote sensing image, and is of great significance for urban planning, agricultural monitoring, and disaster assessment. In recent years, deep learning-based methods have made significant progress in this field, especially encoder-decoder architectures such as U-Net, which have become the mainstream paradigm.

[0003] Compared with natural scene images, remote sensing imagery presents unique challenges for existing technologies:

1. Inefficient multi-scale feature fusion. The scale differences among ground features in remote sensing imagery are large. Models like U-Net use skip connections to fuse shallow features (low-level details) from the encoder with deep features (high-level semantics) from the decoder. However, a significant "semantic gap" exists between these two types of features. Traditional concatenation or addition operations cannot effectively separate useful information from noise in these features, leading to feature redundancy and insufficient fusion, which particularly degrades the recognition accuracy of small targets. Although attention mechanisms have been introduced into segmentation networks, they typically employ symmetrical self-attention and fail to fully exploit the complementarity of high-level and low-level features in semantic and spatial information for cross-guidance.

2. Difficult recovery of complex feature boundaries. Feature boundaries in remote sensing imagery are often blurred due to mixed pixels or irregular shapes. In the encoding process of deep networks, repeated downsampling operations increase the receptive field but also lose precise spatial positioning information, resulting in blurred final segmentation boundaries. Relying on the backbone network alone to recover these details is often insufficient, while existing boundary optimization methods are typically computationally complex or dependent on cumbersome post-processing steps.

3. Balancing computational efficiency and performance. High-resolution remote sensing imagery data is massive and requires models with high throughput. In pursuit of performance, many advanced models have large parameter counts and high computational costs, making it difficult to meet the needs of rapid processing and deployment for large-scale remote sensing data.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, this invention proposes a cross-level attention and dual-stream boundary enhancement semantic segmentation method for remote sensing images. This method fuses cross-scale features through attention, accurately recovers complex ground object boundaries, and maintains low computational complexity.

[0005] To achieve the above technical objectives, this invention discloses a cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images, comprising the following steps:
S1. Collect remote sensing images and their corresponding pixel-level annotation maps to construct a remote sensing dataset;
S2. Preprocess the remote sensing images and pixel-level annotation maps to construct a training dataset;
S3. Construct a semantic segmentation model for remote sensing images and train the model using the training dataset;
S4. Input the remote sensing image to be segmented into the trained remote sensing image semantic segmentation model, determine the land cover category for each pixel in the remote sensing image, and output a pixel-level category label matrix with the same spatial resolution as the input remote sensing image to obtain a semantic segmentation map.

[0006] Furthermore, pixel-level annotation maps are used to characterize the land cover category to which each pixel in the remote sensing image belongs; the preprocessing of the remote sensing image and the corresponding pixel-level annotation image includes cropping, standardization and data augmentation.

[0007] Furthermore, the remote sensing image semantic segmentation model includes a sequentially connected encoder, bottleneck layer, decoder, dual-stream boundary enhancement module (DSBE), and segmentation head. The encoder consists of five encoding stages, through which it extracts five feature maps from the input remote sensing image. The bottleneck layer maps the deepest feature map to a unified decoding dimension to produce the initial decoding features. The decoder consists of four decoding stages connected in sequence after the initial decoding features. In each decoding stage, the decoded features from the previous stage and the encoder feature map at the corresponding scale are processed by the cross-level attention fusion module (CLAF) and then by the efficient context decoding module (ECD) to obtain the decoded output features of that stage. The DSBE module uses two shallow encoder feature maps to construct a boundary flow that represents boundary information in areas where different land cover categories meet, and fuses the boundary flow with the decoder output features to obtain boundary enhancement features. The segmentation head performs pixel-level classification on the boundary enhancement features to obtain the semantic segmentation map, which is a pixel-level category label matrix with the same spatial resolution as the input remote sensing image.

[0008] Furthermore, the encoder uses the EfficientNetV2-T model as the backbone network for feature extraction and outputs feature maps at five levels. The bottleneck layer is a standard convolutional block (StdConv) with a kernel size of 3×3, which maps the number of channels of the deepest feature map to the unified decoding dimension D and generates the initial decoding features, where D=192. In each decoding stage, the decoder inputs the features decoded in the previous stage and the encoder feature map at the corresponding scale into the CLAF module to obtain the fused features, and then inputs the fused features into the ECD module to obtain the decoded features of the current stage.

[0009] Furthermore, the Cross-Level Attention Fusion (CLAF) module includes an upsampling unit, a high-level feature projection block, a low-level feature projection block, a spatial attention generation branch, a channel attention generation branch, and a fusion refinement block. Its operation is as follows:
a1. The decoded features from the previous stage are taken as the high-level features, and the feature map output by the encoder at the corresponding scale is taken as the low-level feature map. The upsampling unit upsamples the high-level features by bilinear interpolation to the spatial resolution of the low-level feature map. There is one pair of high-level and low-level features for each of the four decoding stages;
a2. The high-level feature projection block and the low-level feature projection block apply 1×1 convolution StdConv projections to the upsampled high-level features and to the low-level feature map, respectively, to obtain the high-level projected features and the low-level projected features, both with a channel count equal to the decoding dimension D;
a3. The spatial attention generation branch applies a 3×3 convolution to the low-level projected features, followed by Sigmoid activation, to generate a spatial attention map. The spatial attention map is used to weight the high-level projected features element-wise, yielding the spatially guided high-level features;
a4. The channel attention generation branch applies adaptive global average pooling and a 1×1 convolution to the high-level projected features, followed by Sigmoid activation, to generate the channel attention weights. The channel attention weights are used to weight the low-level projected features element-wise, yielding the channel-guided low-level features;
a5. The spatially guided high-level features and the channel-guided low-level features are concatenated along the channel dimension and then fused by the fusion refinement block, a 3×3 convolutional StdConv, to output the fused features.
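
Steps a1 to a5 can be summarized as a short set of equations. This is a sketch under assumed notation rather than the patent's own formulas: X_high and X_low are the labels the description uses for Figure 2, F is the fused feature and D the decoding dimension, while the symbols H', L', A_s, A_c, H_s, L_c and the operator names (Up for bilinear upsampling, GAP for global average pooling, σ for Sigmoid, ⊙ for element-wise multiplication with broadcasting) are introduced here for illustration only.

```latex
\begin{aligned}
H' &= \mathrm{StdConv}_{1\times 1}\big(\mathrm{Up}(X_{\mathrm{high}})\big), &
L' &= \mathrm{StdConv}_{1\times 1}(X_{\mathrm{low}}), \\
A_s &= \sigma\big(\mathrm{Conv}_{3\times 3}(L')\big), &
H_s &= A_s \odot H', \\
A_c &= \sigma\big(\mathrm{Conv}_{1\times 1}(\mathrm{GAP}(H'))\big), &
L_c &= A_c \odot L', \\
F &= \mathrm{StdConv}_{3\times 3}\big(\mathrm{Concat}(H_s, L_c)\big).
\end{aligned}
```

Under this reading, H' and L' have D channels, A_s is a single-channel map with the spatial size of L', and A_c holds one weight per channel. The asymmetry is that the spatial map comes from the low-level branch but gates the high-level branch, while the channel weights come from the high-level branch but gate the low-level branch.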

[0010] Furthermore, the efficient context decoding module (ECD) includes a depthwise convolutional unit, a squeeze-and-excitation (SE) module, and a pointwise convolutional unit connected in sequence. Its operation is as follows:
b1. The depthwise convolutional unit applies a depthwise convolution (DwConv) to the fused features input from the CLAF module, where DwConv is a grouped convolution with a kernel size of 3×3 and a number of groups equal to the number of input channels. This is followed by batch normalization (BN) and Gaussian error linear unit (GELU) activation to obtain the depthwise features;
b2. The depthwise features are input to the SE module, which generates channel weights through global adaptive average pooling and two 1×1 convolutional layers, and uses these channel weights to weight the depthwise features element-wise, yielding the weighted features;
b3. The pointwise convolutional unit applies a pointwise convolution with a kernel size of 1×1 to the weighted features, followed by batch normalization (BN) and GELU activation, and outputs the decoded features of the current stage.
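
To make the complexity argument behind the depthwise and pointwise pairing concrete, the following sketch counts convolution weights for the ECD path at D=192 and compares them with a dense 3×3 convolution of the same width. It is an illustration under assumptions: the ECD input and output are both taken to have D channels, biases, BN parameters and the SE module are ignored, and the dense 3×3 alternative is a hypothetical comparison point that does not appear in the text.

```python
def conv_weights(in_ch, out_ch, kernel, groups=1):
    """Number of weights in a 2D convolution, ignoring any bias term."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    return (in_ch // groups) * out_ch * kernel * kernel


D = 192  # unified decoding dimension stated in the text

# b1: depthwise 3x3 convolution, one filter per channel (groups == channels)
depthwise = conv_weights(D, D, kernel=3, groups=D)
# b3: pointwise 1x1 convolution that mixes channels
pointwise = conv_weights(D, D, kernel=1)
separable = depthwise + pointwise

# Hypothetical dense 3x3 convolution with the same input and output width
dense = conv_weights(D, D, kernel=3)

print(depthwise, pointwise, separable, dense)  # 1728 36864 38592 331776
print(round(dense / separable, 2))             # 8.6
```

With these assumptions the separable path uses roughly one ninth of the weights of the dense convolution, which is the usual motivation for this design.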

[0011] Furthermore, the dual-stream boundary enhancement module (DSBE) includes a first boundary processing block, a second boundary processing block, a boundary fusion block, and a final fusion block, wherein the first and second boundary processing blocks respectively process two shallow feature maps output by the encoder to construct a boundary flow that characterizes the spatial details of the areas where different land cover categories meet. The working process is as follows:
c1. The two shallow feature maps output by the encoder are input to the first and second boundary processing blocks, respectively, and projected onto the boundary dimension via StdConv to obtain the first boundary feature and the second boundary feature;
c2. The second boundary feature is upsampled to the spatial resolution of the first boundary feature, the two are concatenated along the channel dimension, and the result is fused by the boundary fusion block to obtain the boundary flow;
c3. When the spatial resolution of the boundary flow differs from that of the backbone decoder output features, the boundary flow is upsampled to the same spatial resolution as the backbone decoder output features;
c4. The boundary flow and the backbone decoder output features are concatenated along the channel dimension and then fused by the final fusion block to obtain the boundary enhancement features.

[0012] Furthermore, the segmentation head includes refinement convolutions, a Dropout layer, and a 1×1 classification convolution to generate pixel-level classification logits; the logits are upsampled by bilinear interpolation to the same spatial resolution as the input remote sensing image to obtain the semantic segmentation map. The reduction ratio of the squeeze-and-excitation (SE) module is set adaptively according to the number of channels of the features input to the SE module; in a preferred embodiment, this number of channels equals D.

[0013] Furthermore, when training the remote sensing image semantic segmentation model, a composite loss function is used to constrain the semantic segmentation map against the ground-truth labels, the composite loss function being a weighted sum of a cross-entropy loss and an overlap loss. In the remote sensing image semantic segmentation model, the spatial attention generation branch of the CLAF module, the channel attention generation branch of the CLAF module, and the weight-generating output of the squeeze-and-excitation (SE) module adopt the Sigmoid function; the nonlinear activation function in the standard convolutional block StdConv, the ECD module, and the segmentation head is the Gaussian error linear unit (GELU); and all upsampling operations use bilinear interpolation.

[0014] A computer device includes a processor and a memory. The memory is used for storing instructions and data, and the processor is electrically connected to the memory and is used to execute the cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing imagery.

[0015] Beneficial effects:
1. Based on the characteristics of remote sensing images, the CLAF module achieves more effective cross-scale feature fusion, improving the segmentation accuracy of multi-scale targets;
2. The DSBE module explicitly models and enhances boundary information, significantly improving the quality of segmentation edges for complex features;
3. By adopting a lightweight backbone (EfficientNetV2-T) and the ECD module, the computational complexity of the model is significantly reduced, adapting to the needs of large-scale remote sensing data processing.

Attached Figure Description

[0016] Figure 1 is a schematic diagram of the structure of the remote sensing image semantic segmentation model in an embodiment of the present invention.

[0017] Figure 2 is a schematic diagram of the cross-level attention fusion module (CLAF) in an embodiment of the present invention.

[0018] Figure 3 is a schematic diagram of the structure of the efficient context decoding module (ECD) in an embodiment of the present invention.

[0019] Figure 4 is a schematic diagram of the dual-stream boundary enhancement module (DSBE) in an embodiment of the present invention.

[0020] Figure 5 is a schematic diagram showing the visualization results of the TerraFusionNet model of this invention and state-of-the-art models on the ISPRS Vaihingen dataset.

Detailed Implementation

[0021] The present invention will be further described and illustrated below with reference to the accompanying drawings and specific embodiments.

[0022] This invention discloses a cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images, comprising the following steps:
S1. Collect remote sensing images and their corresponding pixel-level annotation maps to construct a remote sensing dataset;
S2. Preprocess the remote sensing images and pixel-level annotation maps to construct a training dataset;
S3. Construct a semantic segmentation model for remote sensing images. The model adopts the TerraFusionNet architecture and is trained using the training dataset;
S4. Input the remote sensing image to be segmented into the trained remote sensing image semantic segmentation model, determine the land cover category for each pixel in the remote sensing image, and output a pixel-level category label matrix with the same spatial resolution as the input remote sensing image to obtain a semantic segmentation map.

[0023] The overall architecture and design principles of the remote sensing image semantic segmentation model are shown in Figure 1. The proposed TerraFusionNet employs a U-shaped encoder-decoder structure that takes a remote sensing image as input. Let D be the unified decoding dimension (Unified Decode Channels), with a preferred value of D=192.

[0024] Encoding stage (Encoder) and feature extraction: The encoder uses the efficient EfficientNetV2-T architecture as the backbone network.

[0025] The encoder extracts five hierarchical feature maps at different scales from the input image. The spatial resolution of the feature maps decreases progressively: relative to the input, the five levels have scales of 1/2, 1/4, 1/8, 1/16, and 1/32, respectively.
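
As a quick check of the scales listed above, the following sketch computes the spatial size of each of the five encoder feature maps for a given input size. It assumes that every stage halves the resolution with rounding up, which is the usual behavior of stride-2 convolutions with padding; the function name `encoder_feature_sizes` is hypothetical.

```python
import math


def encoder_feature_sizes(height, width, num_levels=5):
    """Spatial size of each encoder level, assuming scale 1/2**k at level k."""
    sizes = []
    for level in range(1, num_levels + 1):
        factor = 2 ** level
        sizes.append((math.ceil(height / factor), math.ceil(width / factor)))
    return sizes


# A 1024x1024 patch, the preferred tile size in the embodiment
for level, size in enumerate(encoder_feature_sizes(1024, 1024), start=1):
    print(f"level {level}: {size[0]}x{size[1]}")
```

For a 1024×1024 patch this gives 512, 256, 128, 64 and 32 pixels per side, so the deepest map passed to the bottleneck layer is 32×32.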

[0026] Bottleneck layer processing: The deepest feature map is processed by a bottleneck layer (composed of a 3×3 StdConv), which converts its channel count to the unified decoding dimension D=192 and generates the initial decoding features.

[0027] Decoding stage and core module integration: The decoder consists of four stages. As shown in Figure 1, each stage fuses high-level features (from the deeper decoder stage) with the corresponding low-level features (from the encoder). This invention uses a cross-level attention fusion (CLAF) module for feature fusion and an efficient context decoding (ECD) module for feature refinement: in each stage, the two inputs are first fused by the CLAF module, and the fused features are then refined by the ECD module to produce the decoded features of that stage.

[0028] As shown in Figure 2, the Cross-Level Attention Fusion (CLAF) module achieves cross-level guidance through an asymmetric attention mechanism:

1. Feature preparation and projection: First, the high-level features (X_high in Figure 2) are upsampled using bilinear interpolation so that they are spatially aligned with the low-level features (X_low in Figure 2). Then, each of the two features is channel-projected by a 1×1 convolutional projection block (StdConv) to unify the number of channels to the decoding dimension D, yielding the high-level projected features and the low-level projected features.

2. Asymmetric guidance mechanism:
(1) Spatial attention generation: The detail-rich low-level projected features are used to generate a spatial attention map. A 3×3 convolution maps the features to a single channel, which is then activated by the Sigmoid function. The spatial attention map is applied to the high-level projected features by element-wise weighting to obtain the spatially guided high-level features.
(2) Channel attention generation: The semantically rich high-level projected features are used to generate the channel attention weights. Adaptive global average pooling captures global information, followed by a 1×1 convolution and Sigmoid activation.

[0029] The channel attention weights are applied to the low-level projected features by element-wise weighting (element-wise multiplication) to obtain the channel-guided low-level features.

[0030] 3. Fusion and refinement: The spatially guided high-level features and the channel-guided low-level features are concatenated along the channel dimension and fused by the fusion refinement block (StdConv), which outputs the fused feature F.

As shown in Figure 3, the efficient context decoding module (ECD) efficiently processes the fused feature F output by the cross-level attention fusion module (CLAF). The ECD module consists of a depthwise convolutional unit, a squeeze-and-excitation (SE) module, and a pointwise convolutional unit, which together enhance channel representation capability while reducing computational complexity.

1. Depthwise convolution: The fused feature F is first processed by a depthwise convolution with a kernel size of 3×3 and a number of groups equal to the number of input channels, followed in sequence by batch normalization (BN) and Gaussian error linear unit (GELU) activation, yielding the depthwise features.

2. Squeeze-and-excitation (SE) channel weighting: The depthwise features are input to the SE module. The SE module first applies global adaptive average pooling, then passes the result through a first 1×1 convolution for dimensionality reduction, a nonlinear activation, and a second 1×1 convolution for dimensionality expansion, each convolution having its own learnable parameters, and finally generates the channel weights through Sigmoid activation. The reduction ratio is set adaptively according to the number of channels of the features input to the SE module; in a preferred embodiment, this number of channels equals D. The depthwise features are multiplied element-wise by the channel weights to obtain the weighted features.

3. Pointwise convolution: The weighted features are processed by a pointwise convolution with a kernel size of 1×1, followed by batch normalization (BN) and GELU activation, to output the decoded features of the current stage.

[0031] As shown in Figure 4, after the decoder has generated its output features, the Dual Stream Boundary Enhancement (DSBE) module is introduced to sharpen the segmentation boundaries. DSBE uses two high-resolution feature maps from the early stages of the encoder (shown by the red dashed line in Figure 1) to construct an independent boundary flow, and fuses the boundary flow with the backbone decoder output features P1 to obtain the boundary enhancement features:

1. Boundary information extraction: The first boundary processing block and the second boundary processing block process the two shallow feature maps, respectively, and project them onto a dedicated boundary dimension to obtain the first boundary feature and the second boundary feature.

2. Boundary flow fusion (Fuse Boundary): The second boundary feature is upsampled by bilinear interpolation to the spatial resolution of the first boundary feature, the two are concatenated along the channel dimension, and the result is fused by the boundary fusion block to generate a unified boundary flow.

3. Final fusion: When the spatial resolution of the boundary flow differs from that of the backbone decoder output features P1, the boundary flow is upsampled to the same spatial resolution as P1. The boundary flow and P1 are then concatenated along the channel dimension and processed by the final fusion block to obtain the boundary enhancement features.
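
The channel and resolution bookkeeping of the three DSBE steps can be traced with a small helper. This is only a sketch under stated assumptions: the two shallow maps are taken to be the 1/2 and 1/4 scale encoder outputs, P1 is taken to be at 1/2 scale, the boundary fusion block is assumed to output `boundary_dim` channels, and `boundary_dim=64` is an arbitrary placeholder because the preferred boundary dimension is not recoverable from the text. The function name `dsbe_trace` is hypothetical.

```python
def dsbe_trace(size_1, size_2, size_p1, decode_dim, boundary_dim):
    """Return (channels, height, width) after each DSBE step.

    size_1, size_2: (H, W) of the first and second shallow encoder maps.
    size_p1: (H, W) of the backbone decoder output P1.
    """
    # Step 1: both shallow maps are projected to the boundary dimension
    b1 = (boundary_dim, *size_1)
    b2 = (boundary_dim, *size_2)
    # Step 2: upsample the second feature to the first, concatenate, fuse
    b2_up = (boundary_dim, *size_1)
    stitched = (b1[0] + b2_up[0], *size_1)
    flow = (boundary_dim, *size_1)  # assumed output width of the fusion block
    # Step 3: align the flow with P1 if needed, then concatenate with P1
    if flow[1:] != tuple(size_p1):
        flow = (boundary_dim, *size_p1)
    merged = (flow[0] + decode_dim, *size_p1)
    return {"b1": b1, "b2": b2, "stitched": stitched, "flow": flow, "merged": merged}


# Hypothetical configuration for a 1024x1024 input
trace = dsbe_trace((512, 512), (256, 256), (512, 512), decode_dim=192, boundary_dim=64)
for name, shape in trace.items():
    print(name, shape)
```

The `merged` entry is the input to the final fusion block, whose output width is not specified in the text and is therefore left out of the trace.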

[0032] Segmentation head and output: As shown in Figure 1, the boundary enhancement features are fed into the segmentation head. The segmentation head contains refinement convolutions, Dropout, and a 1×1 classification convolution, and generates pixel-level classification logits (Output Logits). The logits are upsampled to the resolution of the original input image by bilinear interpolation, resulting in the final semantic segmentation map Y.

[0033] Training strategy and loss function: Before being used for actual semantic segmentation, the TerraFusionNet remote sensing image semantic segmentation model needs to be trained using labeled training data. To optimize the model parameters, this invention adopts an end-to-end training approach and uses a composite loss function to measure the difference between the final semantic segmentation map Y output by the model and the ground-truth label (Ground Truth). This loss function combines a cross-entropy loss, which focuses on pixel-level classification accuracy, with a Dice loss, which focuses on the overlap of predicted regions and helps address the class imbalance common in remote sensing imagery. The total loss function is a weighted sum of the two terms, with a hyperparameter that balances their weights (in a preferred embodiment, the hyperparameter is set to 0.5 or 1.0). The goal of model training is to minimize the total loss function using an optimization algorithm (such as the AdamW optimizer), thereby iteratively updating the network parameters until convergence.
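
A composite loss of this kind can be sketched in a few lines of plain Python. The exact formula is not recoverable from the text, so this sketch makes assumptions: the balancing hyperparameter (called `lam` here) multiplies the Dice term, the Dice loss is the soft multi-class variant averaged over classes with a smoothing constant, and predictions are given as per-pixel probability lists. All function names are hypothetical.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probs[i][c] is P(class c) at pixel i."""
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def dice_loss(probs, labels, num_classes, smooth=1.0):
    """One minus the soft Dice coefficient averaged over classes."""
    total = 0.0
    for c in range(num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        pred_sum = sum(p[c] for p in probs)
        true_sum = sum(1 for y in labels if y == c)
        total += (2 * intersection + smooth) / (pred_sum + true_sum + smooth)
    return 1.0 - total / num_classes


def composite_loss(probs, labels, num_classes, lam=0.5):
    """Assumed form: cross-entropy plus lam times Dice loss."""
    return cross_entropy(probs, labels) + lam * dice_loss(probs, labels, num_classes)


# Two pixels, two classes: a perfect prediction and a maximally uncertain one
labels = [0, 1]
perfect = [[1.0, 0.0], [0.0, 1.0]]
uncertain = [[0.5, 0.5], [0.5, 0.5]]
print(composite_loss(perfect, labels, num_classes=2))
print(composite_loss(uncertain, labels, num_classes=2))
```

The perfect prediction gives a loss of essentially zero, while the uncertain one is penalized by both terms. In practice the same quantities would be computed on logits with a tensor library, and the Dice term is what keeps rare classes such as cars from being dominated by large background classes.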

[0034] The above-described cross-level attention and dual-stream boundary enhancement semantic segmentation method for remote sensing images will be applied to a specific embodiment to demonstrate its technical effects.

[0035] Implementation and Application Verification of Remote Sensing Image Semantic Segmentation Model: To verify the effectiveness and superiority of the TerraFusionNet method proposed in this invention, this embodiment is implemented based on the open-source remote sensing segmentation framework GeoSeg, and detailed experimental verification is performed on the internationally recognized ISPRS Vaihingen 2D semantic segmentation dataset. The overall process can be divided into four stages: data preprocessing, model training, image semantic segmentation (testing), and result analysis.

[0036] 1. Dataset and Data Preprocessing Stage: This embodiment uses the ISPRS Vaihingen dataset, which contains high-resolution aerial orthophotos and corresponding pixel-level feature annotations. The experiment focuses on five main categories: impervious surfaces, buildings, low vegetation, trees, and cars. This embodiment follows the standard data processing workflow of the GeoSeg framework.

[0037] Step 1: Image Tiling. Because the original remote sensing images are very large, they cannot be fed directly to the GPU for training. This embodiment employs a sliding window strategy to crop the original image and its corresponding ground truth into fixed-size image patches, preferably 1024x1024 pixels. An appropriate overlap rate is set during cropping to ensure data continuity.
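
The sliding window cropping can be sketched as follows. The 1024-pixel tile size comes from the text; the stride of 768 pixels (25% overlap) is an assumed value, since the embodiment only says that an appropriate overlap rate is set, and the function names are hypothetical. The last window on each axis is shifted back so that it ends exactly at the image border.

```python
def tile_origins(length, tile, stride):
    """Start offsets of sliding windows along one axis that cover the full length."""
    if length <= tile:
        return [0]  # the image is smaller than one tile; padding would be needed
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # final window flush with the border
    return origins


def tile_boxes(height, width, tile=1024, stride=768):
    """(top, left, bottom, right) boxes for every tile; stride=768 is an assumption."""
    return [
        (top, left, top + tile, left + tile)
        for top in tile_origins(height, tile, stride)
        for left in tile_origins(width, tile, stride)
    ]


boxes = tile_boxes(2500, 2000)
print(len(boxes))
print(boxes[0], boxes[-1])
```

The same boxes are applied to the image and to its ground-truth map so that each patch keeps pixel-aligned labels.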

[0038] Step 2: Data Normalization. The cropped image patches are normalized. The mean and standard deviation of the training set images are calculated, and the pixel values of each channel are normalized to make their data distribution approximate a standard normal distribution, thus accelerating model convergence.
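
A minimal sketch of this per-channel standardization is shown below, using plain Python lists so that it runs without extra dependencies. The data layout (each image as a flat list of pixel tuples) and the function names are assumptions made for illustration; a real pipeline would use array operations.

```python
import math


def channel_stats(images):
    """Per-channel mean and standard deviation over all pixels of all images."""
    num_channels = len(images[0][0])
    count = sum(len(img) for img in images)
    means = [
        sum(px[c] for img in images for px in img) / count
        for c in range(num_channels)
    ]
    stds = [
        math.sqrt(sum((px[c] - means[c]) ** 2 for img in images for px in img) / count)
        for c in range(num_channels)
    ]
    return means, stds


def normalize(image, means, stds, eps=1e-8):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [
        tuple((v - m) / (s + eps) for v, m, s in zip(px, means, stds))
        for px in image
    ]


# Two tiny two-channel "training images" with two pixels each
train_images = [[(0, 10), (2, 10)], [(4, 10), (6, 10)]]
means, stds = channel_stats(train_images)
normalized = normalize(train_images[0], means, stds)
print(means, stds)
print(normalized)
```

The statistics are computed once on the training set and reused unchanged at test time, so that training and inference inputs share the same distribution.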

[0039] Step 3: Data Augmentation. To improve the model's generalization ability and prevent overfitting, this embodiment employs various online data augmentation strategies during the training phase, including random horizontal / vertical flipping, random rotation (e.g., 90 degrees, 180 degrees, 270 degrees), random scaling, and color perturbation.
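
The key constraint for geometric augmentation in segmentation is that the image and its label map receive exactly the same transform. The sketch below shows this for random flips and rotations by multiples of 90 degrees on single-channel grids; random scaling and color perturbation are left out, and the 0.5 flip probabilities and function names are assumptions rather than values from the text.

```python
import random


def hflip(grid):
    """Mirror left to right."""
    return [list(row[::-1]) for row in grid]


def vflip(grid):
    """Mirror top to bottom."""
    return [list(row) for row in grid[::-1]]


def rot90(grid):
    """Rotate 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment_pair(image, label, rng):
    """Apply the same random flips and rotation to an image and its label map."""
    if rng.random() < 0.5:
        image, label = hflip(image), hflip(label)
    if rng.random() < 0.5:
        image, label = vflip(image), vflip(label)
    for _ in range(rng.choice([0, 1, 2, 3])):  # 0, 90, 180 or 270 degrees
        image, label = rot90(image), rot90(label)
    return image, label


image = [[1, 2, 3], [4, 5, 6]]
label = [[10, 20, 30], [40, 50, 60]]  # label value is 10x the pixel value
aug_image, aug_label = augment_pair(image, label, random.Random(0))
print(aug_image)
print(aug_label)
```

Because both grids go through the same operations, every augmented label still sits on the pixel it described before the transform.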

[0040] 2. Model Training Phase: Step 1: Model Integration and Configuration. The TerraFusionNet model architecture proposed in this invention is integrated into the GeoSeg framework. The encoder is initialized using EfficientNetV2-T weights pre-trained on ImageNet, while the decoder (including the CLAF, ECD, and DSBE modules) is randomly initialized. The unified decoding dimension is set to D=192, and the boundary dimension is set as well.

[0041] Step 2: Training Parameter Settings. Build the training dataset loader and set a fixed batch size. Use AdamW as the optimizer. Employ either cosine annealing or polynomial decay as the learning rate strategy.
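
The two learning rate strategies named above follow standard closed-form schedules, sketched here for reference. The base learning rate, the minimum learning rate, the step count and the polynomial power of 0.9 are assumed example values; the text does not specify them.

```python
import math


def cosine_annealing(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def polynomial_decay(step, total_steps, base_lr, power=0.9, min_lr=0.0):
    """Polynomial decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + (base_lr - min_lr) * (1 - progress) ** power


base_lr, total = 1e-3, 100  # assumed example values
for step in (0, 25, 50, 75, 100):
    print(step, cosine_annealing(step, total, base_lr), polynomial_decay(step, total, base_lr))
```

Both schedules start at the base learning rate and reach the minimum at the final step; the cosine schedule decays slowly at first and faster in the middle, while the polynomial schedule with a power below 1 decays almost linearly.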

[0042] Step 3: Iterative Training. The model undergoes end-to-end iterative training. In each iteration, the model receives a batch of preprocessed and augmented training samples and performs forward propagation computation. Then, the total loss between the predicted result and the true label (a combination of cross-entropy loss and Dice Loss) is calculated according to formula (19). Finally, the gradient is calculated using the backpropagation algorithm, and the network parameters in the entire model are updated using the optimizer. Training continues until the preset number of iterations is reached or the model performance converges.

[0043] 3. Image Semantic Segmentation (Test Phase): Remote sensing images from the test set are input into the trained TerraFusionNet model for prediction. For large-format images, a sliding window strategy is used to crop the input, and the output prediction blocks are stitched together to obtain a complete image segmentation map. In overlapping window areas, the prediction results are fused using an averaging method to reduce boundary effects. Finally, the Argmax operation is applied to the probability map output by the model, and the class with the highest probability is selected as the final classification result for that pixel.
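
The stitching step can be illustrated with a small accumulator: class probabilities from every window are summed per pixel, divided by the number of windows covering that pixel, and the Argmax is taken over the averaged probabilities. The data layout (`probs[c][i][j]` per tile) and the function name are assumptions, and the sketch presumes that the windows together cover every pixel.

```python
def stitch_predictions(height, width, num_classes, tiles):
    """Average overlapping tile probabilities, then take the per-pixel argmax.

    tiles: iterable of (top, left, probs) with probs[c][i][j] the class-c
    probability at row i, column j inside the tile.
    """
    acc = [[[0.0] * width for _ in range(height)] for _ in range(num_classes)]
    hits = [[0] * width for _ in range(height)]
    for top, left, probs in tiles:
        tile_h, tile_w = len(probs[0]), len(probs[0][0])
        for i in range(tile_h):
            for j in range(tile_w):
                hits[top + i][left + j] += 1
                for c in range(num_classes):
                    acc[c][top + i][left + j] += probs[c][i][j]
    labels = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            mean = [acc[c][i][j] / hits[i][j] for c in range(num_classes)]
            labels[i][j] = max(range(num_classes), key=mean.__getitem__)
    return labels


# A 1x3 image covered by two 1x2 windows that overlap on the middle pixel
tile_a = (0, 0, [[[0.9, 0.6]], [[0.1, 0.4]]])
tile_b = (0, 1, [[[0.2, 0.3]], [[0.8, 0.7]]])
stitched = stitch_predictions(1, 3, 2, [tile_a, tile_b])
print(stitched)  # [[0, 1, 1]]
```

On the middle pixel the first window alone would predict class 0 (0.6 against 0.4), but the average with the second window gives 0.4 against 0.6, so the fused result is class 1.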

[0044] 4. Experimental Results and Analysis: To evaluate performance fairly, this embodiment conducted 10 independent repeated experiments and calculated the average performance metrics. The experiments compared TerraFusionNet with a baseline model (i.e., a U-shaped network without the CLAF, ECD, and DSBE modules) using the same backbone network and training strategy. Evaluation metrics included mean intersection over union (mIoU), mean F1 score (mF1), and overall accuracy (OA).
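
For reference, the three metrics can all be derived from a confusion matrix, as in the sketch below. The definitions are the standard ones (per-class IoU and F1 averaged over classes, and the fraction of correctly classified pixels for OA); the tiny label lists are made-up illustration data, not results from the patent, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return mIoU, mF1 and OA computed from a confusion matrix."""
    n = len(cm)
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp
        fn = sum(cm[c]) - tp
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        f1s.append(2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0)
    total = sum(sum(row) for row in cm)
    oa = sum(cm[c][c] for c in range(n)) / total
    return {"mIoU": sum(ious) / n, "mF1": sum(f1s) / n, "OA": oa}


# Made-up labels for six pixels and three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
metrics = segmentation_metrics(confusion_matrix(y_true, y_pred, 3))
print(metrics)
```

In this toy case the per-class IoU values are 1/3, 2/3 and 1/2, giving an mIoU of 0.5, while OA is 4/6 because four of the six pixels are correct.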

[0045] Table 1 presents the quantitative comparison results of the TerraFusionNet model of this invention and the baseline model on the ISPRS Vaihingen dataset. Table 2 presents the quantitative comparison results of the TerraFusionNet model of this invention and state-of-the-art models on the ISPRS Vaihingen dataset.

[0046] Figure 5 presents the visualization results of the TerraFusionNet model of this invention and state-of-the-art models on the ISPRS Vaihingen dataset.

[0047] Experimental results show that the TerraFusionNet proposed in this invention outperforms the baseline model on all key metrics. Specifically, TerraFusionNet achieves an average mIoU of 85.67%, an improvement of 0.77 percentage points over the baseline model's 84.90%. This improvement supports the effectiveness of the core modules introduced in this invention: the Cross-Level Attention Fusion (CLAF) module, through an asymmetric guidance mechanism, fuses multi-scale features more effectively and improves the recognition accuracy of various land features; the Two-Stream Boundary Enhancement (DSBE) module, through an independent boundary flow, makes the edges of the segmentation results clearer and more accurate; and the Efficient Context Decoding (ECD) module maintains the model's computational efficiency while ensuring performance. The experimental results demonstrate the advantage of the proposed method in remote sensing image semantic segmentation tasks.

[0048] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the invention. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, all technical solutions obtained through equivalent substitution or transformation fall within the protection scope of the present invention.

Claims

1. A cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images, characterized in that it includes the following steps: S1. Collect remote sensing images and their corresponding pixel-level annotation maps to construct a remote sensing dataset; S2. Preprocess the remote sensing images and pixel-level annotation maps to construct a training dataset; S3. Construct a semantic segmentation model for remote sensing images and train the model using the training dataset; S4. Input the remote sensing image to be segmented into the trained remote sensing image semantic segmentation model, determine the land cover category for each pixel in the remote sensing image, and output a pixel-level category label matrix with the same spatial resolution as the input remote sensing image to obtain a semantic segmentation map.

2. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 1, characterized in that the pixel-level annotation maps are used to characterize the land cover category to which each pixel in the remote sensing image belongs; the preprocessing of the remote sensing images and the corresponding pixel-level annotation maps includes cropping, standardization, and data augmentation.

3. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 1, characterized in that the remote sensing image semantic segmentation model includes an encoder, a bottleneck layer, a decoder, a dual-stream boundary enhancement module (DSBE), and a segmentation head connected in sequence; the encoder consists of five encoding stages, through which it extracts five levels of feature maps from the input remote sensing image; the bottleneck layer maps the deepest feature map to a unified decoding dimension to obtain the initial decoding features; the decoder consists of four decoding stages, in each of which the decoding features and the encoder feature map of the corresponding scale are processed by the cross-level attention fusion module (CLAF) and then by the efficient context decoding module (ECD) to obtain the decoded output features of that stage; the DSBE module uses two feature maps to construct a boundary flow that represents boundary information of the areas where different land cover categories meet, and fuses the boundary flow with the decoded features to obtain boundary enhancement features; the segmentation head performs pixel-level classification on the boundary enhancement features to obtain the semantic segmentation map, which is a pixel-level category label matrix whose spatial resolution is consistent with that of the input remote sensing image.

4. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 3, characterized in that the encoder uses the EfficientNetV2-T model as the backbone network for feature extraction and outputs five levels of feature maps; the bottleneck layer is a standard convolutional block StdConv with a kernel size of 3×3, which maps the number of channels of the deepest feature map to the unified decoding dimension D and generates the initial decoding features, where the unified decoding dimension D=192; in each decoding stage, the decoder inputs the features decoded in the previous stage and the encoder feature map of the corresponding scale into the CLAF module to obtain the fused features, and then inputs the fused features into the ECD module to obtain the decoding features of the current stage.

5. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 4, characterized in that the cross-level attention fusion module (CLAF) includes an upsampling unit, a high-level feature projection block, a low-level feature projection block, a spatial attention generation branch, a channel attention generation branch, and a fusion refinement block, and operates as follows: a1. the decoded features from the previous stage are taken as the high-level features, and the feature map output by the encoder at the corresponding scale is taken as the low-level feature map, so that the four decoding stages have four high-level features and four corresponding low-level feature maps; the upsampling unit upsamples the high-level features by bilinear interpolation so that their spatial resolution is consistent with that of the low-level feature map; a2. the high-level feature projection block and the low-level feature projection block apply 1×1 convolution StdConv projections to the upsampled high-level features and the low-level feature map, respectively, to obtain the high-level projected features and the low-level projected features, both of which have a channel count equal to the unified decoding dimension D; a3. the spatial attention generation branch applies a 3×3 convolution to the low-level projected features and generates a spatial attention map through Sigmoid activation; the spatial attention map is used to weight the high-level projected features element-wise to obtain the spatially guided high-level features; a4. the channel attention generation branch applies adaptive global average pooling and a 1×1 convolution to the high-level projected features and generates channel attention weights through Sigmoid activation; the channel attention weights are used to weight the low-level projected features element-wise to obtain the channel-guided low-level features; a5. the spatially guided high-level features and the channel-guided low-level features are concatenated along the channel dimension and then fused by the fusion refinement block, which is a 3×3 convolution StdConv, to output the fused features.

6. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 5, characterized in that the efficient context decoding module (ECD) comprises a depthwise convolutional unit, a squeeze-and-excitation module (SE), and a pointwise convolutional unit connected in sequence, and operates as follows: b1. the depthwise convolutional unit applies a depthwise convolution DwConv to the fused features input from the CLAF module, where DwConv is a grouped convolution with a kernel size of 3×3 and a number of groups equal to the number of input channels, followed by batch normalization (BN) and Gaussian error linear unit (GELU) activation, to obtain the depthwise-convolved features; b2. the depthwise-convolved features are input to the squeeze-and-excitation module SE, which generates channel weights through global adaptive average pooling and two 1×1 convolutional layers and uses these channel weights to weight the depthwise-convolved features element-wise to obtain the weighted features; b3. the pointwise convolutional unit applies a pointwise convolution with a kernel size of 1×1 to the weighted features, followed by batch normalization (BN) and Gaussian error linear unit (GELU) activation, to output the decoded features of the current stage.

7. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 6, characterized in that the dual-stream boundary enhancement module (DSBE) includes a first boundary processing block, a second boundary processing block, a boundary fusion block, and a final fusion block; the first and second boundary processing blocks respectively process two shallow feature maps output by the encoder to construct a boundary flow that characterizes the spatial details of the areas where different land cover categories meet; the working process is as follows: c1. the two shallow feature maps output by the encoder are input to the first and second boundary processing blocks, respectively, and projected to the boundary dimension via StdConv to obtain the first boundary feature and the second boundary feature; c2. the second boundary feature is upsampled to the spatial resolution of the first boundary feature, the two boundary features are concatenated along the channel dimension, and the result is fused by the boundary fusion block to obtain the boundary flow; c3. when the spatial resolution of the boundary flow is inconsistent with that of the backbone decoding output features, the boundary flow is upsampled to the same spatial resolution as the backbone decoding output features; c4. the boundary flow and the backbone decoding output features are concatenated along the channel dimension and then fused by the final fusion block to obtain the boundary enhancement features.

8. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 7, characterized in that the segmentation head includes a refinement convolution, a Dropout layer, and a 1×1 classification convolution, which generate pixel-level classification logits; the logits are then upsampled by bilinear interpolation to the same spatial resolution as the input remote sensing image to obtain the semantic segmentation map; the reduction ratio of the squeeze-and-excitation module SE is set adaptively based on the number of channels of the features input into the SE module; in a preferred embodiment =D.

9. The cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images according to claim 8, characterized in that, when training the remote sensing image semantic segmentation model, a composite loss function is used to constrain the semantic segmentation map against the ground-truth labels, the composite loss function being a weighted sum of cross-entropy loss and overlap loss; the spatial attention generation branch of the CLAF module, the channel attention generation branch of the CLAF module, and the weight generation output of the squeeze-and-excitation module SE in the remote sensing image semantic segmentation model adopt the Sigmoid function; the nonlinear activation function in the standard convolutional block StdConv, the ECD module, and the segmentation head is the Gaussian error linear unit GELU; the upsampling operations use bilinear interpolation.

10. A computer device, characterized in that it includes a processor and a memory, the processor being electrically connected to the memory, the memory being used to store instructions and data, and the processor being used to execute the cross-level attention and two-stream boundary enhancement semantic segmentation method for remote sensing images as described in any one of claims 1-9.