A remote sensing image segmentation method and system based on cross-paradigm feature fusion and alignment

By employing cross-paradigm feature fusion and alignment, the method addresses insufficient accuracy in remote sensing image segmentation. It achieves high-precision, robust recognition of fine-grained targets in complex scenes, in particular better detail capture when fusing multi-scale and multi-modal images.

CN120807934B · Active · Publication Date: 2026-06-23 · 耕宇牧星(北京)空间科技有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: 耕宇牧星(北京)空间科技有限公司
Filing Date: 2025-08-01
Publication Date: 2026-06-23


Abstract

The application discloses a remote sensing image segmentation method and system based on cross-paradigm feature fusion and alignment, comprising: preprocessing the input remote sensing image and extracting initial features; inputting the initial features into a cross-paradigm feature fusion and alignment network which, through sparse channel enhancement and spatial alignment together with spatial pixel refinement and channel alignment, fuses multi-modal, cross-scale structural information of the remote sensing image to obtain first-stage fused features; inputting the first-stage fused features into a multi-stage cross-paradigm enhanced feature extraction network which, through multi-level information interaction and a dynamic gating mechanism, fuses local details with global context information and progressively extracts a joint feature map that expresses semantics and spatial structure in a coordinated manner; and generating the final semantic segmentation result through a segmentation head and calculating a composite loss based on the real labels. By constructing a multi-stage feature extraction network and a cross-paradigm feature alignment mechanism, the application effectively fuses local texture, spatial context, and multi-modal information, strengthening segmentation performance while preserving computational efficiency.

Description

Technical Field

[0001] This invention relates to the field of remote sensing image processing technology, and more specifically to a remote sensing image segmentation method and system based on cross-paradigm feature fusion and alignment.

Background Technology

[0002] Remote sensing image segmentation, as a core technology in fields such as geographic information systems, urban planning, agricultural monitoring, and disaster assessment, has received widespread attention in recent years. Its goal is to perform semantic classification on pixels in high-resolution remote sensing images, accurately assigning each pixel to the corresponding land cover category, such as buildings, water bodies, roads, farmland, and forest land.

[0003] With the continuous improvement of remote sensing data quality and the diversification of remote sensing imaging methods, how to achieve high-precision, multi-scale, and structure-aware image segmentation in complex scenes has become one of the key challenges in current research.

[0004] Most current mainstream methods are based on convolutional neural network (CNN) or Transformer structures, which complete pixel-level classification of ground objects by extracting multi-level semantic information of images. However, traditional CNNs have the problem of limited receptive field, making it difficult to capture target information with long-distance dependencies in remote sensing images. While the standard Transformer has strong global modeling capabilities, it is not good at expressing local details and has problems of structural alignment and redundant computation when processing multi-scale targets or remote sensing images of different modalities (such as optical and SAR fusion).

[0005] In addition, remote sensing images present unique challenges such as uneven distribution of target categories, blurred boundaries, a large proportion of small targets, and overlapping areas at multiple scales. These factors together limit the accuracy of segmentation models, especially in edge details and minority class detection.

Summary of the Invention

[0006] In view of this, the present invention provides a remote sensing image segmentation method and system based on cross-paradigm feature fusion and alignment, which aims to improve the segmentation accuracy of complex land cover structures in remote sensing images, especially in challenging scenarios such as large differences in land cover scale, blurred boundaries, and uneven distribution of categories, to achieve robust perception and accurate identification of fine-grained target regions.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] A remote sensing image segmentation method based on cross-paradigm feature fusion and alignment includes the following steps:

[0009] S1. Preprocess the input remote sensing image and input it into the initial convolutional layer to extract low-level texture and edge features to obtain the initial features;

[0010] S2. Input the initial features into the constructed cross-paradigm feature fusion and alignment network, and through sparse channel enhancement and spatial alignment, as well as spatial pixel refinement and channel alignment, fuse multimodal and cross-scale remote sensing image structural information to obtain the first-stage fused features;

[0011] S3. Input the first-stage fused features into the constructed multi-stage cross-paradigm enhanced feature extraction network, and through multi-level information interaction and dynamic gating mechanism, fuse local details and global context information to gradually extract joint feature maps that express semantics and spatial structure in a coordinated manner.

[0012] S4. Generate the final semantic segmentation result using the segmentation head based on the joint feature map, and calculate the composite loss based on the real labels to optimize the model parameters.

[0013] Preferably, step S2 includes the following:

[0014] S21. Perform linear transformation on the input features through a convolutional layer to extract preliminary semantic features;

[0015] S22. The initial semantic features are enhanced with a sparse attention mechanism and a spatial alignment strategy to improve the response of key regions, resulting in sparse channel features and sparse channel spatial weights.

[0016] S23. Enhance the spatial structure perception of the preliminary semantic features at the pixel level and selectively align channels, obtaining spatial enhancement features and spatial alignment weights;

[0017] S24. Perform weighted fusion of sparse channel features and spatial enhancement features respectively, and output the fusion result through linear transformation.

[0018] Preferably, step S22 includes the following:

[0019] S221. The initial semantic features are fed into four parallel sub-branches with different but complementary functions to form a multi-view feature perception mechanism, generating a query matrix Q, a key matrix K, a sparse attention index ρ, and a value matrix V.

[0020] S222. Construct a sparse attention map: reshape Q and K and take their dot product to obtain the initial attention matrix, filter the key attention scores according to the index ρ to obtain the sparse attention map, and apply it to V after Softmax activation to obtain the sparse channel features. The sparse channel features are then passed through a convolution, GELU activation, a second convolution, and Sigmoid to obtain the channel-space fusion weights.

[0021] Preferably, step S23 includes the following:

[0022] S231. Generate spatial alignment weights:

[0023]

[0024] S232. Spatial augmentation features are generated by combining global average pooling with weights to enhance the spatial response:

[0025]

[0026] S233. Generating spatial alignment weights based on spatial augmentation features:

[0027]

[0028] Preferably, step S3 includes the following:

[0029] S31. Construct a multi-scale information flow gating network, normalize the first-stage fusion features, then obtain the fusion features by forming a lightweight gating mechanism through two parallel sub-paths, then restore the channels through 1×1 convolution, and perform residual connection with the normalized features to obtain the first-stage low-level enhancement features.

[0030] S32. The first-stage low-level enhancement features are sequentially passed through two stacked cross-paradigm feature fusion and alignment networks and a multi-scale information flow gating network to obtain the second-stage enhancement features and the third-stage enhancement features. After splicing, they are further input into a stacked cross-paradigm feature fusion and alignment network and a multi-scale information flow gating network to obtain the high-order abstract features.

[0031] S33. The low-level enhanced features generated in the first stage are concatenated with the high-level abstract features generated in the last stage to form the final joint feature map.

[0032] Preferably, step S31, which uses two parallel sub-paths to form a lightweight gating mechanism to obtain the fused features, includes the following:

[0033] The normalized feature input is used to construct two parallel sub-paths. Path A compresses the channel dimension through 1×1 convolution and then extracts local context information through 3×3 depthwise separable convolution. Path B compresses the channel dimension through 1×1 convolution, and uses 3×3 depthwise separable convolution and GELU nonlinear activation to model nonlinear response. The outputs of path A and path B are multiplied point by point to form fused features.

[0034] Preferably, the composite loss calculated from the real labels includes pixel-level cross-entropy loss, boundary-aware loss, and class balance loss:

[0035] $L = \lambda_1 L_{CE} + \lambda_2 L_{BD} + \lambda_3 L_{CB}$

[0036] Where λ1, λ2, and λ3 are weighting coefficients, $L_{CE}$ is the pixel-level cross-entropy loss, $L_{BD}$ is the boundary-aware loss, and $L_{CB}$ is the class balance loss.

[0037] A remote sensing image segmentation system based on cross-paradigm feature fusion and alignment, based on the aforementioned remote sensing image segmentation method based on cross-paradigm feature fusion and alignment, includes: an image acquisition module, an initial feature extraction module, a first-stage feature fusion module, a multi-stage cross-paradigm enhanced feature extraction module, and a prediction and optimization module;

[0038] The image acquisition module is used to acquire remote sensing images and perform preprocessing.

[0039] The initial feature extraction module is used to extract low-level texture and edge features from the preprocessed remote sensing image through the initial convolutional layer to obtain initial features;

[0040] The first-stage feature fusion module is used to input the initial features into the constructed cross-paradigm feature fusion and alignment network and fuse multimodal and cross-scale remote sensing image structural information to obtain the first-stage fused features. The cross-paradigm feature fusion and alignment network includes a sparse channel enhancement and spatial alignment module as well as a spatial pixel refinement and channel alignment module.

[0041] The multi-stage cross-paradigm enhanced feature extraction module is used to input the first-stage fused features into the constructed multi-stage cross-paradigm enhanced feature extraction network, and through multi-level information interaction and a dynamic gating mechanism, fuse local details and global context information to gradually extract joint feature maps that express semantics and spatial structure in a coordinated manner.

[0042] The prediction and optimization module is used to generate the final semantic segmentation result from the joint feature map through the segmentation head, and to calculate the composite loss based on the real labels to optimize the model parameters.

[0043] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned remote sensing image segmentation method based on cross-paradigm feature fusion and alignment.

[0044] A processing terminal includes a memory and a processor. The memory stores a computer program that can run on the processor. When the processor executes the computer program, it implements the aforementioned remote sensing image segmentation method based on cross-paradigm feature fusion and alignment.

[0045] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a remote sensing image segmentation method and system based on cross-paradigm feature fusion and alignment, which has the following beneficial effects:

[0046] (1) A multi-stage cross-paradigm enhanced feature extraction network is proposed. Through progressive information fusion, multi-scale gating mechanism and dynamic interaction between shallow and deep features, the model’s ability to express detailed features in complex scenes (such as building edges, road skeletons, water boundaries, etc.) is significantly enhanced. Compared with traditional single-scale or static fusion models, it has significant improvements in accuracy and robustness.

[0047] (2) A cross-paradigm feature fusion and alignment network is constructed, introducing sparse attention channel modeling and a spatial structure alignment mechanism, combined with boundary-aware supervision and a class balance loss design. This not only improves the model's ability to identify target edges and small sample classes, but also effectively alleviates the performance bottleneck caused by "inter-class inhomogeneity" and "scale inconsistency" in remote sensing images, and improves the overall segmentation consistency and generalization ability.

Attached Figure Description

[0048] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0049] Figure 1 is a schematic diagram of the remote sensing image segmentation method based on cross-paradigm feature fusion and alignment provided by the present invention;

[0050] Figure 2 is a schematic diagram of the cross-paradigm feature fusion and alignment network provided by the present invention.

Detailed Implementation

[0051] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0052] This invention discloses a remote sensing image segmentation method based on cross-paradigm feature fusion and alignment. As shown in Figure 1, the method includes the following steps:

[0053] S1. The input remote sensing image I is preprocessed and fed into the initial convolutional layer to extract low-level texture and edge features, yielding the initial feature $X_0 = \mathrm{Conv}(I)$, where Conv denotes the convolutional layer;

[0054] S2. Input the initial feature X0 into the constructed cross-paradigm feature fusion and alignment network. Through sparse channel enhancement and spatial alignment, as well as spatial pixel refinement and channel alignment, fuse multimodal and cross-scale remote sensing image structural information to obtain the first-stage fused features.

[0055] S3. Input the first-stage fused features into the constructed multi-stage cross-paradigm enhanced feature extraction network, and through multi-level information interaction and dynamic gating mechanism, fuse local details and global context information to gradually extract joint feature maps that express semantics and spatial structure in a coordinated manner.

[0056] S4. Generate the final semantic segmentation result using the segmentation head based on the joint feature map, and calculate the composite loss based on the real labels to optimize the model parameters.

[0057] To further implement the above technical solution, as shown in Figure 2, step S2 specifically includes:

[0058] S21. Perform linear transformation on the input features through a convolutional layer to extract preliminary semantic features;

[0059] Let the input features have height H, width W, and C channels; the preliminary semantic features are obtained by applying the convolutional layer to these input features.

[0060] S22. The initial semantic features are enhanced with a sparse attention mechanism and a spatial alignment strategy to improve the response of key regions, resulting in sparse channel features and sparse channel spatial weights.

[0061] S23. Enhance spatial structure perception capability for preliminary semantic features at the pixel level, and selectively align channels, spatial enhancement features and spatial alignment weights;

[0062] S24. Perform weighted fusion of sparse channel features and spatial enhancement features respectively, and output the fusion result through linear transformation.

[0063] To further implement the above technical solution, the specific content of step S22 includes:

[0064] S221. The initial semantic features are fed into four parallel sub-branches with different but complementary functions to form a multi-view feature perception mechanism, generating a query matrix Q, a key matrix K, a sparse attention index ρ, and a value matrix V.

[0065] Specifically:

[0066] The first branch generates the query matrix Q. This branch effectively compresses computational complexity while maintaining spatial awareness by concatenating standard convolutions and depthwise separable convolutions. Its core objective is to extract latent attention query vectors from the input features, capturing structural information with significant differences in the image. The query matrix Q plays a role in focusing the target region in the attention mechanism and is the key starting point for matching relationships;

[0067] The second branch generates the key matrix K. This branch has the same structure as the first, employing a convolutional + depthwise separable convolution structure. Its purpose is to form key vectors semantically aligned with the query vector, capturing response locations in the entire image similar to the target region. In remote sensing images, similar land cover categories may appear in different spatial locations, so this branch helps to achieve spatial matching at the semantic level;

[0068] The third branch is used to generate the sparse attention index ρ. This branch adaptively learns the distribution probability of important attention channels through layer-by-layer compression and nonlinear mapping operations, thereby eliminating low-value attention connections. The entire process is represented as follows:

[0069]

[0070] Wherein, Norm(·) is the normalization operation, σ represents the Sigmoid activation function, Flatten is the flattening operation, and GAP represents global average pooling; this structure effectively captures the importance scores across channels and plays the role of "channel gating" in the sparse attention mechanism, so that the computation focuses on high contribution channels, improving computational efficiency and model discriminative power;

[0071] The fourth branch generates the value matrix V. This branch also uses a concatenated form of convolution and depthwise separable convolution, with the aim of generating a value vector for weighted use in the attention matrix. This vector contains the spatial semantics and contextual features of the image.

[0072]

[0073] During cross-region feature fusion, V is the carrier of the feature response, so the quality of the information it encodes directly affects the representation obtained after attention enhancement.

[0074] S222. Construct a sparse attention map: reshape Q and K and take their dot product to obtain the initial attention matrix. Based on the index ρ, the key attention scores are filtered to obtain the sparse attention map, which is applied to V after Softmax activation to obtain the sparse channel features. The sparse channel features are then passed through a convolution, GELU activation, a second convolution, and Sigmoid to obtain the channel-space fusion weights.
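The filtering in S222 can be illustrated with a minimal NumPy sketch. The text does not specify how ρ selects scores, so this sketch assumes ρ is a scalar keep ratio and that each row of the channel-by-channel attention matrix keeps only its largest scores; the scaling by the square root of H·W, the function name `sparse_channel_attention`, and the tensor layout (C, H, W) are likewise illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf map to exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_channel_attention(q, k, v, rho):
    """q, k, v: (C, H, W) branch outputs; rho: assumed keep ratio in (0, 1]."""
    c, h, w = q.shape
    q2, k2, v2 = (t.reshape(c, h * w) for t in (q, k, v))
    # Initial attention matrix between channels, shape (C, C).
    scores = q2 @ k2.T / np.sqrt(h * w)
    # Assumption: keep the ceil(rho * C) largest scores in every row.
    keep = max(1, int(np.ceil(rho * c)))
    thresh = np.sort(scores, axis=1)[:, -keep][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    attn = softmax(masked, axis=1)        # sparse attention map
    out = attn @ v2                       # sparse channel features
    return out.reshape(c, h, w), attn

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4, 4)) for _ in range(3))
features, attn = sparse_channel_attention(q, k, v, rho=0.25)
print(features.shape, (attn > 0).sum(axis=1))
```

Masking with negative infinity before Softmax removes the discarded connections entirely, so each row of the attention map still sums to one over the retained channels.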

[0075] To further implement the above technical solution, the specific content of step S23 includes:

[0076] S231. Generate spatial alignment weights:

[0077]

[0078] S232. Spatial augmentation features are generated by combining global average pooling with weights to enhance the spatial response:

[0079]

[0080] S233. Generating spatial alignment weights based on spatial augmentation features:

[0081]

[0082] In this embodiment, the fusion result output in step S24 is:

[0083]

[0084] Where ⊙ denotes element-wise multiplication and the remaining operator denotes the addition of features.

[0085] In this embodiment, the cross-paradigm feature fusion and alignment network is designed to better integrate feature representations of different scales and modalities (such as optical and SAR) in remote sensing images, thereby improving the segmentation capability for fine-grained structures and semantic regions.

[0086] To further implement the above technical solution, step S3 includes the following:

[0087] S31. Construct a multi-scale information flow gating network. The first-stage fused features are first normalized with Norm(·), where Norm denotes the normalization operation, to give the normalized feature $X_n$. A lightweight gating mechanism formed by two parallel sub-paths then produces the fused feature $X^{+}$, which is passed through a 1×1 convolution for channel restoration and residually connected with the normalized feature to obtain the first-stage low-level enhanced feature. This operation improves the stability and semantic consistency of feature delivery;

[0088] S32. Multi-stage alternating fusion: The first stage low-level enhancement features are sequentially passed through two stacked cross-paradigm feature fusion and alignment networks and multi-scale information flow gating networks to obtain the second stage enhancement features and the third stage enhancement features. After splicing, they are further input into a stacked cross-paradigm feature fusion and alignment network and multi-scale information flow gating network to obtain high-order abstract features.

[0089] In this embodiment, the multi-stage alternating stacked fusion structure enhances features progressively layer by layer. Specifically, the enhanced features output from the previous stage are first passed through a cross-paradigm feature fusion and alignment network to extract more complete contextual semantics. Subsequently, the features are entered into a multi-scale information flow gating network to enhance adaptive perception of spatial scale changes. After the first round of fusion, the intermediate features obtained are processed again to form the second round of enhanced features. After completing the two rounds of stacking, the features extracted from these two stages are concatenated to form a composite feature expression with multi-scale response. The concatenated features are further input into a set of cross-paradigm feature fusion modules and information flow gating modules to extract higher-level abstract features, achieving deep fusion across levels and semantics.

[0090] S33. Global Feature Convergence and Output: After the multi-stage enhanced feature extraction is completed, the low-level enhanced features generated in the first stage are concatenated with the high-level abstract features generated in the last stage to form the final joint feature map. This joint feature map retains both shallow texture details and deep semantic expressions, and has rich spatial context information.

[0091] In this embodiment, this bottom-up feature convergence method can effectively improve the segmentation accuracy of complex scenes (such as urban buildings, road networks, water body boundaries, farmland areas, etc.) in remote sensing images, and shows better robustness and expressive ability, especially when processing images with large differences in the scale of ground objects and high boundary ambiguity.
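The data flow of steps S31 to S33 can be sketched as follows. This is only a wiring diagram in code: `fuse` stands in for the cross-paradigm feature fusion and alignment network and `gate` for the multi-scale information flow gating network, both passed in as hypothetical callables, and the sketch assumes each block preserves the spatial size of its input.

```python
import numpy as np

def multi_stage_features(x, fuse, gate):
    """x: first-stage fused features, shape (C, H, W).

    fuse / gate are stand-ins for the two network types; any callables
    mapping (C_in, H, W) -> (C_out, H, W) can be plugged in.
    """
    f1 = gate(x)                                  # S31: low-level enhanced features
    f2 = gate(fuse(f1))                           # S32: second-stage enhanced features
    f3 = gate(fuse(f2))                           # S32: third-stage enhanced features
    stacked = np.concatenate([f2, f3], axis=0)    # concatenate the two stages
    high = gate(fuse(stacked))                    # S32: high-order abstract features
    return np.concatenate([f1, high], axis=0)     # S33: joint feature map

# Identity stand-ins make the channel bookkeeping visible.
identity = lambda t: t
x = np.ones((4, 8, 8))
joint = multi_stage_features(x, fuse=identity, gate=identity)
print(joint.shape)  # (12, 8, 8)
```

With identity stand-ins the joint map has three times the input channels (4 from the first stage plus 8 from the concatenated later stages); a real implementation would presumably project channels inside the blocks, which the text does not detail.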

[0092] To further implement the above technical solution, step S31, which uses two parallel sub-paths to form a lightweight gating mechanism to obtain the fused features, includes the following:

[0093] The normalized feature input is used to construct two parallel sub-paths. Path A compresses the channel dimension through 1×1 convolution and then extracts local context information through 3×3 depthwise separable convolution. Path B compresses the channel dimension through 1×1 convolution, and uses 3×3 depthwise separable convolution and GELU nonlinear activation to model nonlinear response. The outputs of path A and path B are multiplied point by point to form fused features.

[0094] In this embodiment, the fusion feature obtained in step S31 is represented as follows:

[0095] $X^{+} = \mathrm{DSConv}(\mathrm{Conv}(X_n)) \odot \mathrm{GELU}(\mathrm{DSConv}(\mathrm{Conv}(X_n)))$

[0096] Here, ⊙ represents element-wise multiplication, and DSConv represents depthwise separable convolution. The two parallel sub-paths constitute a lightweight gating mechanism, which enhances the model's selective response to multi-scale heterogeneous targets.
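A minimal NumPy sketch of this gating mechanism is given below, assuming a (C, H, W) layout, zero padding, and a depthwise separable convolution built as a depthwise 3×3 filter followed by a pointwise 1×1 projection. The parameter dictionary keys (`wa`, `dwa`, `pwa`, `wb`, `dwb`, `pwb`, `w_out`) and the tanh approximation of GELU are illustrative choices, not taken from the patent.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def conv1x1(x, w):
    """x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def depthwise3x3(x, kernels):
    """x: (C, H, W), kernels: (C, 3, 3); zero padding keeps H and W."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * xp[:, dy:dy + h, dx:dx + w]
    return out

def ds_conv(x, dw, pw):
    # Depthwise separable convolution: depthwise 3x3, then pointwise 1x1.
    return conv1x1(depthwise3x3(x, dw), pw)

def gated_fusion(xn, p):
    """xn: normalized features (C, H, W); p: dict of weights (hypothetical names)."""
    a = ds_conv(conv1x1(xn, p["wa"]), p["dwa"], p["pwa"])          # path A
    b = gelu(ds_conv(conv1x1(xn, p["wb"]), p["dwb"], p["pwb"]))    # path B
    x_plus = a * b                                                 # element-wise gate
    return xn + conv1x1(x_plus, p["w_out"])                        # restore channels, residual

rng = np.random.default_rng(0)
c, r = 8, 4  # full and compressed channel counts (assumed values)
params = {
    "wa": rng.standard_normal((r, c)), "dwa": rng.standard_normal((r, 3, 3)),
    "pwa": rng.standard_normal((r, r)),
    "wb": rng.standard_normal((r, c)), "dwb": rng.standard_normal((r, 3, 3)),
    "pwb": rng.standard_normal((r, r)),
    "w_out": rng.standard_normal((c, r)),
}
print(gated_fusion(rng.standard_normal((c, 6, 6)), params).shape)
```

Because path B passes through GELU before the product, it acts as a soft gate on the local context extracted by path A, while the residual connection keeps the normalized input available to later stages.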

[0097] To further implement the above technical solution, the composite loss calculated from the real labels includes pixel-level cross-entropy loss, boundary-aware loss, and class balance loss:

[0098] $L = \lambda_1 L_{CE} + \lambda_2 L_{BD} + \lambda_3 L_{CB}$

[0099] Where λ1, λ2, and λ3 are weighting coefficients, $L_{CE}$ is the pixel-level cross-entropy loss, $L_{BD}$ is the boundary-aware loss, and $L_{CB}$ is the class balance loss.

[0100] In this embodiment, in order to map the feature map to a semantic category probability map, a lightweight segmentation head is used, consisting of a series of convolutional layers, upsampling layers, and a softmax layer. $P \in \mathbb{R}^{H \times W \times C}$ denotes the final output prediction map of the model, where H and W are the image dimensions and C is the number of land cover categories; the vector $P_i$ at each pixel location gives the predicted probability that the pixel belongs to each class.
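The last two stages of such a head, upsampling and softmax, can be sketched as below. The convolutional layers are omitted, nearest-neighbour upsampling is an assumed choice (the text does not name the upsampling method), and `segmentation_head_output` is a hypothetical helper name.

```python
import numpy as np

def segmentation_head_output(logits, scale=1):
    """logits: (h, w, C) class scores; returns P (H, W, C) and a label map (H, W)."""
    # Assumed nearest-neighbour upsampling back to the image resolution.
    up = np.repeat(np.repeat(logits, scale, axis=0), scale, axis=1)
    # Softmax over the class axis gives a probability vector per pixel.
    z = up - up.max(axis=-1, keepdims=True)
    e = np.exp(z)
    p = e / e.sum(axis=-1, keepdims=True)
    return p, p.argmax(axis=-1)

logits = np.random.default_rng(0).standard_normal((2, 2, 3))
p, labels = segmentation_head_output(logits, scale=2)
print(p.shape, labels.shape)  # (4, 4, 3) (4, 4)
```

Taking the argmax over the class axis turns the probability map into the predicted land cover label of each pixel.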

[0101] The pixel-level cross-entropy loss is as follows. To achieve basic classification accuracy, a conventional multi-class cross-entropy loss function supervises the prediction probability map P against the ground truth label map G pixel by pixel, where each pixel value of G is the actual land cover category number. The cross-entropy loss is:

[0102] $L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} \mathbb{1}(G_i = c) \log P_{i,c}$

[0103] Where N = H × W and $\mathbb{1}(G_i = c)$ is an indicator function: if the true class of the i-th pixel is c, this term is 1, and otherwise it is 0.
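Since the indicator function selects only the probability predicted for the true class of each pixel, the loss reduces to the negative log-probability of the true class. A minimal sketch follows, assuming the sum is averaged over the N = H × W pixels and adding a small `eps` (an illustrative safeguard against log(0), not part of the source):

```python
import numpy as np

def cross_entropy(p, g, eps=1e-12):
    """p: (H, W, C) class probabilities; g: (H, W) integer class ids."""
    h, w, c = p.shape
    n = h * w
    # Pick, for every pixel, the probability assigned to its true class.
    p_true = p.reshape(n, c)[np.arange(n), g.reshape(n)]
    return float(-np.log(p_true + eps).mean())

g = np.array([[0, 1], [2, 3]])
uniform = np.full((2, 2, 4), 0.25)
print(cross_entropy(uniform, g))  # approximately log(4)
```

A uniform prediction over four classes gives a loss of about log 4, and a confident correct prediction drives the loss toward zero.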

[0104] The introduction of boundary-aware loss and class balance loss aims to further improve the model's ability to recognize small object boundaries and slender structures (such as roads and rivers), while also addressing the class imbalance problem in remote sensing images. Specifically:

[0105] Boundary-aware loss: To enhance the model's sensitivity to ground feature edge structures, a boundary loss mechanism is introduced. Specifically, edge extraction operators (such as Sobel or Laplacian) are used to generate boundary masks $E_G$ and $E_P$ for the ground truth label map G and the predicted map P, respectively, and the binary cross-entropy loss between $E_G$ and $E_P$ is calculated:

[0106]

[0107] Here, $E_G = \mathrm{EdgeDetect}(G)$ denotes edge detection on the ground truth label map, and $E_P = \mathrm{EdgeDetect}(\arg\max(P))$ denotes edge detection on the predicted class map. This loss term improves the model's prediction accuracy at pixel boundaries, especially for small targets and regions with significant structural changes;
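One possible reading of this loss is sketched below with a Sobel operator. The binarization rule (any non-zero gradient magnitude counts as a boundary pixel), the edge-replicating padding, the clipping constant `eps`, and the averaging over all pixels are assumptions made for the sketch, since the source formula is not reproduced here.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Apply a 3x3 filter with edge-replicating padding (keeps the image size)."""
    h, w = img.shape
    padded = np.pad(img.astype(float), 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def edge_detect(label_map):
    # Assumed rule: a pixel is a boundary pixel if the Sobel gradient is non-zero.
    gx = filter3x3(label_map, SOBEL_X)
    gy = filter3x3(label_map, SOBEL_Y)
    return (np.hypot(gx, gy) > 0).astype(float)

def boundary_loss(p, g, eps=1e-6):
    """p: (H, W, C) probabilities; g: (H, W) integer labels."""
    e_g = edge_detect(g)
    # Clip the binary predicted mask so the logarithms stay finite.
    e_p = np.clip(edge_detect(p.argmax(axis=-1)), eps, 1.0 - eps)
    bce = -(e_g * np.log(e_p) + (1.0 - e_g) * np.log(1.0 - e_p))
    return float(bce.mean())

g = np.zeros((6, 6), dtype=int)
g[:, 3:] = 1                          # vertical boundary between columns 2 and 3
good = np.eye(2)[g]                   # prediction that matches the labels
bad = np.eye(2)[np.zeros_like(g)]     # prediction that misses the boundary
print(boundary_loss(good, g), boundary_loss(bad, g))
```

Note that argmax has no useful gradient, so a trainable version would need a differentiable edge map (for example one computed from the class probabilities); the sketch only illustrates how the two boundary masks are compared.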

[0108] Class balance loss: To address the severely uneven class distribution in remote sensing images, Focal Loss or Dice Loss is introduced to strengthen learning on hard-to-classify samples and on under-represented classes. Taking Dice Loss as an example:

[0109]

[0110] Here, the per-pixel term denotes the probability that the i-th pixel is predicted to be its true class; the numerator is the overlap between the prediction and the label, and the denominator is the sum of the total number of predictions and the total number of true values. Dice Loss can effectively balance the influence of the foreground and background, and is especially suitable for small target detection scenarios (such as narrow roads, islands, aircraft, etc.);
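A common soft Dice formulation consistent with this description (overlap in the numerator, total predictions plus total true values in the denominator) is sketched below. The per-class averaging and the `smooth` constant are assumptions of this sketch and may differ from the patent's exact formula.

```python
import numpy as np

def dice_loss(p, g, smooth=1.0):
    """p: (H, W, C) class probabilities; g: (H, W) integer class ids."""
    c = p.shape[-1]
    onehot = np.eye(c)[g]                                   # (H, W, C) labels
    overlap = (p * onehot).sum(axis=(0, 1))                 # prediction-label overlap
    total = p.sum(axis=(0, 1)) + onehot.sum(axis=(0, 1))    # predictions + true values
    dice = (2.0 * overlap + smooth) / (total + smooth)      # per-class Dice score
    return float(1.0 - dice.mean())

g = np.zeros((4, 4), dtype=int)
g[:, 2:] = 1
print(dice_loss(np.eye(2)[g], g))       # 0.0 for a perfect prediction
print(dice_loss(np.eye(2)[1 - g], g))   # 16/17 when both classes are swapped
```

Because each class contributes one Dice score regardless of how many pixels it covers, a rare class weighs as much as a dominant one, which is why the loss helps with imbalanced land cover distributions.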

[0111] The final composite loss function is constructed to constrain model performance from multiple perspectives. The typical values of the weighting coefficients are λ1 = 1.0, λ2 = 0.5, and λ3 = 0.5. The combination of the three losses can optimize model performance from three perspectives: overall accuracy, boundary clarity, and inter-class balance. This composite mechanism is particularly suitable for remote sensing image tasks with multiple targets, multiple categories, and complex boundaries.
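As a worked example of the weighting, the sketch below combines three loss values with the typical coefficients stated above. The individual loss values (0.40, 0.20, 0.10) are made-up numbers for illustration only, and `composite_loss` is a hypothetical helper name.

```python
def composite_loss(l_ce, l_boundary, l_balance, lambdas=(1.0, 0.5, 0.5)):
    """Weighted sum of cross-entropy, boundary-aware, and class balance losses."""
    lam1, lam2, lam3 = lambdas
    return lam1 * l_ce + lam2 * l_boundary + lam3 * l_balance

# Hypothetical loss values: 1.0 * 0.40 + 0.5 * 0.20 + 0.5 * 0.10
total = composite_loss(0.40, 0.20, 0.10)
print(round(total, 2))  # 0.55
```

With these weights the cross-entropy term dominates, while the boundary and class balance terms each contribute at half weight.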

[0112] In this embodiment, the present invention proposes a multi-stage cross-paradigm enhancement fusion remote sensing image segmentation system, constructs a cross-paradigm feature extraction structure that integrates the advantages of Transformer and CNN, and achieves accurate modeling of complex features in remote sensing images through sparse attention mechanism, spatial alignment strategy, multi-scale information flow gating network, and boundary awareness and class balance supervision mechanism.

[0113] Specifically, a multi-stage cross-paradigm enhanced feature extraction network is constructed, which utilizes a multi-level, alternating information enhancement mechanism to integrate local texture and global context, and gradually extracts high-quality feature maps that co-express semantics and spatial structure. In this process, the dynamic gating mechanism can adjust features according to the scale of ground objects and semantic intensity, effectively addressing the expression bias of heterogeneous targets.

[0114] A cross-paradigm feature fusion and alignment network is introduced, combining convolutional paths and sparse attention paths. A four-branch structure is designed to model the query, key, value matrix and channel importance index respectively. Through spatial reconstruction and channel compression strategies, the spatial alignment capability and semantic fusion effect of multimodal information are improved. At the same time, by combining spatial pixel refinement and channel alignment modules, spatial structure perception enhancement and channel-level response difference calibration are achieved, effectively strengthening the model's sensitivity to complex texture regions and structural boundaries.

[0115] To address the challenges of small target detection and uneven class distribution commonly found in remote sensing images, a composite supervision mechanism is designed, including cross-entropy loss, boundary-aware loss, and class balance loss. The boundary-aware loss improves the model's responsiveness to pixel edge regions, while the class balance loss (such as Dice Loss or Focal Loss) enhances the model's robustness in minority class detection. The segmentation head employs a lightweight structure and combines Softmax probability mapping to output the final semantic segmentation result, balancing model accuracy and computational efficiency.

[0116] In summary, this invention provides a systematic solution to problems such as accuracy bottlenecks, blurred boundaries, and uneven class distribution in remote sensing image segmentation by integrating key technologies such as multi-stage progressive feature extraction, cross-paradigm fusion alignment, joint spatial channel control, and composite supervised optimization. It has broad practical application prospects and research value.

[0117] A remote sensing image segmentation system based on cross-paradigm feature fusion and alignment, which implements the remote sensing image segmentation method based on cross-paradigm feature fusion and alignment, includes: an image acquisition module, an initial feature extraction module, a first-stage feature fusion module, a multi-stage cross-paradigm enhanced feature extraction module, and a prediction and optimization module.

[0118] The image acquisition module is used to acquire remote sensing images and perform preprocessing.

[0119] The initial feature extraction module is used to extract low-level texture and edge features from the preprocessed remote sensing image through the initial convolutional layer to obtain initial features;

[0120] The first-stage feature fusion module is used to input the initial features into the constructed cross-paradigm feature fusion and alignment network, which fuses multimodal and cross-scale remote sensing image structural information to obtain the first-stage fused features. The cross-paradigm feature fusion and alignment network includes a sparse channel enhancement and spatial alignment module as well as a spatial pixel refinement and channel alignment module.

[0121] The multi-stage cross-paradigm enhanced feature extraction module is used to input the first-stage fused features into the constructed multi-stage cross-paradigm enhanced feature extraction network, which, through multi-level information interaction and a dynamic gating mechanism, fuses local details and global context information to gradually extract joint feature maps that express semantics and spatial structure in a coordinated manner.

[0122] The prediction and optimization module is used to generate the final semantic segmentation result from the joint feature map through the segmentation head, and to calculate the composite loss based on the real labels to optimize the model parameters.

[0123] A computer-readable storage medium having a computer program stored thereon that, when executed by a processor, implements a remote sensing image segmentation method based on cross-paradigm feature fusion and alignment.

[0124] A processing terminal includes a memory and a processor. The memory stores a computer program that can run on the processor. When the processor executes the computer program, it implements a remote sensing image segmentation method based on cross-paradigm feature fusion and alignment.

[0125] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. For similar or identical parts, the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.

[0126] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A remote sensing image segmentation method based on cross-paradigm feature fusion and alignment, characterized in that it includes the following steps:
S1. Preprocess the input remote sensing image and input it into the initial convolutional layer to extract low-level texture and edge features to obtain the initial features;
S2. Input the initial features into the constructed cross-paradigm feature fusion and alignment network, and through sparse channel enhancement and spatial alignment, as well as spatial pixel refinement and channel alignment, fuse multimodal and cross-scale remote sensing image structural information to obtain the first-stage fused features;
S3. Input the first-stage fused features into the constructed multi-stage cross-paradigm enhanced feature extraction network, and through multi-level information interaction and a dynamic gating mechanism, fuse local details and global context information to gradually extract joint feature maps that express semantics and spatial structure in a coordinated manner;
S4. Generate the final semantic segmentation result using the segmentation head based on the joint feature map, and calculate the composite loss based on the real labels to optimize the model parameters;
The specific content of step S3 includes:
S31. Construct a multi-scale information flow gating network: normalize the first-stage fused features, obtain the fused features through a lightweight gating mechanism formed by two parallel sub-paths, restore the channels through 1×1 convolution, and perform a residual connection with the normalized features to obtain the first-stage low-level enhanced features;
S32. Pass the first-stage low-level enhanced features sequentially through two stacked cross-paradigm feature fusion and alignment networks and a multi-scale information flow gating network to obtain the second-stage enhanced features and the third-stage enhanced features; after concatenation, input them into a stacked cross-paradigm feature fusion and alignment network and a multi-scale information flow gating network to obtain the high-level abstract features;
S33. Concatenate the low-level enhanced features generated in the first stage with the high-level abstract features generated in the last stage to form the final joint feature map;
In step S31, obtaining the fused features through the lightweight gating mechanism formed by two parallel sub-paths includes the following: the normalized features are input into two parallel sub-paths; path A compresses the channel dimension through 1×1 convolution and then extracts local context information through 3×3 depthwise separable convolution; path B compresses the channel dimension through 1×1 convolution, and uses 3×3 depthwise separable convolution and GELU nonlinear activation to model the nonlinear response; the outputs of path A and path B are multiplied point by point to form the fused features.

2. The remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to claim 1, characterized in that the specific content of step S2 includes: S21. Perform a linear transformation on the input features through a convolutional layer to extract preliminary semantic features; S22. Enhance the preliminary semantic features with a sparse attention mechanism and a spatial alignment strategy to improve the response of key regions, obtaining sparse channel features and sparse channel spatial weights; S23. Enhance the spatial structure perception capability of the preliminary semantic features at the pixel level, and selectively align the channels to obtain spatial enhancement features and spatial alignment weights; S24. Perform weighted fusion of the sparse channel features and the spatial enhancement features respectively, and output the fusion result through a linear transformation.

3. The remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to claim 2, characterized in that the specific content of step S22 includes: S221. Feed the preliminary semantic features into four parallel sub-branches with different but complementary functions to form a multi-view feature perception mechanism, generating a query matrix Q, a key matrix K, a sparse attention index ρ, and a value matrix V; S222. Construct a sparse attention map: reshape Q and K and perform a dot product to obtain an initial attention matrix, filter the key attention scores according to the index ρ to obtain the sparse attention map, and apply it to V after Softmax activation to obtain the sparse channel features; then pass the sparse channel features through convolution, GELU activation, convolution, and Sigmoid in sequence to obtain the channel-spatial fusion weights.

4. The remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to claim 2, characterized in that the specific content of step S23 includes: S231. Generate spatial alignment weights: [formula not reproduced in the source text]; S232. Generate spatial enhancement features by combining global average pooling with the weights to enhance the spatial response: [formula not reproduced in the source text]; S233. Generate spatial alignment weights based on the spatial enhancement features: [formula not reproduced in the source text].

5. The remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to claim 1, characterized in that, in step S4, the composite loss includes pixel-level cross-entropy loss, boundary-aware loss, and class balance loss:

$$L = \lambda_1 L_{ce} + \lambda_2 L_{bd} + \lambda_3 L_{cb}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are weighting coefficients, $L_{ce}$ is the pixel-level cross-entropy loss, $L_{bd}$ is the boundary-aware loss, and $L_{cb}$ is the class balance loss.

6. A remote sensing image segmentation system based on cross-paradigm feature fusion and alignment, characterized in that the system is based on the remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to any one of claims 1-5 and includes: an image acquisition module, an initial feature extraction module, a first-stage feature fusion module, a multi-stage cross-paradigm enhanced feature extraction module, and a prediction and optimization module. The image acquisition module is used to acquire remote sensing images and perform preprocessing. The initial feature extraction module is used to extract low-level texture and edge features from the preprocessed remote sensing image through the initial convolutional layer to obtain initial features. The first-stage feature fusion module is used to input the initial features into the constructed cross-paradigm feature fusion and alignment network, which fuses multimodal and cross-scale remote sensing image structural information to obtain the first-stage fused features. The cross-paradigm feature fusion and alignment network includes a sparse channel enhancement and spatial alignment module as well as a spatial pixel refinement and channel alignment module. The multi-stage cross-paradigm enhanced feature extraction module is used to input the first-stage fused features into the constructed multi-stage cross-paradigm enhanced feature extraction network, which, through multi-level information interaction and a dynamic gating mechanism, fuses local details and global context information to gradually extract joint feature maps that express semantics and spatial structure in a coordinated manner. The prediction and optimization module is used to generate the final semantic segmentation result from the joint feature map through the segmentation head, and to calculate the composite loss based on the real labels to optimize the model parameters.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to any one of claims 1-5.

8. A processing terminal, comprising a memory and a processor, wherein the memory stores a computer program executable on the processor, characterized in that, when the processor executes the computer program, it implements the remote sensing image segmentation method based on cross-paradigm feature fusion and alignment according to any one of claims 1-5.