A lightweight remote sensing image reference segmentation method, system, device and medium

By combining Taylor pruning, a point-by-point weighted attention mechanism, and a bidirectional path enhancement module, the method achieves efficient and accurate remote sensing image reference segmentation on edge devices. It addresses redundant model parameters and high computational complexity while improving segmentation accuracy and efficiency.

CN122265657A · Pending · Publication Date: 2026-06-23 · HUBEI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUBEI UNIV
Filing Date
2026-05-27
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing remote sensing image reference segmentation techniques suffer from redundant model parameters, high computational complexity, difficulty of deployment on edge devices with limited computing resources, costly cross-modal interaction, and inefficient multi-scale target feature fusion, and therefore fail to meet real-time monitoring requirements.

Method used

Taylor pruning is used to make the visual encoder lightweight, and cross-modal feature fusion is performed by combining a point-by-point weighted attention mechanism with a bidirectional path enhancement module. A lightweight feature pyramid decoder is used for segmentation prediction.

Benefits of technology

It improves the accuracy of remote sensing image reference segmentation at low computational cost, reduces inference latency, enhances multi-scale contextual information interaction, and alleviates the loss of detail in small targets and the blurred boundaries of large targets, making it suitable for deployment on edge devices.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application provides a lightweight remote sensing image reference segmentation method, system, device and medium, belonging to the field of computer vision and intelligent remote sensing processing. The method includes: performing Taylor pruning on a visual encoder and extracting multi-scale visual features of a remote sensing image using the pruned visual encoder; extracting reference text features of a natural language description using a text encoder; multiplying the visual features with the reference text features element by element through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features; fusing the cross-modal fusion features in a top-down and bottom-up manner using a bidirectional path enhancement module to obtain enhanced fusion features; and performing decoding and segmentation prediction with a lightweight feature pyramid decoding head to obtain a reference segmentation result. The present application effectively reduces the number of model parameters and the computational cost, improves cross-modal alignment and multi-scale feature fusion capability, and balances lightweight design with segmentation accuracy.

Description

Technical Field

[0001] This invention relates to the field of computer vision and intelligent processing technology for remote sensing images, specifically to a lightweight remote sensing image reference segmentation method, system, device, and medium.

Background Technology

[0002] Remote sensing image referencing and segmentation is an emerging cross-modal understanding task. Its goal is to accurately locate and segment target objects in wide-swath, high-resolution remote sensing images based on given natural language descriptions. This technology has extremely high application value in fields such as disaster emergency rescue, urban planning monitoring, military target reconnaissance, and automated construction of geographic information systems.

[0003] However, existing remote sensing reference segmentation technologies face a technical bottleneck in practical applications: the trade-off between accuracy and efficiency. This manifests itself in the following aspects. First, redundant model parameters make deployment on edge devices difficult. Existing mainstream methods (such as CMPC+, LAVT, etc.) typically rely on large visual backbone networks (such as ResNet-101 or Swin-Base) and complex self-attention mechanisms to ensure accuracy. For example, the RMSIN model has over 240M parameters and requires more than 150 GFLOPs of computation. This size makes it difficult to deploy the model on edge devices with limited computing resources and power budgets, such as drones, satellite payloads, or handheld terminals. Second, cross-modal interaction is computationally expensive. Existing technologies often employ global multi-head self-attention mechanisms to align text and images. The computational complexity of these mechanisms grows quadratically with the image resolution, resulting in high inference latency and failing to meet the demands of real-time monitoring. Third, the multi-scale target feature fusion mechanisms are simplistic. Remote sensing images differ significantly from natural scene images, exhibiting drastic changes in target scale (from vehicles occupying a few pixels to airport runways occupying half the image). Existing feature pyramid network (FPN) structures often employ simple element-wise addition or concatenation, lacking adaptive judgment of the importance of features at different scales. This leads to the loss of detail in small targets and blurred boundaries in large targets.

[0004] Therefore, the most prominent problem with existing technologies is the redundancy of model computation and the low efficiency of cross-modal and multi-scale fusion, making it difficult to achieve low computational cost and high segmentation accuracy at the same time.

Summary of the Invention

[0005] In view of this, it is necessary to provide a lightweight remote sensing image reference segmentation method, system, device and medium to solve the technical problem in the prior art of balancing computational load against segmentation accuracy.

[0006] To address the aforementioned technical problems, in a first aspect, the present invention provides a lightweight remote sensing image referencing segmentation method, comprising: Taylor pruning is applied to the visual encoder, and the Taylor-pruned visual encoder is used to extract multi-scale visual features of the remote sensing image; a text encoder is used to extract referential text features of the natural language description associated with the remote sensing image. The multi-scale visual features and the referential text features are multiplied element-wise by a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features. A bidirectional path enhancement module is used to perform bidirectional weighted fusion of the multi-scale cross-modal fusion features in a top-down and bottom-up manner to obtain multi-scale enhanced fusion features; A lightweight feature pyramid decoder is used to decode the multi-scale enhanced fusion features to obtain a decoded feature map, and the decoded feature map is then segmented and predicted to obtain the referential segmentation result.

[0007] In one possible implementation, the Taylor pruning of the visual encoder, and the extraction of multi-scale visual features from the remote sensing image using the Taylor-pruned visual encoder, includes: The importance of the coding basic blocks in the visual encoder is scored using a Taylor pruning mechanism scoring function to obtain the importance ranking of each coding basic block; Based on the importance ranking, the basic coding blocks are retained according to a preset pruning ratio to obtain a pruned visual encoder; The pruned visual encoder is used to extract multi-scale visual features of remote sensing images at a preset number of stages.

[0008] In one possible implementation, before performing element-wise multiplication of the multi-scale visual features and the referential text features using a point-by-point weighted attention mechanism, the channel dimension of the referential text features is adjusted to a dimension that matches the visual feature channels.

[0009] In one possible implementation, the step of performing element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features includes: pre-projecting the visual features at each scale to obtain projected visual features whose dimensions are consistent with those of the referential text features, where the pre-projection includes dimension transformation, linear projection, nonlinear activation, and random deactivation (dropout); determining the guiding weight of each text word feature in the referential text features with respect to the projected visual features at the current scale, and performing a weighted summation of the text word features based on the guiding weights to obtain the text alignment features corresponding to the projected visual features at each scale; multiplying the projected visual features and the text alignment features at each scale element-wise to obtain the initial fusion features at each scale; and applying multimodal projection to the initial fusion features at each scale to obtain multi-scale cross-modal fusion features, where the multimodal projection includes linear projection, nonlinear activation, and random deactivation (dropout).

[0010] In one possible implementation, the bidirectional path enhancement module performs top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features to obtain multi-scale enhanced fusion features, including: Following the order from high to low, adjacent cross-modal fusion features in the multi-scale cross-modal fusion features are weighted and fused from top to bottom to obtain multi-scale top-down fusion features. Following the order from low to high levels, adjacent top-down fusion features in the multi-scale top-down fusion features are weighted and fused from bottom to top to obtain multi-scale enhanced fusion features. Both the top-down weighted fusion and the bottom-up weighted fusion use learnable fusion parameters to transmit weighting information, and both use a normalized weight fusion mechanism to adaptively allocate the weights of the features to be fused.
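The normalized weight fusion named in this implementation could be sketched as follows. This is only an illustration under an assumption: the text says "normalized weight fusion" without giving the formula, so the sketch uses a BiFPN-style scheme in which each learnable weight is clamped to be non-negative and divided by the sum of the weights. The names `normalized_fusion` and `eps` are hypothetical.

```python
def normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-sized feature vectors with normalized non-negative weights.

    Assumption: weights are clamped with ReLU and divided by their sum
    (BiFPN-style); the source only states "normalized weight fusion".
    """
    clamped = [max(w, 0.0) for w in weights]      # ReLU keeps every weight >= 0
    total = sum(clamped) + eps                    # eps avoids division by zero
    norm = [w / total for w in clamped]           # normalized weights sum to ~1
    length = len(features[0])
    return [sum(n * f[i] for n, f in zip(norm, features)) for i in range(length)]


# Two flattened feature maps of equal size (toy values): an upsampled
# higher-level feature and the adjacent lower-level feature.
high_level_upsampled = [1.0, 1.0, 1.0, 1.0]
low_level = [3.0, 3.0, 3.0, 3.0]
fused = normalized_fusion([high_level_upsampled, low_level], weights=[1.0, 3.0])
```

In training, the entries of `weights` would be learnable parameters, so the network can decide per fusion node how much each scale contributes.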

[0011] In one possible implementation, the step of using a lightweight feature pyramid decoding head to decode the multi-scale enhanced fusion features to obtain a decoded feature map includes: mapping the enhanced fusion features at each scale to a preset channel dimension; upsampling the higher-level enhanced fusion features to the size of the adjacent lower-level enhanced fusion features using bilinear interpolation, and then fusing them with the adjacent lower-level enhanced fusion features by element-wise addition; and, after the element-wise addition, applying depthwise separable convolutional blocks for smoothing and refinement to obtain multi-scale decoded feature maps.
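To illustrate why depthwise separable convolution suits a lightweight decoding head, the weight counts of a standard convolution and a depthwise separable one can be compared. The channel width of 256 and the 3×3 kernel below are assumed values for illustration only; the source does not state them.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution mixes the channels
    return depthwise + pointwise


# Assumed decoder width of 256 channels and 3 x 3 kernels (illustrative only).
standard = conv_params(256, 256, 3)
separable = depthwise_separable_params(256, 256, 3)
ratio = separable / standard
```

Under these assumed sizes the separable block needs roughly one ninth of the weights of the standard convolution.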

[0012] In one possible implementation, the step of performing segmentation prediction on the decoded feature map to obtain the referential segmentation result includes: uniformly upsampling the multi-scale decoded feature maps to the largest scale among the decoded feature maps, and then concatenating them along the channel dimension to obtain the target fusion feature; and using a classification head to map the number of channels of the target fusion feature to the number of target categories to obtain the referential segmentation result.

[0013] In a second aspect, the present invention also provides a lightweight remote sensing image referencing segmentation system, comprising: a feature extraction module, used to perform Taylor pruning on the visual encoder, to extract multi-scale visual features of the remote sensing image using the Taylor-pruned visual encoder, and to extract referential text features of the natural language description associated with the remote sensing image using a text encoder; a feature fusion module, used to perform element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features; a feature enhancement module, used to perform top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features using the bidirectional path enhancement module to obtain multi-scale enhanced fusion features; and a prediction output module, used to decode the multi-scale enhanced fusion features using a lightweight feature pyramid decoding head to obtain a decoded feature map, and to perform segmentation prediction on the decoded feature map to obtain the referential segmentation result.

[0014] In a third aspect, the present invention also provides an electronic device including a memory and a processor, wherein the memory is used to store programs, and the processor, coupled to the memory, is used to execute the programs stored in the memory to implement the steps of the lightweight remote sensing image referencing segmentation method described in any of the above implementations.

[0015] In a fourth aspect, the present invention also provides a computer-readable storage medium for storing a computer-readable program or instructions which, when executed by a processor, implement the steps of the lightweight remote sensing image referencing segmentation method described in any of the above implementations.

[0016] The beneficial effects of this invention are as follows. First, the lightweight remote sensing image referencing segmentation method provided by this invention reduces model size and computational load by performing Taylor pruning on the visual encoder before extracting multi-scale visual features, while also extracting referential text features, thus achieving a lightweight encoder that retains effective features. Second, it fuses visual features and text features through element-wise multiplication using a point-by-point weighted attention mechanism, which aligns text semantics with visual features accurately while reducing the computational complexity of cross-modal interaction and the inference latency. Third, it performs bidirectional weighted fusion of the cross-modal fusion features in a top-down and bottom-up manner through a bidirectional path enhancement module, which adaptively determines the importance of features at different scales, enhances multi-scale contextual information interaction, and alleviates the loss of detail in small targets and the blurred boundaries of large targets. Finally, it uses a lightweight feature pyramid decoding head to decode the features and complete segmentation prediction, which improves segmentation accuracy while keeping the computational load low, achieving a balance between lightweight design and high segmentation accuracy.

Description of the Drawings

[0017] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a schematic flowchart of an embodiment of the lightweight remote sensing image referencing segmentation method provided by the present invention;
Figure 2 is a schematic flowchart of another embodiment of the lightweight remote sensing image referencing segmentation method provided by the present invention;
Figure 3 is a schematic flowchart of an embodiment of extracting multi-scale visual features from remote sensing images provided by the present invention;
Figure 4 is a schematic flowchart of an embodiment of S102 in Figure 1 provided by the present invention;
Figure 5 is a schematic flowchart of another embodiment of S102 in Figure 1 provided by the present invention;
Figure 6 is a schematic flowchart of an embodiment of S103 in Figure 1 provided by the present invention;
Figure 7 is a schematic flowchart of an embodiment of the bidirectional path enhancement module provided by the present invention;
Figure 8 is a schematic flowchart of an embodiment of obtaining a decoded feature map provided by the present invention;
Figure 9 is a schematic flowchart of an embodiment of performing segmentation prediction on a decoded feature map to obtain a referential segmentation result provided by the present invention;
Figure 10 is a schematic flowchart of an embodiment of the lightweight feature pyramid decoding head provided by the present invention;
Figure 11 is a schematic diagram of an embodiment of the remote sensing image referencing segmentation result provided by the present invention;
Figure 12 is a schematic diagram of an embodiment of the lightweight remote sensing image referencing segmentation system provided by the present invention;
Figure 13 is a schematic diagram of an embodiment of the electronic device provided by the present invention.

Detailed Implementation

[0019] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0020] In the description of the embodiments of the present invention, unless otherwise stated, "multiple" means two or more. "And / or" describes the relationship between related objects, indicating that there can be three relationships. For example, A and / or B can represent three situations: A exists alone, A and B exist simultaneously, and B exists alone.

[0021] The terms "first," "second," etc., used in the embodiments of this invention are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a technical feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature.

[0022] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0023] Before demonstrating the embodiments, the following terms will be explained.

[0024] Reference segmentation: refers to a computer vision task that segments an image to extract the target object referred to by the natural language text description.

[0025] Pruning refers to a model lightweighting technique that removes redundant or unimportant structures (channels, layers, modules, etc.) from the network while maintaining basic model performance, thereby reducing the number of model parameters and computational cost.

[0026] This invention provides a lightweight remote sensing image referencing and segmentation method, system, device, and medium, which are described below.

[0027] As shown in Figure 1, the lightweight remote sensing image referencing segmentation method provided by the present invention includes:
S101. Perform Taylor pruning on the visual encoder and use the Taylor-pruned visual encoder to extract multi-scale visual features of the remote sensing image; use the text encoder to extract referential text features of the natural language description associated with the remote sensing image.
S102. Multiply the multi-scale visual features and the referential text features element-wise through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features.
S103. Use a bidirectional path enhancement module to perform top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features to obtain multi-scale enhanced fusion features.
S104. Use a lightweight feature pyramid decoding head to decode the multi-scale enhanced fusion features to obtain a decoded feature map, and perform segmentation prediction on the decoded feature map to obtain the referential segmentation result.

[0028] It should be noted that mainstream visual backbone networks in this field typically contain a large number of redundant parameters, resulting in significant model size and computational overhead, which is not conducive to deployment on resource-constrained edge devices. Model pruning is an effective lightweighting technique that can remove redundant parameters to reduce model complexity while preserving the model's main performance. Taylor pruning is a model pruning method based on Taylor expansion. It selectively removes unimportant parameters by evaluating their impact on the loss function, achieving efficient and stable model compression. This improves the model's computational efficiency and inference speed while maintaining feature extraction capabilities.

[0029] In summary, the lightweight remote sensing image referencing segmentation method provided in this invention first reduces model size and computational cost by performing Taylor pruning on the visual encoder before extracting multi-scale visual features, while also extracting referential text features, thus achieving a lightweight encoder that retains effective features. Second, it fuses visual and text features through element-wise multiplication using a point-by-point weighted attention mechanism, achieving precise alignment between text semantics and visual features, reducing the computational complexity of cross-modal interaction, and decreasing inference latency. Third, it performs bidirectional weighted fusion of the cross-modal fusion features using a bidirectional path enhancement module, which adaptively determines the importance of features at different scales, enhances multi-scale contextual information interaction, and alleviates the loss of detail in small targets and the blurred boundaries of large targets. Finally, it employs a lightweight feature pyramid decoding head for feature decoding and segmentation prediction, which improves segmentation accuracy while maintaining low computational cost, achieving a balance between lightweight design and high segmentation accuracy.

[0030] In some embodiments of the present invention, as shown in Figure 2, a complete end-to-end network model is constructed that receives two inputs in parallel. One input is the remote sensing image to be processed, which contains a complex background (the left-hand side of Figure 2 shows an airport apron scene). The other input is natural language text describing the target to be segmented ("An airplane at the bottom"). The model aims to accurately segment the corresponding target region in the image based on the text description.

[0031] In some embodiments of the present invention, the visual encoder employs the Swin Transformer to extract visual features at different scales across four stages, and the text encoder employs BERT. The Swin Transformer is a hierarchical visual Transformer model that achieves efficient feature extraction through windowed self-attention and a shifted-window mechanism. It can output multi-scale features, adapting to the large differences in target scale in remote sensing images, and thus provides visual feature representations at different stages for this application. BERT is a pre-trained language model based on a bidirectional Transformer structure. It captures textual semantic information through bidirectional contextual modeling, transforming referring expressions into context-aware textual features and providing semantic support for cross-modal feature alignment in this application. As shown in Figure 2, after pruning, the numbers of Swin blocks retained in the four stages of the Swin Transformer visual encoder are N1, N2, N3, and N4, respectively, and these numbers may differ from one another. The Swin Transformer visual encoder used in this invention comprises four stages, each with a pre-processing module: the first stage uses a patch embedding layer, and the second to fourth stages use a patch merging layer. Their functions and configuration logic are as follows. In the first stage, the patch embedding layer performs the initial patch division and linear projection on the original input image, transforming image pixels into embedding features that can be processed by the Swin Transformer; it also performs the initial spatial downsampling, laying the foundation for subsequent feature extraction. In the second to fourth stages, the patch merging layer downsamples the features output by the previous stage and expands the channel dimension. While reducing the spatial resolution of the feature map, it increases the number of channels, builds the multi-scale feature hierarchy, and gradually expands the receptive field to suit the feature extraction needs of targets at different scales in remote sensing images.
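The stage-wise downsampling and channel expansion described above can be made concrete with a small shape calculation. This is a sketch under assumptions: a patch size of 4, a base width of 96 channels, and channel doubling at each patch merging layer are the values of the standard Swin-T configuration, not values stated in the source; `swin_stage_shapes` is a hypothetical helper.

```python
def swin_stage_shapes(height, width, patch_size=4, base_channels=96, num_stages=4):
    """Return (H, W, C) of the feature map produced by each encoder stage.

    Assumptions: patch embedding downsamples by `patch_size`; every patch
    merging layer halves the resolution and doubles the channel count.
    """
    shapes = []
    stride, channels = patch_size, base_channels
    for _ in range(num_stages):
        shapes.append((height // stride, width // stride, channels))
        stride *= 2        # patch merging halves the spatial resolution
        channels *= 2      # and doubles the number of channels
    return shapes


# A 512 x 512 input yields features at 1/4, 1/8, 1/16 and 1/32 resolution.
shapes = swin_stage_shapes(512, 512)
```

Because a Swin block preserves the shape of its input, removing blocks inside a stage changes the retained counts N1 to N4 but not these output shapes.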

[0032] In some embodiments of the present invention, performing Taylor pruning on the visual encoder and using the Taylor-pruned visual encoder to extract multi-scale visual features of remote sensing images, as shown in Figure 3, includes: S301. The importance of the basic coding blocks in the visual encoder is scored using the Taylor pruning mechanism scoring function, and the importance ranking of the basic coding blocks is obtained; S302. Based on the importance ranking, the basic coding blocks are retained according to the preset pruning ratio to obtain the pruned visual encoder; S303. The pruned visual encoder is used to extract multi-scale visual features of remote sensing images at a preset number of stages.
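The ranking and retention steps could look like the sketch below. Several details are assumptions rather than statements of the source: blocks are ranked globally across stages, the number of pruned blocks is `int(total * prune_ratio)`, and the best block of a stage is kept even if it ranks low so that every stage still produces features. The name `select_blocks` and the toy scores are hypothetical.

```python
def select_blocks(scores, prune_ratio):
    """Pick the blocks to retain.

    scores: dict mapping (stage, block_index) -> importance score.
    prune_ratio: fraction of blocks to remove (preset pruning ratio).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)   # most important first
    num_keep = len(ranked) - int(len(ranked) * prune_ratio)
    kept = set(ranked[:num_keep])
    # Assumption: each stage keeps at least its best block so that the
    # encoder still outputs a feature map at every scale.
    for stage in {s for s, _ in scores}:
        if not any(s == stage for s, _ in kept):
            best = max((key for key in scores if key[0] == stage), key=scores.get)
            kept.add(best)
    return sorted(kept)


# Toy scores for three stages with two blocks each.
scores = {(1, 0): 0.9, (1, 1): 0.2, (2, 0): 0.8, (2, 1): 0.7, (3, 0): 0.1, (3, 1): 0.05}
kept = select_blocks(scores, prune_ratio=0.5)
per_stage = {stage: sum(1 for s, _ in kept if s == stage) for stage in (1, 2, 3)}
```

With a global ranking, the retained count naturally differs from stage to stage, which matches the description of differing N1 to N4.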

[0033] In some embodiments of the present invention, before the Taylor-pruned visual encoder is used to extract multi-scale visual features from remote sensing images, the input image is resized and standardized so that its resolution suits feature extraction at scales from 1/4 to 1/32 of the input size.

[0034] It should be noted that during feature extraction, the visual encoder activates its internally integrated Taylor block pruning mechanism. During model training, this mechanism uses gradient backpropagation information to evaluate the importance $I_i$ of each Swin Transformer block in real time with the scoring function shown in Equation 1:

$$I_i = \left| \sum \left( \frac{\partial \mathcal{L}}{\partial x_i} \odot x_i \right) \right| \tag{1}$$

where $\mathcal{L}$ represents the total loss function, $x_i$ represents the output feature of the i-th Swin block, $\odot$ denotes element-wise multiplication, and the sum runs over all elements of the product. A score is calculated for each block, identifying redundant modules that contribute little to the final result. During the inference phase, the visual encoder automatically removes low-scoring redundant modules based on this score and outputs four levels of stage features from bottom to top (denoted as ...). The resolutions of these four feature levels are 1/4, 1/8, 1/16, and 1/32 of the original image, respectively. The number of channels increases with depth, corresponding to the shallow geometric details and the deep semantic information of the image.
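A first-order Taylor score of this kind can be sketched in a few lines. The sketch assumes the score is the absolute value of the summed element-wise product of a block's output feature and the gradient of the loss with respect to that feature, which is a common form of Taylor importance; the exact aggregation in the source formula is not recoverable, and `taylor_block_score` and the toy numbers are hypothetical.

```python
def taylor_block_score(activation, gradient):
    """First-order Taylor importance of one block.

    activation: flattened output feature x_i of the block.
    gradient:   flattened dL/dx_i from backpropagation, same length.
    Assumption: the score is |sum(dL/dx_i * x_i)| over all elements.
    """
    return abs(sum(g * x for g, x in zip(gradient, activation)))


def average_scores(batch_scores):
    """Average per-block scores collected over several training batches."""
    num_batches = len(batch_scores)
    num_blocks = len(batch_scores[0])
    return [sum(b[i] for b in batch_scores) / num_batches for i in range(num_blocks)]


# Toy example: one block, three feature elements.
score = taylor_block_score(activation=[0.5, -1.0, 2.0], gradient=[0.2, 0.1, -0.3])
mean_scores = average_scores([[0.6, 0.1], [0.4, 0.3]])
```

The product approximates how much the loss would change if the block's output were zeroed, so blocks with a small score are candidates for removal.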

[0035] It should be noted that the method of this invention covers the training, validation, and prediction stages, among which training and validation are conventional techniques in the field and will not be elaborated upon here. The focus is on a detailed description of the model architecture and improvement methods employed.

[0036] This embodiment makes the visual encoder lightweight through Taylor pruning: it scores and ranks the importance of the encoder's basic coding blocks and retains the key coding blocks according to a preset pruning ratio. This effectively reduces redundant parameters and computational load while preserving the multi-scale feature extraction capability of the visual encoder, providing lightweight and effective visual features for subsequent cross-modal fusion and segmentation prediction. It addresses the redundant parameters and the difficulty of edge deployment found in existing remote sensing reference segmentation models.

[0037] In some embodiments of the present invention, before performing element-wise multiplication of multi-scale visual features and referential text features through a point-by-point weighted attention mechanism, the channel dimension of the referential text features is adjusted to a dimension that matches the visual feature channel.

[0038] In some embodiments of the present invention, as shown in Figure 2, the text projection layer receives the raw text features output by the text encoder. To achieve cross-modal alignment, the text projection layer performs a linear mapping that adjusts the channel dimension of the text features to match the channel dimension of the visual features, outputting unified text embedding features.

[0039] This embodiment linearly maps the referential text features through a text projection layer, adjusting the channel dimension of the text features to match the dimension of the visual features. This achieves dimensional alignment of cross-modal features, providing a basis for dimensional matching for the element-wise multiplication operation of the subsequent point-by-point weighted attention mechanism. It ensures the smooth cross-modal fusion of multi-scale visual features and referential text features, and improves the effectiveness and stability of feature interaction.

[0040] In some embodiments of the present invention, performing element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features, as shown in Figure 4, includes: S401. Perform pre-projection processing on the visual features at each scale to obtain projected visual features whose dimensions are consistent with those of the referential text features; the pre-projection processing includes dimension transformation, linear projection, non-linear activation and random deactivation (dropout); S402. Determine the guiding weight of each text word feature in the referential text features with respect to the projected visual features at the current scale, and perform a weighted summation of the text word features based on the guiding weights to obtain the text alignment features corresponding to the projected visual features at each scale; S403. Perform element-wise multiplication on the projected visual features and the text alignment features at each scale to obtain the initial fusion features at each scale, and perform multimodal projection on the initial fusion features at each scale to obtain multi-scale cross-modal fusion features; the multimodal projection includes linear projection, nonlinear activation and random deactivation (dropout).

[0041] In some embodiments of the present invention, as shown in Figure 2, the Point-wise Weighted Attention Mechanism (PWAM) module receives data from the two sources mentioned above. To handle multi-scale information, four parallel PWAM modules are configured. Each module receives the visual stage features of one level as its main input and the aforementioned unified text embedding features as its auxiliary input.

[0042] In some embodiments of the present invention, the internal processing flow of the point-by-point weighted attention mechanism is as shown in Figure 5. The input visual features are first processed by a 1×1 convolutional layer (linear projection) for dimensionality reduction or feature recombination. A non-linear activation function (GELU) is then applied, followed by a Dropout layer (with a dropout rate of 0.1) to prevent overfitting, generating the query feature Q. The input text features undergo linear transformations to generate the keys K and values V. Q and K are then input to the Image-Language Attention (ILA) unit, which calculates the text semantic response heatmap in the image space and identifies which regions of the image are most relevant to the text description. The generated attention weights are combined with the text values V, and the result is multiplied element-wise with the original visual query feature Q. This achieves "point-by-point weighting" of the visual features based on textual semantics, suppressing background noise irrelevant to the description. The fused features are then passed through a multimodal projection layer consisting of a 1×1 convolution, GELU, and Dropout, outputting four levels of cross-modal fused features (the fused features P1, P2, P3, and P4 referred to below).

[0043] This embodiment uses parallel point-wise weighted attention mechanism (PWAM) modules to handle the multi-scale visual features. First, the visual features are pre-projected to a dimension consistent with that of the text features. Then, the guiding weights of the text word features are calculated and the text alignment features are generated. Cross-modal fusion is achieved through element-wise multiplication. At the same time, multimodal projection is used to reduce computational complexity and to provide efficient and robust cross-modal fusion features for the subsequent segmentation task.
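The core of the point-wise weighting (the weighted summation of word features followed by element-wise multiplication) can be illustrated with a small sketch. It makes several assumptions that the text does not confirm: the guiding weights are computed here as a scaled dot product followed by a softmax, the function name `pwam_fuse` is hypothetical, and the pre-projection and multimodal projection layers (1×1 convolution, GELU, Dropout) are omitted for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pwam_fuse(visual, words):
    """visual: list of pixel vectors (N x C); words: list of word vectors (T x C).
    Returns N x C fused features."""
    scale = math.sqrt(len(words[0]))
    fused = []
    for q in visual:
        # Guiding weight of each word for this pixel
        # (assumption: scaled dot product followed by softmax).
        scores = [sum(a * b for a, b in zip(q, w)) / scale for w in words]
        weights = softmax(scores)
        # Weighted sum of word features: the text alignment feature for this pixel.
        aligned = [sum(wt * w[c] for wt, w in zip(weights, words)) for c in range(len(q))]
        # Element-wise multiplication of the visual feature and its aligned text feature.
        fused.append([qc * ac for qc, ac in zip(q, aligned)])
    return fused

# Two pixels and two words, each with 2 channels (toy values).
visual = [[1.0, 0.0], [0.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]
fused = pwam_fuse(visual, words)
print(fused)
```

In this toy case each pixel attends mostly to the word that points in the same direction, so the channel matching the description is amplified while the other channel stays at zero.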

[0044] In some embodiments of the present invention, a bidirectional path enhancement module is used to perform top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features to obtain multi-scale enhanced fusion features. As shown in Figure 6, this includes:
S601. Following the order from high level to low level, perform top-down weighted fusion on adjacent cross-modal fusion features in the multi-scale cross-modal fusion features to obtain multi-scale top-down fusion features.
S602. Following the order from low level to high level, perform bottom-up weighted fusion on adjacent top-down fusion features in the multi-scale top-down fusion features to obtain the multi-scale enhanced fusion features.
S603. Both the top-down weighted fusion and the bottom-up weighted fusion use learnable fusion parameters to pass weighting information, and both use a normalized weight fusion mechanism to adaptively allocate the weights of the features to be fused.

[0045] In some embodiments of the present invention, the fused features P1, P2, P3, and P4 are fed into the enhancement structure of the Bidirectional Hierarchical Path Module (BHPM). This structure is based on the BiFPN (Bidirectional Feature Pyramid Network) principle and introduces learnable normalized weights at each fusion node. Its core fusion formula is shown in Equation 2:

$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i \qquad (2)$$

where $I_i$ is the $i$-th input feature of the fusion node, $O$ is the fused output, $w_i$ and $w_j$ are the scalar weights that the network automatically learns during training, and $\epsilon$ is a small constant that prevents division by zero. This mechanism allows the network to automatically determine the importance of features at different scales to the current task.
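A single normalized weighted fusion node of the kind described above can be sketched as follows. The function name `fuse`, the value `eps=1e-4`, and the clamping of weights to non-negative values (as BiFPN does with ReLU) are assumptions made for this illustration; the text only states that the weights are learnable, normalized scalars and that a small constant prevents division by zero.

```python
def fuse(features, weights, eps=1e-4):
    """Normalized weighted fusion of same-length feature vectors."""
    # Keep weights non-negative so the normalizer stays positive
    # (assumption borrowed from BiFPN, which applies ReLU to the weights).
    w = [max(0.0, x) for x in weights]
    norm = sum(w) + eps
    length = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / norm for k in range(length)]

low = [1.0, 2.0, 3.0]    # toy low-level feature
high = [3.0, 2.0, 1.0]   # toy high-level feature, already resized to match
out = fuse([low, high], weights=[3.0, 1.0])
print([round(v, 3) for v in out])  # [1.5, 2.0, 2.5]
```

With weights 3 and 1 the output sits three quarters of the way toward the low-level feature, which shows how the learned scalars set the relative importance of each input.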

[0046] In some embodiments of the present invention, the implementation details of the bidirectional path enhancement module are as shown in Figure 7. First, the top-down data flow is initiated. The data flow starts from the highest-level feature. After this feature is upsampled to the resolution of the adjacent lower level, the weighted fusion node is calculated according to Equation 2, as shown in Equation 3. This process is repeated to generate the intermediate features containing strong semantic information for the remaining levels, as shown in Equations 4 and 5. This process transmits strong semantic information from higher levels to lower levels, enhancing the model's ability to locate targets.

[0047] Then, the bottom-up data flow is initiated. The lowest-level feature is downsampled (usually using max pooling or strided convolution) to the resolution of the adjacent higher level and is weighted-fused with the top-down intermediate feature of that level; the weighted fusion node is calculated according to Equation 2, as shown in Equation 6. The features of the remaining higher levels are then calculated upwards in sequence, as shown in Equations 7 and 8. By using the fine-grained texture information of the bottom layer to repair the spatial details of the high-level features, the module finally outputs a bidirectionally enhanced four-layer feature pyramid.
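For orientation, one plausible form of the top-down and bottom-up fusion nodes is written out below. This is a hedged sketch, not the patent's own equations: it assumes that P4 is the highest level, that each node fuses exactly two inputs, and it introduces the hypothetical symbols $P_i^{td}$ (top-down intermediate features), $P_i^{out}$ (enhanced outputs), $w_{i,k}$ and $w'_{i,k}$ (learnable weights), and $\mathrm{Up}$ / $\mathrm{Down}$ (resampling to the adjacent level).

```latex
\begin{aligned}
% Top-down path (cf. Equations 3 to 5)
P_4^{td} &= P_4 \\
P_i^{td} &= \frac{w_{i,1}\, P_i + w_{i,2}\, \mathrm{Up}\!\left(P_{i+1}^{td}\right)}
                 {w_{i,1} + w_{i,2} + \epsilon}, \qquad i = 3, 2, 1 \\[4pt]
% Bottom-up path (cf. Equations 6 to 8)
P_1^{out} &= P_1^{td} \\
P_i^{out} &= \frac{w'_{i,1}\, P_i^{td} + w'_{i,2}\, \mathrm{Down}\!\left(P_{i-1}^{out}\right)}
                  {w'_{i,1} + w'_{i,2} + \epsilon}, \qquad i = 2, 3, 4
\end{aligned}
```

Under these assumptions there are three fusion nodes in each direction, which matches the three equations cited for each path.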

[0048] This embodiment uses a bidirectional path enhancement module to perform top-down and bottom-up weighted fusion of multi-scale cross-modal fusion features, and adopts learnable fusion parameters and normalized weights for adaptive allocation. It fully integrates high- and low-level semantic and detailed information, effectively improving the expressive power and robustness of multi-scale features, and providing more accurate enhanced features for subsequent segmentation output.

[0049] In some embodiments of the present invention, a lightweight feature pyramid decoding head is used to decode the multi-scale enhanced fusion features to obtain a decoded feature map. As shown in Figure 8, this includes:
S801. Map the enhanced fusion features at every scale to a preset channel dimension.
S802. Upsample the high-level enhanced fusion features to the size of the adjacent low-level enhanced fusion features through bilinear interpolation, and then fuse them with the adjacent low-level enhanced fusion features by element-wise addition.
S803. After the element-wise addition, use depthwise separable convolutional blocks for smoothing and refinement to obtain the multi-scale decoded feature maps.

[0050] It should be noted that the lightweight feature pyramid decoder of this invention achieves its lightweight nature by replacing traditional standard convolution with depthwise separable convolution. Specifically, depthwise separable convolution decomposes ordinary convolution into two independent operations: channel-wise spatial convolution and pointwise channel fusion, significantly reducing the number of parameters and computational cost. At the same time, combined with a simple fusion method of unified feature channel mapping and element-wise addition after upsampling, it significantly reduces the model's computational overhead and number of parameters while ensuring multi-scale feature decoding performance.
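The parameter saving from this decomposition can be checked with simple arithmetic. The sketch below counts weights only (biases and normalization parameters are ignored) and assumes a 3×3 kernel with 256 input and 256 output channels, matching the 256-channel preset mentioned for the decoding head; the function names are illustrative.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

standard = conv_params(256, 256, 3)
separable = dw_separable_params(256, 256, 3)
print(standard, separable, round(standard / separable, 2))  # 589824 67840 8.69
```

For this configuration the separable block needs roughly one ninth of the weights of a standard convolution, and the number of multiply-accumulate operations per output position shrinks by the same factor.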

[0051] This embodiment uses a lightweight feature pyramid decoding head to uniformly map multi-scale enhanced fusion features to a preset channel dimension. After upsampling and element-wise addition and fusion, depthwise separable convolutional blocks are used for smoothing and refinement. This reduces the computational load in the decoding stage while effectively fusing multi-scale contextual information, thereby improving the detail and semantic expression capabilities of the decoded feature map.

[0052] In some embodiments of the present invention, segmentation prediction is performed on the decoded feature map to obtain the referential segmentation result. As shown in Figure 9, this includes:
S901. Upsample the multi-scale decoded feature maps to the maximum scale among the decoded feature maps and perform channel splicing (concatenation) to obtain the target fusion features.
S902. Use the classification head to map the number of channels of the target fusion features to the number of target categories to obtain the referential segmentation result.

[0053] This embodiment upsamples the multi-scale decoded feature maps to the maximum scale and performs channel splicing, fully integrating semantic and detailed information from different levels. The final segmentation result is then obtained through classification head mapping, which can improve the segmentation accuracy of small targets and complex scenes, and ensure the integrity and accuracy of the referential segmentation results.

[0054] In some embodiments of the present invention, Figure 10 is a schematic flowchart of an embodiment of the lightweight feature pyramid decoding head provided by the present invention. As shown in Figure 10, to ensure a lightweight design, this structure first uses lateral connection units (composed of 1×1 convolutions) to uniformly compress the number of channels of the four layers of enhanced features to a preset value (256 channels), eliminating the differences in channel count; in Figure 10, C1, C2, C3, and C4 represent these four layers of features. Next, the feature aggregation of the LiteFPN is performed using top-down logic: high-level features are upsampled by a factor of 2 through bilinear interpolation and then added element-wise to the adjacent low-level features. Each fused feature layer then enters the feature refinement unit in the LiteFPN Head structure. This feature refinement unit consists of a depthwise separable convolutional block (DWConvBlock), which sequentially extracts spatial features through a 3×3 depthwise convolution, fuses channel information through a 1×1 pointwise convolution, normalizes the data distribution through batch normalization, and activates the features through ReLU. After refinement, four feature layers are obtained. Finally, these features are fed into the classification head. The classification head first performs an upsampling operation that uniformly adjusts the features from all levels to the highest resolution (i.e., ...), and then performs a channel splicing (concatenation) operation to generate a high-dimensional fused feature x, whose number of channels is equal to 4 times the channel dimension of the FPN (1024 by default).
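The shape bookkeeping of this decoding head can be traced with a short sketch. The input shapes below (four levels with strides that halve the resolution each time, and backbone channel counts of 128, 256, 512 and 1024) are assumptions chosen for illustration, as is the function name `lite_fpn_shapes`; only the 256-channel preset, the ×2 upsampling, and the 4 × 256 = 1024 concatenated channels come from the text.

```python
def lite_fpn_shapes(level_shapes, fpn_channels=256):
    """level_shapes: list of (channels, height, width), ordered from lowest to highest level.
    Returns the shape of the concatenated feature fed to the classification head."""
    # Lateral 1x1 convolutions: only the channel count changes.
    lateral = [(fpn_channels, h, w) for _, h, w in level_shapes]
    # Top-down aggregation: each x2-upsampled higher level must match the level below it.
    for (_, h_lo, w_lo), (_, h_hi, w_hi) in zip(lateral, lateral[1:]):
        if (h_hi * 2, w_hi * 2) != (h_lo, w_lo):
            raise ValueError("adjacent levels must differ by a factor of 2")
    # Classification head: upsample all levels to the largest map, then concatenate channels.
    _, h_max, w_max = lateral[0]
    return (fpn_channels * len(lateral), h_max, w_max)

# Assumed multi-scale feature shapes for a 480 x 480 input.
shapes = [(128, 120, 120), (256, 60, 60), (512, 30, 30), (1024, 15, 15)]
fused_shape = lite_fpn_shapes(shapes)
print(fused_shape)  # (1024, 120, 120)
```

The concatenated feature keeps the spatial size of the lowest (largest) level and carries four times the FPN channel dimension, in line with the 1024-dimension default stated above.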

[0055] The classification head then uses its internal DWConvBlock and 1×1 convolutional layers to map the high-dimensional fused features to the number of target class channels (typically 2, i.e., foreground and background). Finally, the model outputs a pixel-level segmentation mask corresponding to the input image. After the mask is processed with the Sigmoid function, regions whose pixel values are greater than a threshold are identified as the specific remote sensing target indicated by the natural language description (e.g., the airplane marked in red on the right of Figure 2).
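The final thresholding step can be illustrated as follows. The threshold value of 0.5 and the function name `logits_to_mask` are assumptions for this sketch; the text only states that pixels whose Sigmoid output exceeds a threshold are assigned to the target.

```python
import math

def sigmoid(x):
    """Logistic function mapping a raw score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def logits_to_mask(logits, threshold=0.5):
    """Convert a 2-D grid of raw scores into a binary mask (1 = target, 0 = background)."""
    return [[1 if sigmoid(v) > threshold else 0 for v in row] for row in logits]

logits = [[-2.0, 0.3], [1.5, -0.1]]  # toy 2 x 2 output of the classification head
mask = logits_to_mask(logits)
print(mask)  # [[0, 1], [1, 0]]
```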

[0056] In some embodiments of the present invention, Figure 11 is a schematic diagram of an embodiment of the remote sensing image referencing segmentation results provided by the present invention. Figure 11 includes four typical test scenarios, covering targets at different scales. The top left corner of each scenario displays the natural language referring expression, which describes the location and category of the target to be segmented. The four images in each scenario, from left to right, are the original remote sensing image, the ground truth (GT, i.e., the manually labeled segmentation mask of the target), the prediction result (i.e., the segmentation mask obtained by using the lightweight remote sensing image referencing segmentation method of this invention), and the overlay view (i.e., the prediction result overlaid on the original image, which intuitively presents the accuracy of target localization and segmentation).

[0057] In some embodiments of this invention, to verify the effectiveness of the method, comparative experiments were conducted on RRSIS-D, a dataset commonly used in the field of remote sensing referencing segmentation. This dataset contains diverse remote sensing scenes (such as airports, ports, and cities) and corresponding complex natural language descriptions, enabling a comprehensive evaluation of the model's segmentation capabilities across different scales and complex backgrounds. The experimental results are shown in Table 1.
Table 1: Comparison of performance metrics of various models on the RRSIS-D dataset

[0058] It should be noted that, among the comparison methods used, CMPC+ is a classic referencing segmentation method based on cross-modal contrastive learning; LGCE is a language-guided cross-scale enhancement method for remote sensing scenes; LAVT is a high-precision referencing segmentation method based on Transformer; and RMSIN is a mainstream remote sensing referencing segmentation method for the RRSIS-D dataset. These methods represent traditional classic baselines, remote sensing-specific methods, Transformer-based high-precision methods, and the current best baseline, respectively. Selecting them for comparison comprehensively verifies the advancement and effectiveness of this invention in terms of lightweight design, multi-scale fusion, and cross-modal alignment. Regarding the text encoder, while pruning is suitable for extreme model compression and customized optimization, directly using lightweight pre-trained models such as BERT-Tiny and BERT-Small is more efficient and reliable in practical engineering applications, achieving a lightweight design while preserving semantic expressiveness.

[0059] Among the selected metrics, parameter count (M) represents the total number of learnable parameters of the model, used to measure the lightweight nature of the model; computational cost (MACs) represents the amount of computation required for inference, reflecting the computational overhead and running speed of the model; oIoU is the overall IoU, and mIoU is the mean IoU. Both are dimensionless and are used to measure segmentation accuracy. The higher the value, the better the segmentation effect.
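The difference between the two accuracy metrics can be made concrete with a small sketch. It assumes the definitions commonly used in referring segmentation, namely that oIoU divides the total intersection by the total union over all samples while mIoU averages the per-sample IoU; the function names and the toy masks are illustrative and not taken from the patent.

```python
def iou_counts(pred, gt):
    """Intersection and union pixel counts for two flat binary masks."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter, union

def overall_and_mean_iou(samples):
    """samples: list of (pred, gt) flat binary masks. Returns (oIoU, mIoU)."""
    total_inter = total_union = 0
    per_sample = []
    for pred, gt in samples:
        inter, union = iou_counts(pred, gt)
        total_inter += inter
        total_union += union
        # An empty prediction on an empty ground truth is counted as a perfect match.
        per_sample.append(inter / union if union else 1.0)
    oiou = total_inter / total_union if total_union else 1.0
    miou = sum(per_sample) / len(per_sample)
    return oiou, miou

samples = [
    ([1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 0]),  # large target: IoU = 4/5
    ([1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0]),  # small target: IoU = 1/2
]
oiou, miou = overall_and_mean_iou(samples)
print(round(oiou, 3), round(miou, 3))  # 0.714 0.65
```

Because oIoU pools pixels across samples, large targets dominate it, whereas mIoU weights every sample equally and is therefore more sensitive to small targets.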

[0060] According to the experimental data in Table 1, the computational costs (MACs) of existing techniques such as RMSIN and LGCE are 154.3G and 200.25G, respectively. In contrast, the computational cost of the method described in this invention (taking the Swin-B + BERT-T configuration as an example) on the RRSIS-D dataset is only 54.84G. This indicates that, thanks to Taylor block pruning and the design of LiteFPN, this invention reduces computational cost by approximately 60%-70% (about 64% relative to RMSIN and about 73% relative to LGCE). While significantly reducing computational cost, the method of this invention still maintains highly competitive accuracy. With the Swin-B + BERT-S configuration, the mIoU of this invention reaches 64.41% and the oIoU reaches 77.83%, which is superior to CMPC+ (mIoU 51.41%), LGCE (mIoU 60.16%), and LAVT (mIoU 61.46%).
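The stated reduction can be verified directly from the three MAC figures quoted above. The helper name `reduction_percent` is illustrative; the numbers are the ones given in the text.

```python
def reduction_percent(baseline, ours):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - ours / baseline)

ours = 54.84  # GMACs, Swin-B + BERT-T configuration
for name, macs in [("RMSIN", 154.3), ("LGCE", 200.25)]:
    print(f"{name}: {reduction_percent(macs, ours):.1f}% fewer MACs")
# RMSIN: 64.5% fewer MACs
# LGCE: 72.6% fewer MACs
```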

[0061] Experimental data based on the RRSIS-D dataset show that this invention, through the organic combination of point-by-point weighted attention mechanism, bidirectional path enhancement module and lightweight feature pyramid decoding head, effectively improves the accuracy of remote sensing image referencing segmentation while significantly reducing model computational load (MACs), achieving the best balance between speed and accuracy.

[0062] This embodiment significantly reduces the number of model parameters and computational cost by using a Taylor-pruned compressed visual encoder combined with a lightweight BERT text encoder. It achieves precise alignment of visual and textual features through point-wise weighted attention (PWAM) to reduce cross-modal interaction overhead. A bidirectional path enhancement module adaptively fuses multi-scale context to enhance the complementarity of high- and low-level features. Finally, a lightweight feature pyramid decoder efficiently decodes the data, ultimately ensuring high accuracy and robustness of remote sensing image referencing and segmentation while significantly improving the lightweight design, achieving a balance between speed and performance.

[0063] To better implement the lightweight remote sensing image referencing segmentation method of this embodiment of the invention, and on the basis of that method, this embodiment of the invention correspondingly also provides a lightweight remote sensing image referencing segmentation system, as shown in Figure 12. The lightweight remote sensing image referencing segmentation system 1200 includes:
The feature extraction module 1201, which is used to perform Taylor pruning on the visual encoder and to extract multi-scale visual features of the remote sensing image using the Taylor-pruned visual encoder, and to extract referential text features of the natural language description associated with the remote sensing image using a text encoder;
The feature fusion module 1202, which is used to perform element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features;
The feature enhancement module 1203, which is used to perform top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features using a bidirectional path enhancement module to obtain multi-scale enhanced fusion features;
The prediction output module 1204, which is used to decode the multi-scale enhanced fusion features using a lightweight feature pyramid decoding head to obtain a decoded feature map, and to perform segmentation prediction on the decoded feature map to obtain the referential segmentation result.

[0064] As shown in Figure 13, the present invention also provides an electronic device 1300. The electronic device 1300 includes a processor 1301, a memory 1302, and a display 1303. Figure 13 shows only some components of the electronic device 1300, but it should be understood that it is not required to implement all of the components shown, and more or fewer components may be implemented instead.

[0065] In some embodiments, processor 1301 may be a central processing unit (CPU), microprocessor, or other data processing chip, used to run program code stored in memory 1302 or process data, such as the lightweight remote sensing image referencing segmentation method of the present invention.

[0066] In some embodiments, processor 1301 may be a single server or a group of servers. The server group may be centralized or distributed. In some embodiments, processor 1301 may be local or remote. In some embodiments, processor 1301 may be implemented on a cloud platform. In one embodiment, the cloud platform may include a private cloud, public cloud, hybrid cloud, community cloud, distributed cloud, internal cloud, multi-cloud, etc., or any combination thereof.

[0067] In some embodiments, memory 1302 may be an internal storage unit of electronic device 1300, such as a hard disk or memory of electronic device 1300. In other embodiments, memory 1302 may also be an external storage device of electronic device 1300, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. equipped on electronic device 1300.

[0068] Furthermore, the memory 1302 may include both internal storage units of the electronic device 1300 and external storage devices. The memory 1302 is used to store application software and various types of data installed on the electronic device 1300.

[0069] In some embodiments, display 1303 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. Display 1303 is used to display information from electronic device 1300 and to display a visual user interface. Components 1301-1303 of electronic device 1300 communicate with each other via a system bus.

[0070] In one embodiment, when processor 1301 executes the lightweight remote sensing image referencing segmentation program in memory 1302, the following steps can be implemented: Taylor pruning is applied to the visual encoder, and the Taylor-pruned visual encoder is used to extract multi-scale visual features of the remote sensing image; a text encoder is used to extract referential text features of the natural language description associated with the remote sensing image; multi-scale cross-modal fusion features are obtained by performing element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism; a bidirectional path enhancement module is used to perform top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features to obtain multi-scale enhanced fusion features; and a lightweight feature pyramid decoding head is used to decode the multi-scale enhanced fusion features to obtain a decoded feature map, on which segmentation prediction is then performed to obtain the referential segmentation result.

[0071] It should be understood that when the processor 1301 executes the lightweight remote sensing image referencing segmentation program in the memory 1302, in addition to the functions mentioned above, it can also perform other functions, as can be found in the description of the corresponding method embodiments above.

[0072] Furthermore, the embodiments of the present invention do not specifically limit the type of electronic device 1300 mentioned. Electronic device 1300 can be a mobile phone, tablet computer, personal digital assistant (PDA), wearable device, laptop computer, or other portable electronic device. Exemplary embodiments of portable electronic devices include, but are not limited to, portable electronic devices running iOS, Android, Microsoft, or other operating systems. The aforementioned portable electronic device can also be other portable electronic devices, such as a laptop computer with a touch-sensitive surface (e.g., a touch panel). It should also be understood that in some other embodiments of the present invention, electronic device 1300 may not be a portable electronic device, but rather a desktop computer with a touch-sensitive surface (e.g., a touch panel).

[0073] Accordingly, embodiments of this application also provide a computer-readable storage medium for storing computer-readable programs or instructions. When the programs or instructions are executed by a processor, they can implement the steps or functions of the lightweight remote sensing image referencing and segmentation method provided in the above-described method embodiments.

[0074] Those skilled in the art will understand that all or part of the processes of the methods described in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.), and the computer program can be stored in a computer-readable storage medium. The computer-readable storage medium may be a disk, optical disk, read-only memory, or random access memory, etc.

[0075] The lightweight remote sensing image referencing segmentation method, system, electronic device, and storage medium provided by the present invention have been described in detail above. Specific examples have been used to illustrate the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method and core ideas of the present invention. For those skilled in the art, the specific implementation and the scope of application may vary based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.

Claims

1. A lightweight remote sensing image referencing segmentation method, characterized in that it comprises: performing Taylor pruning on a visual encoder, and extracting multi-scale visual features of a remote sensing image using the Taylor-pruned visual encoder; extracting, using a text encoder, referential text features of a natural language description associated with the remote sensing image; performing element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features; performing top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features using a bidirectional path enhancement module to obtain multi-scale enhanced fusion features; and decoding the multi-scale enhanced fusion features using a lightweight feature pyramid decoding head to obtain a decoded feature map, and performing segmentation prediction on the decoded feature map to obtain a referential segmentation result.

2. The lightweight remote sensing image referencing segmentation method according to claim 1, characterized in that performing Taylor pruning on the visual encoder and extracting multi-scale visual features of the remote sensing image using the Taylor-pruned visual encoder comprises: scoring the importance of the basic coding blocks in the visual encoder using a scoring function of a Taylor pruning mechanism to obtain an importance ranking of the basic coding blocks; retaining the basic coding blocks according to a preset pruning ratio based on the importance ranking to obtain a pruned visual encoder; and extracting multi-scale visual features of the remote sensing image at a preset number of stages using the pruned visual encoder.

3. The lightweight remote sensing image referencing segmentation method according to claim 1, characterized in that, before the element-wise multiplication of the multi-scale visual features and the referential text features is performed through the point-by-point weighted attention mechanism, the channel dimension of the referential text features is adjusted to match the channel dimension of the visual features.

4. The lightweight remote sensing image referencing segmentation method according to claim 3, characterized in that performing element-wise multiplication of the multi-scale visual features and the referential text features through the point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features comprises: performing pre-projection processing on the visual features at each scale to obtain projected visual features whose dimensions are consistent with those of the referential text features, the pre-projection processing including dimension transformation, linear projection, nonlinear activation, and random deactivation; determining the guiding weight of each text word feature in the referential text features with respect to the projected visual features at the current scale, and performing a weighted summation of the text word features based on the guiding weights to obtain the text alignment features corresponding to the projected visual features at each scale; and performing element-wise multiplication of the projected visual features and the text alignment features at each scale to obtain the initial fusion features at each scale, and performing multimodal projection on the initial fusion features at each scale to obtain the multi-scale cross-modal fusion features, the multimodal projection including linear projection, nonlinear activation, and random deactivation.

5. The lightweight remote sensing image referencing segmentation method according to claim 1, characterized in that performing top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features using the bidirectional path enhancement module to obtain multi-scale enhanced fusion features comprises: following the order from high level to low level, performing top-down weighted fusion on adjacent cross-modal fusion features in the multi-scale cross-modal fusion features to obtain multi-scale top-down fusion features; and following the order from low level to high level, performing bottom-up weighted fusion on adjacent top-down fusion features in the multi-scale top-down fusion features to obtain the multi-scale enhanced fusion features; wherein both the top-down weighted fusion and the bottom-up weighted fusion use learnable fusion parameters to pass weighting information, and both use a normalized weight fusion mechanism to adaptively allocate the weights of the features to be fused.

6. The lightweight remote sensing image referencing segmentation method according to claim 1, characterized in that decoding the multi-scale enhanced fusion features using the lightweight feature pyramid decoding head to obtain a decoded feature map comprises: mapping the enhanced fusion features at each scale to a preset channel dimension; upsampling the higher-level enhanced fusion features to the size of the adjacent lower-level enhanced fusion features using bilinear interpolation, and then fusing them with the adjacent lower-level enhanced fusion features by element-wise addition; and, after the element-wise addition, using depthwise separable convolutional blocks for smoothing and refinement to obtain multi-scale decoded feature maps.

7. The lightweight remote sensing image referencing segmentation method according to claim 1, characterized in that performing segmentation prediction on the decoded feature map to obtain the referential segmentation result comprises: uniformly upsampling the multi-scale decoded feature maps to the maximum scale among the decoded feature maps, and then performing channel splicing to obtain the target fusion features; and using a classification head to map the number of channels of the target fusion features to the number of target categories to obtain the referential segmentation result.

8. A lightweight remote sensing image referencing segmentation system, characterized in that it comprises: a feature extraction module, used to perform Taylor pruning on a visual encoder and to extract multi-scale visual features of a remote sensing image using the Taylor-pruned visual encoder, and to extract, using a text encoder, referential text features of a natural language description associated with the remote sensing image; a feature fusion module, used to perform element-wise multiplication of the multi-scale visual features and the referential text features through a point-by-point weighted attention mechanism to obtain multi-scale cross-modal fusion features; a feature enhancement module, used to perform top-down and bottom-up bidirectional weighted fusion of the multi-scale cross-modal fusion features using a bidirectional path enhancement module to obtain multi-scale enhanced fusion features; and a prediction output module, used to decode the multi-scale enhanced fusion features using a lightweight feature pyramid decoding head to obtain a decoded feature map, and to perform segmentation prediction on the decoded feature map to obtain a referential segmentation result.

9. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory is used to store a program, and the processor, coupled to the memory, is used to execute the program stored in the memory to implement the steps of the lightweight remote sensing image referencing segmentation method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store computer-readable programs or instructions which, when executed by a processor, implement the steps of the lightweight remote sensing image referencing segmentation method according to any one of claims 1 to 7.