Remote sensing image semantic segmentation method and system based on self-training large model
By using a self-training large model with dynamic instruction parsing and a self-training loop mechanism, the limited flexibility and applicability of remote sensing image semantic segmentation methods in dynamically changing tasks is addressed. This enables fine-grained segmentation of complex scenes without retraining, reducing costs and improving accuracy and stability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHENGZHOU UNIVERSITY OF AERONAUTICS
- Filing Date
- 2026-03-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing remote sensing image semantic segmentation methods are difficult to flexibly adapt to fine-grained segmentation tasks that are dynamically changing and expressed by users through natural language commands. Furthermore, they require the re-collection of labeled data and retraining of models, which is costly and limits the flexibility and applicability of the technology.
The invention employs a self-trained large model approach that combines dynamic instruction parsing, parameter-efficient hybrid fine-tuning, pixel-level subspace alignment, and a self-training loop mechanism to generate high-confidence instruction-segmentation mask paired samples from unlabeled data. These samples expand the training set and are used to update the model parameters, enabling the model to adapt to dynamic segmentation tasks without re-collecting labeled data and retraining.
It significantly improves the flexibility and applicability of semantic segmentation of remote sensing images, reduces the cost of manual annotation, and improves the accuracy of target feature identification and the stability of segmentation results.
Figure CN122244703A_ABST (abstract drawing)
Description
Technical Field
[0001] This invention relates to the field of remote sensing image processing technology, specifically to a method and system for semantic segmentation of remote sensing images based on a self-trained large model.
Background Technology
[0002] With the rapid development of Earth observation technology, the amount of high-resolution remote sensing imagery data has exploded, making the automated and intelligent semantic segmentation of remote sensing images using artificial intelligence technology a research hotspot. Semantic segmentation aims to classify each pixel in an image to understand its detailed content, and it has significant application value in many fields such as land use monitoring, environmental change detection, and urban planning. Currently, this technology is gradually shifting from traditional methods based on manually designed features to methods based on deep learning.
[0003] In the existing technology, there are many invention patents involving remote sensing image classification and segmentation. For example, invention patent CN113239815A discloses a remote sensing image classification method, device and equipment based on real semantic full network learning, which improves the classification effect by constructing a lightweight semantic heuristic encoding and decoding network model and using one-to-one scale texture features.
[0004] For example, the invention patent with publication number CN112991354A describes a high-resolution remote sensing image semantic segmentation method based on deep learning. It aims to enhance the utilization rate of features at all levels and improve segmentation accuracy by constructing a complex model that includes a data pre-training module, a feature encoding layer, a channel feature re-correction module, a feature decoding layer, a multi-level feature extraction module, a feature fusion layer, and a classification module.
[0005] While existing technologies, such as the aforementioned patents, have made progress in structural design, they typically rely on fixed category predefined structures and static network architectures. These methods generally face a common challenge: once the model is trained, the types of land features it can identify are fixed, lacking the ability to understand and respond to dynamic and complex natural language commands. This means that in practical applications, if user needs exceed the predefined category scope or involve complex scene combinations, it is necessary to re-collect labeled data and retrain the model, a cumbersome and costly process that significantly limits the flexibility and applicability of the technology.
[0006] Therefore, existing remote sensing image semantic segmentation methods struggle to flexibly adapt to and accurately complete dynamically changing fine-grained segmentation tasks expressed by users through natural language commands without requiring retraining. This invention aims to solve this key technical problem.
Summary of the Invention
[0007] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a method and system for semantic segmentation of remote sensing images based on a self-trained large model. Through dynamic instruction parsing, parameter-efficient hybrid fine-tuning, pixel-level subspace alignment, and a self-training loop mechanism, the model can understand and respond to the user's natural language instructions, enabling it to adapt to dynamic and fine-grained semantic segmentation tasks without retraining. At the same time, it effectively utilizes unlabeled data to improve model performance, significantly reduces manual annotation costs, and enhances the system's flexibility and practicality.
[0008] To solve the above-mentioned technical problems, the present invention provides the following technical solution: On the one hand, a remote sensing image semantic segmentation method based on a self-trained large model includes the following steps:
[0009] Step 1: Acquire the remote sensing image to be segmented and the natural language instructions input by the user;
[0010] Step 2: Use the dynamic instruction parsing module to parse the natural language instructions and generate a structured task description vector;
[0011] Step 3: Input the remote sensing image and the task description vector into a large model that has undergone parameter-efficient hybrid fine-tuning. The large model is built on a pre-trained visual encoder and a language encoder and is fine-tuned with low-rank adaptation and a zero-initialization gating adapter;
[0012] Step 4: Perform pixel-level subspace alignment in the fusion feature space of the large model. Through contrastive learning, the distance between the task description vector and the corresponding target pixel feature is shortened, while the distance between the task description vector and the non-target pixel feature is widened.
[0013] Step 5: Based on the aligned fused features, generate the initial semantic segmentation result using a segmentation mask decoder;
[0014] Step 6: Start the self-training loop, perform inference on the large model on unlabeled remote sensing image data, filter out high-confidence segmentation results and automatically generate corresponding instruction-segmentation mask pairing samples;
[0015] Step 7: Expand the training set using the newly generated instruction-segmentation mask paired samples, update the parameters of the large model, and achieve the co-evolution of instruction understanding ability and segmentation ability;
[0016] Step 8: Output the final optimized semantic segmentation results of the remote sensing image.
[0017] Furthermore, the dynamic instruction parsing module in step two specifically parses natural language instructions, including:
[0018] A lightweight language model is used to decompose natural language instructions into multiple semantic sub-conditions, which include spatial location, target object, and state condition.
[0019] Each semantic subcondition is transformed into a corresponding feature vector through a learnable mapping network;
[0020] The feature vectors of all semantic sub-conditions are combined into a structured task description vector.
[0021] Furthermore, the efficient parameter fine-tuning in step three specifically involves:
[0022] Inject low-rank adaptation adapters into the attention layers of the pre-trained visual encoder and language encoder;
[0023] Only train the parameters of the low-rank adaptation adapters;
[0024] Before fusing visual and linguistic features, a zero-initialization gating adapter is introduced, which is initialized to zero to ensure that the fused features are consistent with the output of the pre-trained model in the early stages of training.
[0025] Furthermore, the pixel-level subspace alignment in step four specifically includes:
[0026] Calculate the cosine similarity between the feature vector of each pixel in the fused feature map and the task description vector;
[0027] Construct a contrastive learning loss function, which maximizes the similarity between target pixel features and task description vector, while minimizing the similarity between non-target pixel features and task description vector.
[0028] Furthermore, the initiation of the self-training loop in step six specifically includes:
[0029] The large model is inferred on an unlabeled remote sensing image database to generate a predictive segmentation mask.
[0030] Calculate the uncertainty estimate for each region in the predicted segmentation mask;
[0031] Based on the uncertainty estimate and the multi-view consistency verification strategy, the predicted segmentation mask with a confidence level higher than the preset threshold is selected as a high-confidence pseudo-label.
[0032] For each high-confidence pseudo-label, a corresponding natural language description is automatically generated to form a new instruction-segmentation mask pairing sample.
[0033] Furthermore, the multi-view consistency verification strategy specifically includes:
[0034] Applying multiple different data augmentation transformations to the same unlabeled remote sensing image to generate multiple augmented views;
[0035] Inputting the augmented views into the large model to obtain multiple prediction results;
[0036] Calculating the degree of consistency among the prediction results;
[0037] Using the degree of consistency as one of the indicators for evaluating the confidence level of the pseudo-labels.
[0038] On the other hand, a remote sensing image semantic segmentation system based on a self-trained large model, which implements the remote sensing image semantic segmentation method based on a self-trained large model described above, includes:
[0039] The data acquisition module is used to acquire the remote sensing images to be segmented and the natural language commands input by the user;
[0040] The instruction parsing module is used to dynamically parse natural language instructions and generate structured task description vectors;
[0041] The model processing module includes a large model that has undergone parameter-efficient hybrid fine-tuning and is used for feature extraction and fusion of the input remote sensing images and task description vectors;
[0042] The pixel alignment module is used to perform pixel-level subspace alignment operations in the fusion feature space of a large model.
[0043] The segmentation and decoding module is used to generate a semantic segmentation mask based on the aligned fused features;
[0044] The self-training module manages the self-training loop, including generating pseudo-labels, creating instruction-segmentation mask paired samples, and updating model parameters.
[0045] The results output module is used to output the final semantic segmentation results.
[0046] Furthermore, the large model in the model processing module uses a visual Transformer as the backbone network of the visual encoder;
[0047] The language encoder uses a distilled BERT model;
[0048] The visual encoder and the language encoder are connected, and their parameters are fine-tuned through low-rank adaptation adapters and a zero-initialization gating adapter.
[0049] Furthermore, the self-training module includes:
[0050] The pseudo-label generation unit is used to generate predictions and calculate uncertainty estimates on unlabeled data;
[0051] The sample pairing unit is used to automatically generate corresponding natural language instructions for high-confidence pseudo-labels, forming new training samples;
[0052] The model update unit is used to incrementally train the large model using newly generated training samples.
[0053] In another aspect, a computer-readable storage medium is provided for the aforementioned remote sensing image semantic segmentation method based on a self-trained large model, wherein a computer program is stored thereon, which, when executed by a processor, implements the steps of the method.
[0054] Compared with existing technologies, this remote sensing image semantic segmentation method and system based on a self-trained large model has the following advantages:
[0055] First, this invention performs semantic decomposition and feature mapping on natural language instructions through a dynamic instruction parsing module, generating structured task description vectors. Combined with a large model built on a pre-trained visual encoder and a language encoder, it achieves parameter-efficient hybrid fine-tuning through low-rank adaptation and a zero-initialization gating adapter. This allows the model to accurately understand the user's dynamically changing fine-grained segmentation requirements without re-collecting labeled data and retraining, effectively adapting to segmentation tasks that exceed predefined land cover categories. This significantly improves the flexibility and practical applicability of remote sensing image semantic segmentation technology.
[0056] Second, this invention improves the accuracy of target feature recognition by using pixel-level subspace alignment operations and contrastive learning to bring the task description vector closer to the target pixel features and push away the non-target pixel features. At the same time, through a self-training loop mechanism, it makes full use of unlabeled remote sensing image data to generate high-confidence instruction-segmentation mask pairing samples, expands the training set and incrementally updates the model, realizing the co-evolution of the model's instruction understanding ability and segmentation ability, greatly reducing the dependence on manually labeled data, reducing the cost of model optimization, and further ensuring the stability and reliability of semantic segmentation results.
[0057] Other advantages, objectives and features of the invention will be set forth in part in the description which follows, and in part will be apparent to those skilled in the art from the following examination or study, or may be learned from the practice of the invention.
Description of the Drawings
[0058] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.
[0059] Figure 1 is a flowchart of the steps of the present invention;
[0060] Figure 2 is a system architecture diagram of the present invention;
[0061] Figure 3 is a diagram of the self-training loop mechanism of the present invention.
Detailed Implementation
[0062] To further illustrate the technical means and effects of the present invention in achieving its intended purpose, the following detailed description of the specific implementation methods, structures, features, and effects of the present invention, in conjunction with the accompanying drawings and preferred embodiments, is provided below.
[0063] Example 1
[0064] As shown in Figures 1 to 3, this basic embodiment corresponds to the remote sensing image semantic segmentation method based on a self-trained large model as described in claim 1 of this invention, and implements the entire process from instruction parsing to segmentation result output. Each step is described in detail below:
[0065] In this embodiment, step one, acquiring the remote sensing image to be segmented and the natural language command, is implemented as follows:
[0066] Specifically, the remote sensing images to be segmented can be high-resolution images acquired by satellite remote sensing platforms or aerial remote sensing equipment. The images contain land cover types covering common scenes such as urban buildings, vegetation, water bodies, roads, and industrial facilities, and the image data format is a standard format in the remote sensing field. The natural language commands input by the user are descriptive statements generated by the user according to the actual segmentation needs, and do not need to follow a specific format. Example commands include identifying farmland distributed along roads in suburban areas, extracting woodland around lakes, and segmenting small settlements in mountainous areas.
[0067] The data acquisition process is implemented through the data acquisition module, which can receive remote sensing image files and natural language command text uploaded by users through the communication interface, and can also read pre-stored remote sensing images and recorded command information from local storage devices to ensure the complete acquisition of image data and command information, providing basic data support for subsequent processing.
[0068] In this embodiment, step two, dynamically parsing the instruction to generate a structured task description vector, is implemented as follows:
[0069] Specifically, the core function of the dynamic instruction parsing module is to convert unstructured natural language instructions into structured feature vectors that the model can process. The implementation process is as follows:
[0070] First, a lightweight language model is used to semantically decompose natural language instructions. This lightweight language model is a small Transformer model pre-trained on remote sensing corpora, possessing efficient semantic understanding capabilities and low computational overhead. During the decomposition process, the model automatically identifies semantic sub-conditions in the instructions, which include three categories: spatial location, target object, and state condition. Spatial location refers to the relative or absolute location of the target feature; target object refers to the specific category of land cover to be segmented; and state condition refers to the attribute or state description of the target feature. For example, for the instruction to identify farmland distributed along a highway in a suburban area, the decomposition yields the spatial location as "distributed along a highway in a suburban area," the target object as "farmland," and the state condition as "natural growth state."
[0071] Secondly, each semantic sub-condition is converted into a feature vector through a learnable mapping network. The mapping network employs a multi-layer fully connected neural network structure, taking the text-encoded vectors of the semantic sub-conditions as input and outputting fixed-dimensional feature vectors. The network uses the ReLU activation function, and through training, the mapping network learns the correspondence between semantic sub-conditions and feature vectors, ensuring that the feature vectors accurately represent the core information of the semantic sub-conditions.
[0072] Finally, the feature vectors corresponding to all semantic sub-conditions are combined to form a structured task description vector. The combination method uses vector concatenation, which concatenates the feature vectors of each semantic sub-condition into a single vector in a preset order. The dimension of the concatenated task description vector is the sum of the dimensions of each sub-vector, and the dimension is fixed to ensure that it can match the input requirements of the subsequent large model.
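The mapping and concatenation stages described above can be illustrated with a minimal sketch. It assumes the lightweight language model has already produced one text-encoded vector per semantic sub-condition; the class `MappingNetwork`, the function `build_task_vector`, the layer sizes, the use of a single shared mapping network, and the random weights standing in for trained ones are all hypothetical choices made for illustration, not details taken from the invention.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


class MappingNetwork:
    """Two-layer fully connected network with ReLU (sizes are assumptions)."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for weights that would be learned in training.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
        self.b2 = [0.0] * out_dim

    def __call__(self, text_embedding):
        hidden = relu(linear(text_embedding, self.w1, self.b1))
        return linear(hidden, self.w2, self.b2)


# Preset concatenation order of the three sub-condition categories.
SUB_CONDITION_ORDER = ("spatial_location", "target_object", "state_condition")


def build_task_vector(sub_condition_embeddings, mapper):
    """Map each sub-condition to a fixed-size vector and concatenate in order."""
    task_vector = []
    for name in SUB_CONDITION_ORDER:
        task_vector.extend(mapper(sub_condition_embeddings[name]))
    return task_vector


# Placeholder text encodings (8 dimensions each) for the three sub-conditions.
rng = random.Random(1)
embeddings = {name: [rng.uniform(-1, 1) for _ in range(8)] for name in SUB_CONDITION_ORDER}
mapper = MappingNetwork(in_dim=8, hidden_dim=16, out_dim=4)
task_vector = build_task_vector(embeddings, mapper)
print(len(task_vector))  # 12 = 3 sub-conditions x 4 dimensions
```

Because the concatenation order and the per-sub-condition output size are fixed, the task description vector always has the same dimension regardless of how the instruction was worded.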
[0073] The purpose of this step is to transform ambiguous natural language instructions into precise structured features, enabling large models to accurately understand the user's segmentation needs and providing clear task guidance for subsequent feature fusion and pixel alignment.
[0074] In this embodiment, step three, processing by the large model with parameter-efficient hybrid fine-tuning, is implemented as follows:
[0075] Specifically, the large model is built on a pre-trained visual encoder and a language encoder. The model is adapted to the semantic segmentation task of remote sensing images through a parameter-efficient hybrid fine-tuning strategy. The implementation process is as follows:
[0076] The visual encoder uses a visual Transformer as its backbone network. This network segments remote sensing images into fixed-size image patches, embeds and encodes each patch, and then extracts multi-scale visual features of the images through a multi-head self-attention mechanism. The network layer number, number of attention heads, and other structural parameters of the visual Transformer adopt the default configuration of the pre-trained model to ensure that the general visual features learned in the pre-training stage can be fully utilized.
[0077] The language encoder employs a distilled BERT model, which reduces the number of model parameters and computational overhead while maintaining core semantic understanding capabilities through model distillation. The language encoder receives text-encoded information corresponding to the task description vector and extracts language features through a self-attention mechanism and a feedforward neural network, ensuring that language features are semantically consistent with visual features.
[0078] Parameter-efficient hybrid fine-tuning is achieved through low-rank adaptation and a zero-initialization gating adapter. First, low-rank adaptation adapters are injected into each attention layer of the visual encoder and language encoder. Each adapter consists of two low-rank matrices: the adapter's weight matrix is decomposed into the product of these two matrices, which reduces the number of parameters to be trained. During fine-tuning, only the parameters of the low-rank adaptation adapters are trained, while the backbone parameters of the visual encoder and language encoder remain frozen. This preserves the model's fitting capacity while reducing training cost and the risk of overfitting.
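As a rough illustration of the low-rank adapter idea, the sketch below wraps one frozen weight matrix with two small trainable matrices. The class name `LoRALinear`, the scaling factor `alpha / rank`, and the initialization (small random values for one matrix, zeros for the other) follow common low-rank adaptation practice and are assumptions here; the text does not specify them.

```python
import random


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update B @ A (a sketch)."""

    def __init__(self, frozen_weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # never updated during fine-tuning
        # A is rank x in_dim (small random values); B is out_dim x rank (zeros).
        # This initialization is an assumption borrowed from common practice.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.frozen_weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_parameter_count(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(8)]
layer = LoRALinear(W, rank=2)
x = [1.0] * 8
print(layer.trainable_parameter_count(), "trainable vs", 8 * 8, "frozen")
```

With an 8 x 8 frozen matrix and rank 2, the adapter trains 32 values instead of 64; the saving grows quickly with layer width, which is the motivation given in the text for training only the adapter parameters.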
[0079] Secondly, a zero-initialization gating adapter is introduced before fusing visual and linguistic features. This adapter consists of gating units and a linear transformation layer. The gating units use a sigmoid activation function to control the weights of feature fusion, and the linear transformation layer is used to adjust the feature dimensions. All initial weights of the zero-initialization gating adapter are set to zero to ensure that the fused features are consistent with the output of the pre-trained model in the early stages of training, avoiding performance fluctuations in the early stages of the model due to the introduction of the adapter. As training progresses, the adapter gradually learns the optimal feature fusion strategy.
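The behaviour of the zero-initialized gating adapter at the start of training can be checked with a small sketch. The additive (residual) fusion form, the class name `ZeroInitGatedAdapter`, and the omission of bias terms are assumptions made for illustration; the text only states that the adapter has sigmoid gating units and a linear transformation layer whose initial weights are all zero.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(matrix, x):
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


class ZeroInitGatedAdapter:
    """Gated fusion of language features into visual features (a sketch)."""

    def __init__(self, language_dim, visual_dim):
        # All adapter weights start at zero, as described in the text.
        self.gate_weight = [[0.0] * language_dim for _ in range(visual_dim)]
        self.proj_weight = [[0.0] * language_dim for _ in range(visual_dim)]

    def fuse(self, visual_feature, language_feature):
        # Gating units: sigmoid of a linear function of the language feature.
        gate = [sigmoid(z) for z in matvec(self.gate_weight, language_feature)]
        # Linear transformation layer: matches language and visual dimensions.
        projected = matvec(self.proj_weight, language_feature)
        # Assumed residual form: visual feature plus gated language contribution.
        return [v + g * p for v, g, p in zip(visual_feature, gate, projected)]


adapter = ZeroInitGatedAdapter(language_dim=3, visual_dim=4)
visual = [0.2, -1.0, 0.5, 3.0]
language = [1.0, 2.0, -0.5]
fused = adapter.fuse(visual, language)
print(fused == visual)  # True
```

Note that a sigmoid gate with zero weights outputs 0.5 rather than 0; in this sketch it is the zero-initialized linear transformation that makes the language contribution vanish, so the fused feature initially equals the pre-trained visual feature.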
[0080] The purpose of this step is to enable the pre-trained large model to quickly adapt to the semantic segmentation task of remote sensing images through efficient parameter fine-tuning, and to achieve effective extraction and preliminary fusion of visual and linguistic features without the need for large-scale training.
[0081] In this embodiment, step four, performing pixel-level subspace alignment, is implemented as follows:
[0082] Specifically, the purpose of pixel-level subspace alignment is to establish a correspondence between the task description vector and the target pixel features in the fused feature space, thereby improving the accuracy of segmentation. The implementation process is as follows:
[0083] First, the cosine similarity between the feature vector of each pixel in the fused feature map and the task description vector is calculated. The fused feature map is obtained by fusing visual features extracted by the visual encoder and linguistic features extracted by the language encoder through a zero-initialization gating adapter. The feature vector of each pixel is the high-dimensional feature representation of that position in the fused feature map. Before calculating the cosine similarity, both the pixel feature vector and the task description vector are L2 normalized to ensure consistent vector magnitudes. The formula for calculating the cosine similarity is:
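[0084] $$\mathrm{sim}(f_i, t) = \frac{f_i \cdot t}{\lVert f_i \rVert \, \lVert t \rVert}$$
[0085] where $f_i$ denotes the feature vector of a pixel in the fused feature map, $t$ denotes the task description vector, $\cdot$ denotes the vector dot product, and $\lVert f_i \rVert$ and $\lVert t \rVert$ denote the L2 norms of $f_i$ and $t$, respectively.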
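The per-pixel cosine similarity computation described in this step can be sketched as follows. The nested-list layout of the feature map (rows, then pixels, then feature channels) and the function names are assumptions chosen for readability; the small `eps` guard against zero-length vectors is also an addition not mentioned in the text.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def similarity_map(feature_map, task_vector):
    """Cosine similarity between every pixel feature and the task vector."""
    return [[cosine_similarity(pixel, task_vector) for pixel in row] for row in feature_map]


# A 2 x 2 fused feature map with 2-dimensional pixel features.
feature_map = [
    [[2.0, 0.0], [0.0, 3.0]],
    [[-1.0, 0.0], [1.0, 1.0]],
]
task_vector = [1.0, 0.0]
sim = similarity_map(feature_map, task_vector)
for row in sim:
    print([round(s, 4) for s in row])
```

Because both vectors are normalized first, the similarity depends only on direction: the pixel `[2.0, 0.0]` scores 1 against the task vector `[1.0, 0.0]` despite its larger magnitude.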
[0086] Secondly, a contrastive learning loss function is constructed. This function is optimized to bring the target pixel features closer to the task description vector and to push the non-target pixel features further away from the task description vector. The expression for the contrastive learning loss function is:
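[0087] $$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\left(\mathrm{sim}(f^{+}, t)/\tau\right)}{\sum_{i=1}^{N} \exp\left(\mathrm{sim}(f_i, t)/\tau\right)}$$
[0088] where $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, $t$ denotes the task description vector, $f^{+}$ denotes the feature vector of a target pixel, $f_i$ denotes the feature vector of the $i$-th pixel among all pixels in the fused feature map, $N$ denotes the total number of pixels, and $\tau$ is the temperature coefficient, which adjusts the smoothness of the similarity distribution and is set within a reasonable range. By maximizing the ratio between the similarity of the target pixel features to the task description vector and the similarities of all pixel features to it, this loss function enables the model to accurately distinguish target pixels from non-target pixels.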
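A loss of the InfoNCE form is consistent with the quantities named in this paragraph (target pixel features, all pixel features, pixel count, temperature), and the sketch below implements that form. It is an assumption that the loss is averaged over all target pixels when there is more than one; the function name, the default temperature of 0.1, and the log-sum-exp stabilization are likewise illustrative choices.

```python
import math


def contrastive_alignment_loss(similarities, is_target, temperature=0.1):
    """InfoNCE-style loss over per-pixel similarities to the task vector.

    similarities: cosine similarity of each pixel feature to the task vector.
    is_target:    True for pixels belonging to the target, False otherwise.
    """
    logits = [s / temperature for s in similarities]
    # Log of the denominator, computed with the log-sum-exp trick for stability.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    target_logits = [z for z, flag in zip(logits, is_target) if flag]
    if not target_logits:
        raise ValueError("at least one target pixel is required")
    # -log(exp(z_target) / denominator), averaged over target pixels (assumed).
    return sum(log_denominator - z for z in target_logits) / len(target_logits)


targets = [True, False, False]
aligned = contrastive_alignment_loss([0.9, 0.1, -0.2], targets)
misaligned = contrastive_alignment_loss([0.1, 0.9, -0.2], targets)
print(aligned < misaligned)  # True
```

The loss is small when the target pixel is the one most similar to the task description vector and large when a non-target pixel is, which is the pull-together and push-apart behaviour the step describes.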
[0089] The purpose of this step is to establish the relationship between task requirements and image pixels at the feature level, enabling the model to accurately locate target ground feature pixels that match the user's instructions, thus providing accurate feature support for subsequent segmentation mask generation.
[0090] In this embodiment, step five, generating the initial semantic segmentation result, is implemented as follows:
[0091] Specifically, the segmentation mask decoder adopts an encoding / decoding structure, consisting of a transposed convolutional layer, a convolutional layer, and an activation function. The transposed convolutional layer is used to progressively restore the resolution of the fused feature map, making it consistent with the size of the original remote sensing image; the convolutional layer is used to perform dimensionality compression and feature optimization on the feature map after resolution restoration, and the final output channel number matches the number of categories corresponding to the segmentation task; the activation function adopts the softmax function, which transforms the feature map into a probability distribution of each pixel belonging to the target category.
[0092] The initial semantic segmentation results are presented as a segmentation mask. The value of each pixel in the mask represents the confidence level that the pixel belongs to the target feature. The higher the confidence level, the more the pixel matches the target feature features described by the user command. For example, for the command to extract woodland around a lake, the pixels in the woodland area in the initial segmentation mask have a higher confidence level, while the pixels in other areas have a lower confidence level.
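The final softmax stage of the decoder, which turns per-pixel output channels into class probabilities, can be sketched as below. The transposed-convolution upsampling is omitted; the two-channel layout (`[background, target]`) and the function names are assumptions used only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's channel values."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_mask(logit_map):
    """Return per-pixel class probabilities and the most likely class index."""
    prob_map = [[softmax(pixel) for pixel in row] for row in logit_map]
    label_map = [
        [max(range(len(probs)), key=probs.__getitem__) for probs in row]
        for row in prob_map
    ]
    return prob_map, label_map


# 2 x 2 decoder output with two channels per pixel: [background, target].
logit_map = [
    [[2.0, 0.1], [0.3, 1.5]],
    [[0.0, 0.0], [-1.0, 4.0]],
]
prob_map, label_map = decode_mask(logit_map)
print(label_map)  # [[0, 1], [0, 1]]
```

The target-channel probability in `prob_map` plays the role of the per-pixel confidence value described above; ties, as in the pixel with equal logits, resolve to the first channel in this sketch.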
[0093] The purpose of this step is to convert the aligned fused features into intuitive segmentation results, providing an initial prediction basis for subsequent self-training cycles.
[0094] In this embodiment, step six, initiating the self-training loop to generate high-confidence paired samples, is implemented as follows:
[0095] Specifically, the core of the self-training loop is to expand the training samples with unlabeled data to improve model performance. The implementation process is as follows:
[0096] First, the large model, fine-tuned in step three, is applied to the unlabeled remote sensing image database for inference. The unlabeled remote sensing image database contains a large number of remote sensing images with scenes similar to the images to be segmented, covering various land cover combinations and environmental conditions to ensure sample diversity. The inference process is consistent with steps three through five, generating a predicted segmentation mask for each unlabeled image.
[0097] Secondly, the uncertainty estimate for each region in the predicted segmentation mask is calculated. The uncertainty estimate is based on the entropy value of the prediction distribution. For each region in the predicted segmentation mask, the entropy value of the predicted probability distribution of all pixels within that region is calculated. The entropy value is calculated using the following formula:
[0098] $$H = -\sum_{c=1}^{C} p_c \log p_c$$
[0099] where $C$ denotes the number of segmentation categories and $p_c$ denotes the predicted probability that the pixel belongs to the $c$-th class. A higher entropy value indicates greater uncertainty in the model's prediction of that region, while a lower entropy value indicates higher prediction confidence.
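The entropy-based uncertainty estimate can be sketched as follows. Averaging the per-pixel entropies to obtain a single value per region is an assumption, since the text does not state how pixel entropies are aggregated; the function names and the `(row, column)` representation of a region are also illustrative.

```python
import math


def pixel_entropy(probabilities):
    """Shannon entropy (natural log) of one pixel's class distribution."""
    # Zero-probability classes contribute nothing and are skipped.
    return sum(-p * math.log(p) for p in probabilities if p > 0.0)


def region_uncertainty(prob_map, region_pixels):
    """Mean entropy over the (row, col) pixels of a region (assumed aggregation)."""
    entropies = [pixel_entropy(prob_map[r][c]) for r, c in region_pixels]
    return sum(entropies) / len(entropies)


# One row of two pixels: a maximally uncertain pixel and a fully confident one.
prob_map = [[[0.5, 0.5], [1.0, 0.0]]]
uncertainty = region_uncertainty(prob_map, [(0, 0), (0, 1)])
print(round(uncertainty, 4))
```

For two classes the entropy ranges from 0 (a one-hot prediction) to ln 2 (a 50/50 prediction), so a region-level threshold can be interpreted on that scale.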
[0100] Then, high-confidence pseudo-labels are selected based on the uncertainty estimates and a multi-view consistency verification strategy. The preset confidence threshold is set within a reasonable range. First, predicted segmentation masks whose uncertainty is low enough to meet this threshold are selected, and multi-view consistency verification is then performed on these masks. The multi-view consistency verification strategy is implemented as follows: multiple data augmentation transformations, including random cropping, horizontal flipping, brightness adjustment, and contrast adjustment, are applied to the same unlabeled remote sensing image to generate 3 to 5 augmented views; each augmented view is input into the large model to obtain the corresponding prediction result; the degree of consistency among the prediction results is calculated using the intersection-over-union (IoU) as the consistency metric, where a higher IoU indicates better consistency; and predicted segmentation masks that meet the preset consistency standard are identified as high-confidence pseudo-labels.
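The two-stage filter described above (uncertainty first, then multi-view consistency) can be sketched as follows. The sketch assumes the predictions for the augmented views have already been mapped back to the original image frame, for example by undoing a flip or crop, so that masks are directly comparable. Using the mean pairwise IoU as the consistency score and the specific threshold values are hypothetical choices.

```python
from itertools import combinations


def iou(mask_a, mask_b):
    """Intersection-over-union of two binary masks given as nested lists."""
    intersection = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks agree perfectly.
    return intersection / union if union else 1.0


def multi_view_consistency(view_masks):
    """Mean pairwise IoU across the predictions for all augmented views."""
    pairs = list(combinations(view_masks, 2))
    return sum(iou(a, b) for a, b in pairs) / len(pairs)


def is_high_confidence(uncertainty, view_masks, max_uncertainty=0.3, min_consistency=0.8):
    """Accept a pseudo-label only if it passes both checks (thresholds assumed)."""
    if uncertainty >= max_uncertainty:
        return False
    return multi_view_consistency(view_masks) >= min_consistency


# Predictions for three augmented views, already aligned to the original frame.
views = [
    [[1, 1, 0], [0, 1, 0]],
    [[1, 1, 0], [0, 1, 0]],
    [[1, 1, 0], [0, 0, 0]],
]
print(round(multi_view_consistency(views), 4))
print(is_high_confidence(0.1, views))
```

In this example the third view disagrees on one pixel, so the mean pairwise IoU falls below the assumed 0.8 bar and the mask is rejected even though its uncertainty is low.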
[0101] Finally, a corresponding natural language description is automatically generated for each high-confidence pseudo-label. This natural language description generation is achieved through a language generation module. This module, based on the regional characteristics of the high-confidence pseudo-labels, including the region's spatial location, land cover type, and morphological features, combines these with a preset language template to automatically generate natural language instructions corresponding to the pseudo-labels, forming new instruction-segmentation mask pairing samples. For example, for a forest area corresponding to a certain high-confidence pseudo-label, the generated natural language instruction would be to identify forest areas distributed in a strip within a plain area.
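Template-based instruction generation of the kind described here can be sketched as below. The template wording, the bounding-box elongation rule used to pick a morphology phrase, and the `strip_ratio` value are hypothetical; the text only states that regional characteristics are combined with a preset language template. The sketch also assumes the mask contains at least one target pixel.

```python
def bounding_box(mask):
    """Return (top, bottom, left, right) of the nonzero pixels in a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return min(rows), max(rows), min(cols), max(cols)


def morphology_phrase(mask, strip_ratio=3.0):
    """Describe the region shape from its bounding-box elongation (assumed rule)."""
    top, bottom, left, right = bounding_box(mask)
    height, width = bottom - top + 1, right - left + 1
    elongation = max(height, width) / min(height, width)
    return "in a strip" if elongation >= strip_ratio else "in a compact patch"


# Hypothetical preset language template.
TEMPLATE = "Identify {land_cover} distributed {morphology} within {location}."


def generate_instruction(mask, land_cover, location):
    """Fill the template from the pseudo-label's regional characteristics."""
    return TEMPLATE.format(
        land_cover=land_cover,
        morphology=morphology_phrase(mask),
        location=location,
    )


mask = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
instruction = generate_instruction(mask, "forest areas", "a plain area")
print(instruction)
# Identify forest areas distributed in a strip within a plain area.
```

Pairing the generated `instruction` with `mask` yields one new instruction-segmentation mask sample for the expanded training set.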
[0102] The purpose of this step is to make full use of unlabeled data to generate high-quality training samples, providing a rich data source for model parameter updates, while also enabling automatic pairing of instructions and segmentation masks to reduce the cost of manual annotation.
[0103] In this embodiment, step seven, expanding the training set to update the model parameters, is implemented as follows:
[0104] Specifically, the new instruction-segmentation mask paired samples generated in step six are added to the original training set to form an expanded training set. The expanded training set contains the original labeled samples and the newly generated high-confidence paired samples, significantly increasing the number of samples and covering more scenarios and instruction types.
[0105] Model parameter updates employ incremental training, using an expanded training set to fine-tune the large model. During training, a mini-batch gradient descent optimization algorithm is used, with the learning rate set within a reasonable range. The number of training epochs is adjusted based on the sample size of the expanded training set to ensure the model can fully learn information from the new samples. During fine-tuning, the backbone network parameters of the visual encoder and language encoder are kept frozen; only the parameters of the low-rank adaptive adapter, the zero-initialization gated adapter, and the segmentation mask decoder are trained to prevent the model from forgetting knowledge from the original samples.
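The freezing rule in this step can be sketched as a partition of parameter names. The parameter names and the substring keys below are hypothetical; an actual model would expose its own naming scheme, and the same partition would then decide which parameters receive gradient updates.

```python
# Hypothetical substrings identifying the parts that stay trainable.
TRAINABLE_KEYS = ("lora_", "gated_adapter.", "mask_decoder.")

def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable, frozen = [], []
    for name in param_names:
        is_trainable = any(key in name for key in TRAINABLE_KEYS)
        (trainable if is_trainable else frozen).append(name)
    return trainable, frozen

# Illustrative names only; backbone weights carry no adapter substring.
param_names = [
    "visual_encoder.blocks.0.attn.qkv.weight",
    "visual_encoder.blocks.0.attn.lora_A",
    "language_encoder.layer.0.attention.lora_B",
    "gated_adapter.alpha",
    "mask_decoder.conv.weight",
]
trainable, frozen = split_parameters(param_names)
```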
[0106] Through incremental training, the model's instruction comprehension and segmentation capabilities co-evolve. On one hand, the model can learn the semantic features of new instructions, improving its ability to understand diverse natural language instructions; on the other hand, new segmentation mask samples enable the model to more accurately identify target objects in different scenarios, thus improving segmentation accuracy.
[0107] The purpose of this step is to continuously optimize the model's performance through a self-training loop, enabling the model to adapt to more diverse segmentation tasks and respond to complex natural language instructions without retraining.
[0108] In this embodiment, step eight, outputting the final semantic segmentation result, is implemented as follows:
[0109] Specifically, after the parameter update in step seven, the large model processes the remote sensing image to be segmented and the natural language instructions obtained in step one again, following the same processing flow as steps two through five. The final semantic segmentation output includes a visualized segmentation mask image and pixel-level classification results. Different land cover categories in the segmentation mask image are marked with different colors to intuitively present the distribution range of the target land cover. The pixel-level classification results are output in the form of a data file, recording the category label of each pixel for easy subsequent analysis and application.
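The two output forms can be sketched with a palette lookup. The three-class palette and the class indices are assumptions for illustration; the embodiment does not fix a color scheme or a file format.

```python
import numpy as np

# Hypothetical palette: one RGB color per category index.
PALETTE = np.array([
    [0, 0, 0],      # 0: background
    [255, 0, 0],    # 1: target land cover
    [0, 128, 0],    # 2: another land cover
], dtype=np.uint8)

def colorize(label_map):
    """Map an (H, W) array of category indices to an (H, W, 3) RGB mask image."""
    return PALETTE[label_map]

labels = np.array([[0, 1], [2, 1]])              # pixel-level classification result
rgb = colorize(labels)                           # visualized segmentation mask
target_mask = (labels == 1).astype(np.uint8)     # per-pixel "is target" labels
```

The `labels` array is what would be written to the data file, while `rgb` is what would be displayed.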
[0110] For example, for the instruction "identify industrial buildings located on the east side of a river in an urban area", the final segmentation mask image shows the industrial building area on the east side of the river marked with a specific color, while other areas are shown in different colors or rendered transparent. At the same time, the output classification result file records, for each pixel, a label indicating whether it belongs to the target industrial buildings.
[0111] The purpose of this step is to provide users with accurate and intuitive semantic segmentation results, meeting the needs of practical application scenarios such as land use monitoring and urban planning.
[0112] Optimized Embodiments
[0113] The following provides four optimized embodiments, each corresponding to the technical features described in claims 2 to 6 of this invention, to further refine and optimize the technical solutions of the basic embodiments, thereby improving model performance and practicality.
[0114] Optimized Embodiment 1: Optimization of dynamic instruction parsing. In step two of the basic embodiment, the dynamic instruction parsing module is optimized as follows:
[0115] The lightweight language model is pre-trained specifically on a remote sensing domain corpus. The pre-training corpus contains a large number of natural language instructions and land feature descriptions related to remote sensing imagery, enabling the model to more accurately identify semantic information unique to the remote sensing domain. For example, for instructions containing specialized land feature terminology, the model can accurately decompose them into semantic sub-conditions, avoiding parsing errors caused by misunderstood terminology.
[0116] The semantic sub-condition mapping network employs a 2-3 layer fully connected neural network, with the number of hidden layer neurons set within a reasonable range. A batch normalization layer is added after each fully connected layer to accelerate training convergence and improve the stability of the feature vectors. After combining the feature vectors, the combined task description vector is subjected to L2 normalization to ensure uniform vector magnitude and reduce errors in subsequent similarity calculations.
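The mapping network and the final normalization can be sketched in NumPy. The layer sizes, the ReLU between layers, the use of batch statistics without learned scale and shift, and the concatenation used to combine the sub-condition vectors are all assumptions; the embodiment only states that batch normalization follows each fully connected layer and that the combined vector is L2-normalized.

```python
import numpy as np

def batch_norm(h, eps=1e-5):
    """Normalize each feature over the batch axis (training-mode statistics)."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

def mapping_network(x, layers):
    """Fully connected layers, each followed by batch normalization."""
    for i, (weight, bias) in enumerate(layers):
        x = batch_norm(x @ weight + bias)
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)  # assumed ReLU between layers
    return x

def l2_normalize(v, eps=1e-12):
    """Scale each row to unit L2 norm."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    return v / np.maximum(norm, eps)

rng = np.random.default_rng(0)
batch, in_dim, hidden, out_dim = 4, 16, 32, 8   # placeholder sizes
layers = [
    (rng.normal(size=(in_dim, hidden)) * 0.1, np.zeros(hidden)),
    (rng.normal(size=(hidden, out_dim)) * 0.1, np.zeros(out_dim)),
]
names = ("spatial_location", "target_object", "state_condition")
embeddings = {name: rng.normal(size=(batch, in_dim)) for name in names}

# Map each sub-condition, concatenate, then L2-normalize the combined vector.
features = [mapping_network(embeddings[name], layers) for name in names]
task_vectors = l2_normalize(np.concatenate(features, axis=-1))
```

After normalization every task description vector has unit length, so later cosine similarities depend only on direction.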
[0117] The purpose of this optimized implementation is to improve the accuracy and stability of dynamic instruction parsing, enabling the task description vector to more accurately represent the semantics of natural language instructions, and providing a more reliable foundation for subsequent feature fusion and pixel alignment.
[0118] Optimized Embodiment 2: Optimization of efficient hybrid parameter fine-tuning. In step three of the basic embodiment, efficient hybrid parameter fine-tuning is optimized as follows:
[0119] The low-rank adaptive adapter employs a dynamic adjustment strategy for its low-rank matrix. A smaller rank is set in the early stages of training to allow the model to quickly adapt to the task; as training progresses, the rank is gradually increased to improve the adapter's expressive power. The low-rank matrix is decomposed using singular value decomposition to ensure that the decomposed matrix accurately approximates the functionality of the original adapter matrix.
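One way to picture the growing rank together with the singular value decomposition is the sketch below. The linear schedule, the rank bounds of 2 and 8, and the random matrix standing in for an adapter update are assumptions for illustration only.

```python
import numpy as np

def rank_schedule(step, total_steps, r_min=2, r_max=8):
    """Grow the rank linearly from r_min to r_max over training (assumed schedule)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return int(round(r_min + frac * (r_max - r_min)))

def truncated_svd_factors(delta_w, rank):
    """Return B (d_out x r) and A (r x d_in) such that B @ A is the best rank-r approximation."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # scale the leading left singular vectors
    a = vt[:rank, :]
    return b, a

rng = np.random.default_rng(0)
delta_w = rng.normal(size=(16, 12))   # stand-in for an adapter weight update

ranks, errors = [], []
for step in (0, 5, 9):
    r = rank_schedule(step, total_steps=10)
    b, a = truncated_svd_factors(delta_w, r)
    ranks.append(r)
    errors.append(float(np.linalg.norm(delta_w - b @ a)))
```

As the rank grows, the approximation error shrinks, which matches the stated goal of increasing the adapter's expressive power later in training.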
[0120] The zero-initialization gating adapter employs a dual-gating structure, comprising an activation gate and a reset gate. The activation gate controls the fusion ratio of visual and linguistic features, while the reset gate adjusts the update speed of feature fusion weights. This dual-gating structure makes feature fusion more flexible, enabling adaptive adjustments to the fusion strategy based on different instructions and image features, further enhancing the effectiveness of feature fusion.
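The embodiment does not give formulas for the two gates, so the following is only one possible reading. It assumes a zero-initialized activation gate passed through `tanh`, so that fusion starts as an identity on the visual features, and a sigmoid reset gate computed from both modalities that modulates how much of the linguistic feature enters the update. The class name `DualGateFusion` and all dimensions are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DualGateFusion:
    """Assumed dual-gate fusion: visual + tanh(alpha) * (reset * language)."""

    def __init__(self, dim, rng):
        self.alpha = np.zeros(dim)   # activation gate parameter, zero-initialized
        self.w_reset = rng.normal(size=(2 * dim, dim)) * 0.02
        self.b_reset = np.zeros(dim)

    def __call__(self, visual, language):
        joint = np.concatenate([visual, language], axis=-1)
        reset = sigmoid(joint @ self.w_reset + self.b_reset)   # per-feature modulation
        activation = np.tanh(self.alpha)                        # fusion ratio, 0 at initialization
        return visual + activation * (reset * language)

rng = np.random.default_rng(0)
fusion = DualGateFusion(dim=8, rng=rng)
visual = rng.normal(size=(5, 8))      # 5 pixel features
language = rng.normal(size=(5, 8))    # language feature broadcast to each pixel

out_init = fusion(visual, language)   # equals the visual features at initialization
fusion.alpha = np.full(8, 0.5)        # pretend training has opened the gate
out_trained = fusion(visual, language)
```

Zero initialization means the pre-trained visual pathway is undisturbed at the start of fine-tuning, and linguistic information is mixed in only as the gate parameter moves away from zero.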
[0121] The purpose of this optimized implementation is to improve the flexibility and adaptability of efficient hybrid parameter fine-tuning, enabling large models to adapt to remote sensing image semantic segmentation tasks more quickly and improving the quality of feature extraction and fusion.
[0122] Optimized Embodiment 3: Optimization of pixel-level subspace alignment. In step four of the basic embodiment, pixel-level subspace alignment is optimized as follows:
[0123] When calculating cosine similarity, a weighted cosine similarity is used, assigning different weights to different semantic sub-conditional feature vectors in the task description vector. For example, the feature vector corresponding to the target object has a higher weight than the spatial location and state conditions, ensuring that the model prioritizes feature matching of core target features. The weights are automatically learned through training, with the initial weights set to equal values.
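A weighted cosine similarity of this kind can be sketched as below. It assumes the task description vector is a concatenation of three equal-length segments in the order (spatial location, target object, state condition) and that one non-negative weight is shared by all dimensions of a segment; both are assumptions, since the exact form is not given here.

```python
import numpy as np

def weighted_cosine(pixels, task, segment_weights, segment_dim):
    """Weighted cosine similarity between each pixel feature and the task vector.

    pixels: (N, D) array, task: (D,) array, D = len(segment_weights) * segment_dim.
    """
    w = np.repeat(np.asarray(segment_weights, dtype=float), segment_dim)
    num = (pixels * w) @ task
    den = np.sqrt((w * pixels ** 2).sum(axis=-1)) * np.sqrt((w * task ** 2).sum())
    return num / np.maximum(den, 1e-12)

rng = np.random.default_rng(0)
segment_dim = 4
pixels = rng.normal(size=(6, 3 * segment_dim))
task = rng.normal(size=3 * segment_dim)

# Equal initial weights reduce to the ordinary cosine similarity.
equal = weighted_cosine(pixels, task, [1.0, 1.0, 1.0], segment_dim)
# A larger weight on the target-object segment (placeholder value 2.0).
target_heavy = weighted_cosine(pixels, task, [1.0, 2.0, 1.0], segment_dim)
```

Starting from equal weights therefore leaves the basic embodiment's similarity unchanged, and training can then shift emphasis toward the target-object segment.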
[0124] A hard negative sample mining strategy is introduced into the contrastive learning loss function. When calculating the loss, only non-target pixels with high similarity to the task description vector are selected as hard negative samples, thereby improving the optimization efficiency of the loss function. Hard negative sample mining is achieved by selecting the top K non-target pixels with the highest similarity. The value of K is set to a reasonable range to ensure that the mined hard negative samples can effectively improve the model's discriminative ability.
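Top-K hard negative mining can be sketched with an InfoNCE-style loss. The loss form, the temperature of 0.1, and K = 2 are assumptions; the embodiment only states that the K most similar non-target pixels are used as negatives. The sketch assumes at least one target and one non-target pixel.

```python
import numpy as np

def hard_negative_contrastive_loss(sim, target_mask, k, temperature=0.1):
    """Contrastive loss using only the top-K most similar non-target pixels as negatives.

    sim: (N,) similarity of each pixel to the task description vector.
    target_mask: (N,) boolean array, True for target pixels.
    """
    pos = sim[target_mask]
    neg = sim[~target_mask]
    k = min(k, neg.size)
    hard_neg = np.sort(neg)[-k:]                       # K hardest negatives
    neg_term = np.exp(hard_neg / temperature).sum()
    pos_term = np.exp(pos / temperature)
    loss = float(np.mean(-np.log(pos_term / (pos_term + neg_term))))
    return loss, hard_neg

sim = np.array([0.9, 0.8, 0.7, 0.1, -0.2, 0.6])
target_mask = np.array([True, True, False, False, False, False])
loss, hard_neg = hard_negative_contrastive_loss(sim, target_mask, k=2)
```

Here the two non-target pixels with similarities 0.7 and 0.6 are mined, while the easy negatives at 0.1 and -0.2 do not contribute to the loss.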
[0125] The purpose of this optimized implementation is to enhance the accuracy of pixel-level subspace alignment, enabling the model to more clearly distinguish between target pixels and non-target pixels, thereby further improving the accuracy of the segmentation results.
[0126] Optimized Embodiment 4: Optimization of the self-training loop. In step six of the basic embodiment, the self-training loop is optimized as follows:
[0127] Uncertainty estimation is combined with the Monte Carlo dropout method. A dropout layer is added to the encoder layer of the large model, and the dropout layer is kept active during training and inference. Multiple forward propagations are performed on the same unlabeled image to obtain multiple prediction results. The variance of the multiple prediction results is calculated as the uncertainty estimate. The larger the variance, the higher the uncertainty of the model's prediction for that region. Combining it with entropy estimation can more comprehensively assess the prediction confidence.
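Monte Carlo dropout can be illustrated with a toy per-pixel classifier. The linear model, the dropout probability of 0.1, and the 20 forward passes are assumptions standing in for the large model and its settings; only the variance-plus-entropy idea is taken from the paragraph above.

```python
import numpy as np

def mc_dropout_uncertainty(features, weights, n_passes=20, drop_prob=0.1, seed=0):
    """Per-pixel mean probability, variance, and entropy over stochastic forward passes.

    features: (H, W, C) pixel features, weights: (C,) toy classifier weights.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        keep = rng.random(features.shape) >= drop_prob      # dropout stays active at inference
        dropped = features * keep / (1.0 - drop_prob)       # inverted dropout scaling
        logits = dropped @ weights
        probs.append(1.0 / (1.0 + np.exp(-logits)))
    probs = np.stack(probs)                                 # (n_passes, H, W)
    mean_prob = probs.mean(axis=0)
    variance = probs.var(axis=0)                            # larger variance, higher uncertainty
    p = np.clip(mean_prob, 1e-7, 1.0 - 1e-7)
    entropy = -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))
    return mean_prob, variance, entropy

rng = np.random.default_rng(1)
features = rng.normal(size=(8, 8, 4))
weights = rng.normal(size=4)
mean_prob, variance, entropy = mc_dropout_uncertainty(features, weights)
```

Variance captures disagreement between passes, while entropy captures how close the averaged prediction is to an even split, so the two measures complement each other.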
[0128] The consistency measure for multi-view consistency verification uses a weighted combination of IoU and Kappa coefficient. IoU measures the overlap of predicted regions, while Kappa measures the consistency of classification results; both are weighted at 0.5. For high-confidence screening, a weighted scoring mechanism is used, with the uncertainty estimate weighted at 0.4 and the consistency measure weighted at 0.6. Predicted segmentation masks with a weighted score higher than a preset threshold are identified as high-confidence pseudo-labels, improving the accuracy of pseudo-label screening.
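The weighted scoring can be sketched as follows. The 0.5/0.5 and 0.4/0.6 weights come from the paragraph above; the conversion of uncertainty to a certainty term as `1 - uncertainty` (with uncertainty scaled to [0, 1]) is an assumption made so that a higher score means higher confidence, and the masks are assumed binary.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two binary masks (1.0 if both are empty)."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum() / union)

def cohen_kappa(a, b):
    """Cohen's kappa for two binary masks."""
    a, b = a.astype(bool).ravel(), b.astype(bool).ravel()
    po = np.mean(a == b)                        # observed agreement
    pa, pb = a.mean(), b.mean()
    pe = pa * pb + (1 - pa) * (1 - pb)          # chance agreement
    if np.isclose(pe, 1.0):
        return 1.0
    return float((po - pe) / (1 - pe))

def confidence_score(mask_a, mask_b, uncertainty):
    """0.4 * certainty + 0.6 * consistency, with consistency = 0.5 * IoU + 0.5 * Kappa."""
    consistency = 0.5 * iou(mask_a, mask_b) + 0.5 * cohen_kappa(mask_a, mask_b)
    return 0.4 * (1.0 - uncertainty) + 0.6 * consistency

view_a = np.array([[1, 1], [0, 0]])
view_b = np.array([[1, 0], [0, 0]])
score = confidence_score(view_a, view_b, uncertainty=0.2)
```

For this pair both IoU and Kappa equal 0.5, so the score is 0.4 × 0.8 + 0.6 × 0.5 = 0.62, which would then be compared with the preset threshold.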
[0129] The purpose of this optimized implementation is to improve the quality of high-confidence pseudo-labels, ensure the reliability of the expanded training set, thereby improving the effect of model parameter updates and enabling the model's instruction understanding and segmentation capabilities to evolve more stably.
[0130] In this embodiment, the system implementation corresponds to the remote sensing image semantic segmentation system based on a self-trained large model as described in claims 7 to 9 of the present invention. This system is applicable to the methods described in the above basic and optimized embodiments. The specific implementation of each module is described in detail below:
[0131] Data Acquisition Module: The data acquisition module includes a communication interface and a storage interface. The communication interface supports both wired and wireless communication and can receive remote sensing image files and natural language instruction text uploaded by users through terminal devices. Supported image file formats are standard formats in the remote sensing field, and instruction text is in plain text format. The storage interface can read remote sensing image data and instruction information pre-stored in local storage devices, including conventional storage media such as hard drives and USB flash drives.
[0132] The core function of the data acquisition module is to ensure the complete acquisition and transmission of remote sensing images and natural language instructions, and to transmit the acquired data to the instruction parsing module to provide data support for subsequent processing.
[0133] Instruction parsing module: The instruction parsing module is a processor that runs a dynamic instruction parsing program, which implements the dynamic instruction parsing function described in step two of the above method embodiments and in optimized embodiment 1. The instruction parsing module receives natural language instructions transmitted by the data acquisition module, decomposes semantic sub-conditions through a lightweight language model, transforms feature vectors using a mapping network, combines them to generate a structured task description vector, and transmits the task description vector to the model processing module.
[0134] Model processing module: The model processing module includes a processor and a storage unit storing the large model parameters. The large model uses a visual Transformer as the backbone network of the visual encoder and a distilled BERT model as the language encoder. The visual encoder and the language encoder are connected and fine-tuned through a low-rank adaptive adapter and a zero-initialization gating adapter, consistent with the large model structure described in step three of the above method embodiments and in optimized embodiment 2.
[0135] The model processing module receives remote sensing images transmitted by the data acquisition module and task description vectors transmitted by the instruction parsing module. It extracts visual features from the remote sensing images through a visual encoder and linguistic features from the task description vectors through a language encoder. It then fuses the visual and linguistic features through a zero-initialization gating adapter and transmits the fused features to the pixel alignment module.
[0136] Pixel Alignment Module: The pixel alignment module is a processor that runs a pixel-level subspace alignment program. This program implements the pixel-level subspace alignment function described in step four of the above method embodiments and in optimized embodiment 3. The pixel alignment module receives the fused features transmitted by the model processing module, calculates the similarity between each pixel feature vector and the task description vector, optimizes the feature alignment effect through a contrastive learning loss function, and transmits the aligned fused features to the segmentation decoding module.
[0137] Segmentation Decoding Module: The segmentation decoding module is a processor that runs a segmentation mask decoding program, which implements the segmentation mask generation function described in step five of the above method embodiments. The segmentation decoding module receives the aligned fused features transmitted by the pixel alignment module, processes them through a transposed convolutional layer, a convolutional layer, and an activation function to generate an initial semantic segmentation result, and transmits the initial semantic segmentation result to the self-training module and the results output module.
[0138] Self-training module: The self-training module includes a pseudo-label generation unit, a sample pairing unit, and a model update unit, all of which are implemented by the processor running the corresponding programs.
[0139] The pseudo-label generation unit receives the large model parameters transmitted by the model processing module and image data from the unlabeled remote sensing image library, and implements the pseudo-label generation and uncertainty estimation functions described in step six of the above method embodiments and in optimized embodiment 4, generating predicted segmentation masks and selecting high-confidence pseudo-labels.
[0140] The sample pairing unit receives high-confidence pseudo-labels transmitted by the pseudo-label generation unit, and automatically generates corresponding natural language instructions through the language generation program to form instruction-segmentation mask paired samples.
[0141] The model update unit receives new paired samples transmitted by the sample pairing unit, adds them to the original training set, implements the incremental training of the model as described in step seven of the above method embodiment, updates the large model parameters and stores them in the storage unit of the model processing module.
[0142] Results Output Module: The results output module includes a display interface and a data interface. The display interface can present the final semantic segmentation results to the user in the form of a visual segmentation mask image and supports connection to conventional display devices. The data interface can output the pixel-level classification results in the form of a data file, which supports storage on storage devices or other system calls, providing data support for subsequent applications.
[0143] This invention converts natural language instructions into structured task description vectors through a dynamic instruction parsing module, enabling the model to respond to dynamically changing segmentation requirements and handle segmentation tasks beyond predefined categories without retraining, thus improving the flexibility and applicability of the technology. Through an efficient parameter hybrid fine-tuning strategy, it reduces training costs while ensuring that the pre-trained large model can quickly adapt to remote sensing image semantic segmentation tasks. Pixel-level subspace alignment and contrastive learning improve the accuracy of target feature recognition. By fully utilizing unlabeled data through a self-training loop, it achieves the co-evolution of the model's instruction understanding and segmentation capabilities, reducing manual annotation costs.
[0144] The technical solution of this invention can stably achieve semantic segmentation of remote sensing images, and is applicable to multiple fields such as land use monitoring, environmental change detection, and urban planning, and has practical application value.
[0145] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications or alterations to the above-disclosed technical content to create equivalent embodiments without departing from the scope of the present invention. Any simple modifications, equivalent changes and alterations made to the above embodiments based on the technical essence of the present invention without departing from the scope of the present invention shall still fall within the scope of the present invention.
Claims
1. A semantic segmentation method for remote sensing images based on a self-trained large model, characterized in that it includes the following steps: Step one: acquire the remote sensing image to be segmented and the natural language instructions input by the user; Step two: use the dynamic instruction parsing module to parse the natural language instructions and generate a structured task description vector; Step three: input the remote sensing image and the task description vector into a large model that has been fine-tuned by efficient parameter mixing, wherein the large model is built on a pre-trained visual encoder and a language encoder and is fine-tuned through a low-rank adaptive adapter and a zero-initialization gating adapter; Step four: perform pixel-level subspace alignment in the fusion feature space of the large model, wherein, through contrastive learning, the distance between the task description vector and the corresponding target pixel features is shortened, while the distance between the task description vector and the non-target pixel features is widened; Step five: based on the aligned fused features, generate the initial semantic segmentation result using a segmentation mask decoder; Step six: start the self-training loop, perform inference with the large model on unlabeled remote sensing image data, filter out high-confidence segmentation results, and automatically generate corresponding instruction-segmentation mask paired samples; Step seven: expand the training set using the newly generated instruction-segmentation mask paired samples, update the parameters of the large model, and achieve the co-evolution of instruction understanding ability and segmentation ability; Step eight: output the final optimized semantic segmentation result of the remote sensing image.
2. The remote sensing image semantic segmentation method based on a self-trained large model according to claim 1, characterized in that the dynamic instruction parsing module in step two parses the natural language instructions as follows: a lightweight language model is used to decompose the natural language instructions into multiple semantic sub-conditions, which include spatial location, target object, and state condition; each semantic sub-condition is transformed into a corresponding feature vector through a learnable mapping network; and the feature vectors of all semantic sub-conditions are combined into a structured task description vector.
3. The remote sensing image semantic segmentation method based on a self-trained large model according to claim 1, characterized in that the efficient hybrid parameter fine-tuning in step three specifically involves: injecting low-rank adaptive adapters into the attention layers of the pre-trained visual encoder and language encoder; training only the parameters of the low-rank adaptive adapters; and, before fusing visual and linguistic features, introducing a zero-initialization gating adapter, which is initialized to zero.
4. The remote sensing image semantic segmentation method based on a self-trained large model according to claim 1, characterized in that the pixel-level subspace alignment in step four specifically includes: calculating the cosine similarity between the feature vector of each pixel in the fused feature map and the task description vector; and constructing a contrastive learning loss function that maximizes the similarity between target pixel features and the task description vector while minimizing the similarity between non-target pixel features and the task description vector.
5. The remote sensing image semantic segmentation method based on a self-trained large model according to claim 1, characterized in that initiating the self-training loop in step six specifically includes: performing inference with the large model on an unlabeled remote sensing image library to generate predicted segmentation masks; calculating an uncertainty estimate for each region in the predicted segmentation masks; based on the uncertainty estimates and the multi-view consistency verification strategy, selecting predicted segmentation masks with a confidence level higher than the preset threshold as high-confidence pseudo-labels; and, for each high-confidence pseudo-label, automatically generating a corresponding natural language description to form a new instruction-segmentation mask paired sample.
6. The remote sensing image semantic segmentation method based on a self-trained large model according to claim 5, characterized in that the multi-view consistency verification strategy specifically includes: applying multiple different data augmentation transformations to the same unlabeled remote sensing image to generate multiple augmented views; inputting the augmented views into the large model to obtain multiple prediction results; calculating the degree of consistency among the prediction results; and using the degree of consistency as one of the indicators for evaluating the confidence level of pseudo-labels.
7. A remote sensing image semantic segmentation system based on a self-trained large model, applicable to the remote sensing image semantic segmentation method based on a self-trained large model as described in any one of claims 1 to 6, characterized in that it includes: a data acquisition module, used to acquire the remote sensing image to be segmented and the natural language instructions input by the user; an instruction parsing module, used to dynamically parse the natural language instructions and generate a structured task description vector; a model processing module, which includes a large model that has been fine-tuned by efficient parameter mixing and is used for feature extraction and fusion of the input remote sensing image and task description vector; a pixel alignment module, used to perform pixel-level subspace alignment in the fusion feature space of the large model; a segmentation decoding module, used to generate a semantic segmentation mask based on the aligned fused features; a self-training module, which manages the self-training loop, including generating pseudo-labels, creating instruction-segmentation mask paired samples, and updating model parameters; and a results output module, used to output the final semantic segmentation result.
8. The remote sensing image semantic segmentation system based on a self-trained large model according to claim 7, characterized in that the large model in the model processing module uses a visual Transformer as the backbone network of the visual encoder; the language encoder uses a distilled BERT model; and the visual encoder and the language encoder are connected and their parameters are fine-tuned through a low-rank adaptive adapter and a zero-initialization gating adapter.
9. The remote sensing image semantic segmentation method and system based on a self-trained large model according to claim 1, characterized in that the self-training module includes: a pseudo-label generation unit, used to generate predictions and calculate uncertainty estimates on unlabeled data; a sample pairing unit, used to automatically generate corresponding natural language instructions for high-confidence pseudo-labels, forming new training samples; and a model update unit, used to incrementally train the large model using the newly generated training samples.
10. A computer-readable storage medium, applicable to the remote sensing image semantic segmentation method based on a self-trained large model as described in any one of claims 1 to 6, wherein a computer program is stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method.