A visual language processing method for remote sensing images and related equipment
By using a hierarchical Vision Transformer visual encoder and a vision-language standardized projection mechanism, combined with remote sensing-specific labeling rules, multimodal joint representation and reasoning of remote sensing images are achieved. This solves the problems of deep semantic understanding and multi-turn dialogue analysis of remote sensing images, and improves the accuracy and flexibility of remote sensing image analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA CENT FOR RESOURCES SATELLITE DATA & APPL
- Filing Date
- 2026-05-09
- Publication Date
- 2026-06-26
AI Technical Summary
Existing remote sensing data services cannot effectively understand the deep semantics of remote sensing images, especially when dealing with multi-scale targets and differences in technical terms. The accuracy of semantic understanding is low, and there is a lack of multi-turn conversational analysis capabilities.
A hierarchical Vision Transformer visual encoder is used for multi-scale feature extraction. Modality alignment is achieved through a vision-language normalization projection mechanism. Language sequences are embedded using remote sensing-specific labeling rules to construct a multimodal joint representation, which is then input into a language model for inference.
It improves the semantic understanding capabilities of remote sensing images, enabling accurate identification and analysis of targets at different scales, and supports multi-turn dialogue-based analysis, making it applicable to fields such as agricultural monitoring, urban planning, and environmental protection.
Figure CN122289710A_ABST
Description
Technical Field
[0001] This specification relates to the fields of artificial intelligence and remote sensing image processing technology. More specifically, this application relates to a visual language processing method and related equipment for remote sensing images.
Background Technology
[0002] With the rapid development of remote sensing technology, the output of remote sensing data is constantly increasing, but existing remote sensing data services still face many limitations.
[0003] Traditional metadata-based keyword retrieval methods cannot effectively understand the deep semantics of remote sensing images and cannot meet the needs of users in different application scenarios. General-purpose multimodal models perform poorly in the field of remote sensing, especially when dealing with differences in technical terminology within remote sensing images; low semantic understanding accuracy negatively impacts the performance of multimodal models. Remote sensing images typically contain multi-scale targets, and existing models struggle to effectively handle remote sensing data of different resolutions and time series. Most existing remote sensing data analysis systems only support single queries and lack multi-turn conversational analysis capabilities.
[0004] Therefore, there is an urgent need for a visual language processing method and related equipment for remote sensing images that solves at least some of the above problems.
Summary of the Invention
[0005] The summary section introduces a series of simplified concepts, which will be further explained in detail in the detailed description section. This summary section is not intended to limit the key features and essential technical features of the claimed technical solution, nor is it intended to determine the scope of protection of the claimed technical solution.
[0006] Firstly, this application proposes a visual language processing method for remote sensing images, comprising:
[0007] Acquire input remote sensing image data and text commands;
[0008] Multi-scale feature extraction is performed on the above remote sensing images using a hierarchical Vision Transformer visual encoder to obtain visual features that include local detail features and global semantic features.
[0009] The visual features mentioned above are mapped to the feature representation space of the language model through a visual-language normalization projection mechanism to achieve alignment between the visual modality and the language modality and obtain the aligned visual features.
[0010] Based on the preset remote sensing-specific labeling rules, the aligned visual features are embedded into the language sequence through label injection and feature replacement to construct a multimodal joint representation;
[0011] The above multimodal joint representation is input into the language model for inference, generating the remote sensing image semantic understanding result corresponding to the above text command.
[0012] In one feasible implementation, the hierarchical Vision Transformer visual encoder performs multi-scale feature extraction on the remote sensing image to obtain visual features containing local detail features and global semantic features, including:
[0013] The aforementioned remote sensing images are divided into image region blocks of different scales according to a preset resolution rule;
[0014] Feature encoding is performed on image regions of different scales to obtain local and global visual features at the corresponding scales;
[0015] By fusing visual features at different scales through a cross-level attention mechanism, a unified visual feature representation is generated.
[0016] In one feasible implementation, the visual features are mapped to the feature representation space of the language model through the visual-language normalization projection mechanism to achieve alignment between the visual modality and the language modality, and to obtain the aligned visual features, including:
[0017] The above visual features are normalized to reduce the difference in feature distribution between the visual modality and the language modality;
[0018] The normalized visual features are converted into feature representations consistent with the hidden feature dimensions of the language model through a nonlinear mapping structure.
[0019] The mapped visual features are enhanced by applying a hybrid activation function to obtain the aligned visual features.
[0020] In one feasible implementation, based on preset remote sensing-specific labeling rules, the aligned visual features are embedded into the language sequence via label injection and feature replacement to construct a multimodal joint representation, including:
[0021] Define basic tags for representing the overall semantics of remote sensing images, and extended tags for representing local patches, temporal context, and spatial context of remote sensing images;
[0022] Based on the semantic complexity of the text instructions, the usage type and quantity of the above-mentioned basic and extended tags are dynamically determined, and the determined remote sensing-specific tags are injected into the corresponding positions in the language sequence.
[0023] During the forward propagation of the model, the position of the aforementioned remote sensing-specific markers in the language sequence is identified, and the corresponding visual feature vectors are extracted from the visual features based on the image region, temporal information, or spatial context corresponding to the markers.
[0024] By performing a feature replacement operation, the language features at the corresponding marked positions in the above language sequence are replaced with the extracted visual feature vectors, so as to achieve the fusion of visual features and language sequence and form the above multimodal joint representation.
[0025] In one feasible implementation, inputting the above multimodal joint representation into the language model for inference to generate the remote sensing image semantic understanding result corresponding to the above text instruction includes:
[0026] The above multimodal joint representation is used as the input sequence of the language model;
[0027] Semantic reasoning and context modeling are performed on the above multimodal joint representation based on the language model;
[0028] Output remote sensing image understanding results that are semantically consistent with the above text instructions.
[0029] In one feasible implementation, the training steps of the hierarchical Vision Transformer visual encoder described above include:
[0030] A hierarchical Vision Transformer architecture is pre-trained on unlabeled remote sensing images, employing a masked autoencoder objective and a rotating variable window attention mechanism to enhance the model's ability to capture multi-scale features of remote sensing images.
[0031] Contrastive learning is performed on large-scale remote sensing image-text pairs. The visual encoder is frozen, and only the visual-linguistic projection layer is trained. The alignment quality of multimodal representations is gradually improved through a progressive alignment strategy.
[0032] Supervised fine-tuning is performed on remote sensing instruction-response pairs, combined with LoRA low-rank adaptation and DeepSpeed ZeRO-3 optimization techniques to reduce the number of trainable parameters and improve training efficiency.
[0033] In one feasible implementation, the above method further includes:
[0034] The processing size is automatically adjusted according to the characteristics of the input image to ensure that the multi-scale feature extraction of the hierarchical Vision Transformer is adapted without losing key information.
[0035] The computational resources are dynamically adjusted based on the complexity of the image content to optimize the visual feature extraction process.
[0036] By employing a layered processing strategy, thumbnails are generated for large images, with key regions being processed at high resolution first, while non-key regions are processed at low resolution.
[0037] Secondly, the present invention also proposes a visual language processing system for remote sensing images, comprising:
[0038] The acquisition unit is used to acquire input remote sensing image data and text commands;
[0039] The extraction unit is used to perform multi-scale feature extraction on the above remote sensing images through a hierarchical Vision Transformer visual encoder to obtain visual features that include local detail features and global semantic features.
[0040] The mapping unit is used to map the above visual features to the feature representation space of the language model through a visual-language normalization projection mechanism, so as to align the visual modality with the language modality and obtain the aligned visual features.
[0041] The embedding unit is used to embed the aligned visual features into the language sequence by means of label injection and feature replacement based on the preset remote sensing-specific labeling rules, so as to construct a multimodal joint representation;
[0042] The inference unit is used to input the above multimodal joint representation into the language model for inference and generate the remote sensing image semantic understanding result corresponding to the above text instructions.
[0043] Thirdly, the present invention also proposes an electronic device comprising: a memory and a processor, characterized in that the processor is configured to execute a computer program stored in the memory to implement the steps of the visual language processing method for remote sensing images as described in any of the first aspects.
[0044] Fourthly, the present invention also provides a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the visual language processing method for remote sensing images as described in any of the first aspects.
[0045] In summary, this method employs a hierarchical Vision Transformer visual encoder for multi-scale feature extraction, enabling the simultaneous acquisition of local detail features and global semantic features from remote sensing images, thereby enhancing the understanding of targets at different scales. Through a vision-language normalization projection mechanism, visual features are mapped to the language model feature space, achieving effective alignment between visual and linguistic modalities and enhancing semantic consistency between image content and text instructions. By embedding visual features into language sequences using remote sensing-specific labeling rules, different regions in the image and their spatiotemporal context information can be more accurately expressed, improving multimodal joint representation capabilities. After inputting the multimodal joint representation into the language model, the system can combine image features and text instructions to generate corresponding remote sensing image understanding results. This not only identifies image targets but also performs scene analysis such as urban expansion and vegetation changes. Based on these technologies, this method provides more accurate remote sensing image analysis results and can be applied to fields such as agricultural monitoring, urban planning, and environmental protection.
[0046] Other advantages, objectives and features of this application will be apparent in part from the description which follows, and in part from what those skilled in the art will understand through study and practice of this application.
Attached Figure Description
[0047] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit this specification. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:
[0048] Figure 1 is a schematic flowchart of a visual language processing method for remote sensing images provided in an embodiment of this application;
[0049] Figure 2 is a schematic diagram of the overall architecture of a visual language processing method for remote sensing images provided in an embodiment of this application;
[0050] Figure 3 is a schematic diagram of a hierarchical Vision Transformer structure provided in an embodiment of this application;
[0051] Figure 4 is a flowchart of a vision-language projection mechanism provided in an embodiment of this application;
[0052] Figure 5 is an example diagram of remote sensing-specific marker injection and feature replacement provided in this embodiment;
[0053] Figure 6 is a schematic diagram of a three-stage training strategy proposed in this invention;
[0054] Figure 7 is a structural schematic diagram of a visual language processing system for remote sensing images provided in an embodiment of this application;
[0055] Figure 8 is a schematic diagram of an electronic device structure provided in an embodiment of this application.
Detailed Implementation
[0056] The terms "first," "second," "third," "fourth," etc. (if present) in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in a sequence other than that illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus. The technical solutions of the embodiments of this application will now be clearly and completely described in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them.
[0057] Please refer to Figure 1 and Figure 2. Figure 1 is a schematic flowchart of a visual language processing method for remote sensing images provided in an embodiment of this application, and Figure 2 is a schematic diagram of the overall architecture of that method. The method may specifically include:
[0058] S110. Acquire the input remote sensing image data and text commands;
[0059] S120. Multi-scale feature extraction is performed on the above remote sensing image using a hierarchical Vision Transformer visual encoder to obtain visual features that include local detail features and global semantic features.
[0060] S130. The above visual features are mapped to the feature representation space of the language model through the visual-language normalization projection mechanism to achieve the alignment of the visual modality and the language modality and obtain the aligned visual features.
[0061] S140. Based on the preset remote sensing-specific labeling rules, the aligned visual features are embedded into the language sequence through label injection and feature replacement to construct a multimodal joint representation.
[0062] S150. Input the above multimodal joint representation into the language model for inference to generate the remote sensing image semantic understanding result corresponding to the above text command.
[0063] For example, first, the system acquires the input remote sensing image data and corresponding text instructions. Remote sensing images typically contain a large amount of environmental information and geographic data, while the text instructions provide the system with the task objective or problem description for parsing the image, such as "describe urban sprawl." By acquiring this data, the system can begin processing the image and guide the image understanding process through text.
[0064] The system utilizes a hierarchical Vision Transformer (ViT) visual encoder to extract multi-scale features from remote sensing images. This visual encoder divides the image into regions of different scales using a pyramid-like structure and fuses information from these different scales through a cross-level self-attention mechanism. In this way, the system can not only capture local detailed features of the image (such as details of buildings and roads) but also obtain global semantic features of the image (such as large-scale features of urban areas and natural landscapes), laying the foundation for subsequent analysis.
[0065] After extracting visual features, the system maps these visual features to the feature representation space of the language model through a vision-language normalization projection mechanism. This mechanism includes layer normalization, non-linear dimensionality mapping, and feature refinement to address the differences between visual and linguistic modalities. The purpose of this step is to ensure that visual and linguistic features can be compared and aligned in the same semantic space, enabling the system to effectively understand the semantic information of remote sensing images and prepare for subsequent inference.
[0066] Based on the aligned visual features, the system embeds these visual features into the language sequence through marker injection and feature replacement, using preset remote sensing-specific marker rules. Specifically, the system defines a set of special markers that can accurately represent different parts of an image or its temporal and spatial context. In this way, visual information is embedded into the language sequence to form a multimodal joint representation, so that visual features and linguistic features can interact within the same sequence.
[0067] The system inputs the constructed multimodal joint representation into a language model for inference. The language model performs semantic reasoning on this joint representation and, combined with text instructions, generates a semantic understanding result of the remote sensing image that matches the instructions. In other words, by combining visual and linguistic information through a deep learning model, the system can automatically identify and interpret various elements in remote sensing images, thereby generating an understanding result for text instructions, such as describing urban expansion or vegetation changes in the image.
[0068] In summary, this method employs a hierarchical Vision Transformer (ViT) visual encoder for multi-scale feature extraction. The system can simultaneously capture both local detail features and global semantic features of remote sensing images, significantly improving the understanding of targets at different scales and overcoming the limitations of traditional models in handling complex remote sensing images. This method utilizes a vision-language normalization projection mechanism to effectively map visual features to the feature space of a language model, achieving precise alignment between visual and linguistic modalities. This not only eliminates the differences between visual and linguistic modalities but also ensures deep semantic consistency between image content and text instructions, enabling the system to better understand and infer information in the image. This embodiment uses remote sensing-specific labeling rules to embed visual features into the language sequence through label injection and feature replacement. This method can accurately represent different parts of the image and temporal and spatial contextual information, thereby enhancing the expressive power of the multimodal joint representation and improving the accuracy and flexibility of remote sensing image understanding. Inputting the constructed multimodal joint representation into the language model for inference can generate remote sensing image understanding results that are semantically consistent with the instructions, based on image features and text instructions. The system can not only automatically identify targets in images but also understand command requirements and generate targeted semantic results, such as descriptions of specific scenarios like urban sprawl and vegetation changes. This process significantly improves the automatic understanding capabilities of remote sensing images, especially in complex application scenarios. 
Through the comprehensive application of the above technologies, this embodiment enables the system to provide intelligent remote sensing image analysis results based on multimodal information fusion. This capability can be widely applied in fields such as agricultural monitoring, urban planning, and environmental protection, providing users with more accurate and in-depth decision support.
[0069] In one feasible implementation, the hierarchical Vision Transformer visual encoder performs multi-scale feature extraction on the remote sensing image to obtain visual features containing local detail features and global semantic features, including:
[0070] The aforementioned remote sensing images are divided into image region blocks of different scales according to a preset resolution rule;
[0071] Feature encoding is performed on image regions of different scales to obtain local and global visual features at the corresponding scales;
[0072] By fusing visual features at different scales through a cross-level attention mechanism, a unified visual feature representation is generated.
[0073] For example, Figure 3 is a schematic diagram of the hierarchical Vision Transformer structure provided in an embodiment of this application. In this process, the remote sensing image is divided into image region blocks of different scales according to a preset resolution rule. Under the preset rule, block sizes of 16×16, 32×32, and 64×64 correspond to local details, mid-level structure, and global semantics, respectively. A high-resolution remote sensing image (e.g., 1024×1024) is first divided into 16×16 region blocks to extract local detail features; the image is then downsampled to 512×512 and divided into 32×32 region blocks to extract mid-level structure features; finally, the image is downsampled to 256×256 and divided into 64×64 region blocks to extract global semantic features.
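The block counts implied by this example can be checked with a short calculation. The following is a minimal sketch, not the applicant's implementation; the function name token_grid and the LEVELS list are illustrative names introduced here, and square images and blocks are assumed.

```python
def token_grid(image_size, patch_size):
    """Return (rows, cols, block_count) for a square image split into square blocks."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side, side, side * side


# (image size after downsampling, block size) for each level of the example above
LEVELS = [(1024, 16), (512, 32), (256, 64)]

for image_size, patch_size in LEVELS:
    rows, cols, count = token_grid(image_size, patch_size)
    print(f"{image_size}x{image_size} image, {patch_size}x{patch_size} blocks: "
          f"{rows}x{cols} grid, {count} blocks")
```

Under these assumptions the three levels yield 4096, 256, and 16 blocks, so the coarser levels add comparatively little to the total sequence length.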
[0074] Feature encoding is performed on image regions of different scales to obtain local and global visual features at the corresponding scale. For the image regions at each scale, the hierarchical Vision Transformer extracts features through an encoder, and the features of each scale are independently encoded into feature vectors containing local and global information. Specifically, in each Transformer encoder layer, a cross-level self-attention module is introduced to calculate the correlation between feature blocks of different resolutions, thereby fusing multi-scale information. The hierarchical structure is built through a convolutional token embedding: given the input feature map of the (i-1)-th stage, $x_{i-1} \in \mathbb{R}^{H_{i-1} \times W_{i-1} \times C_{i-1}}$, a function $f(\cdot)$ maps it to a new token map $f(x_{i-1}) \in \mathbb{R}^{H_i \times W_i \times C_i}$, where $f(\cdot)$ is a two-dimensional convolution with kernel size $k \times k$, stride $s$, and padding $p$. The height and width of the new token map are calculated as follows:
[0075] $$H_i = \left\lfloor \frac{H_{i-1} + 2p - k}{s} \right\rfloor + 1$$
[0076] $$W_i = \left\lfloor \frac{W_{i-1} + 2p - k}{s} \right\rfloor + 1$$
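The size of the token map produced by a strided two-dimensional convolution follows the standard output-size rule floor((H + 2p - k) / s) + 1. The sketch below applies that rule; the function name conv_token_map_size and the kernel, stride, and padding values in the usage lines are assumptions chosen for illustration and are not taken from this application.

```python
def conv_token_map_size(height, width, kernel, stride, padding):
    """Height and width of the token map after a kernel x kernel convolution."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w


# Illustrative values only: a 7x7 kernel with stride 4 and padding 2,
# followed by a 3x3 kernel with stride 2 and padding 1.
stage1 = conv_token_map_size(224, 224, kernel=7, stride=4, padding=2)
stage2 = conv_token_map_size(*stage1, kernel=3, stride=2, padding=1)
print(stage1, stage2)
```

Each stage shrinks the spatial grid, which is what lets later stages attend over fewer tokens that each cover a larger image area.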
[0077] Two-dimensional interpolated positional encoding is used to support input images of different sizes. Traditional ViT positional encoding struggles to adapt to the varying input sizes and resolutions of remote sensing images, while two-dimensional interpolated positional encoding achieves an adaptive representation of position information through learnable embedding vectors and spatial interpolation. The specific implementation steps are as follows. First, a learnable positional encoding matrix of size (H_max × W_max) × D is defined, where H_max and W_max are the maximum image height and width in the training data, and D is the feature dimension. These positional encoding vectors are learned and updated during training to adapt to the scale and positional relationships of different targets in the remote sensing image. Second, for an input of size H × W, bilinear interpolation is used to sample the corresponding positional encodings from the matrix.
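The second step can be sketched in NumPy as follows. This is an illustrative sketch only: the function name interpolate_pos_embed is hypothetical, the learnable matrix is stored here as an array of shape (H_max, W_max, D) rather than (H_max × W_max) × D, and sampling points are assumed to be spread evenly from the first to the last row and column of the table.

```python
import numpy as np


def interpolate_pos_embed(table, out_h, out_w):
    """Bilinearly resample a (H_max, W_max, D) positional table to (out_h, out_w, D)."""
    h_max, w_max, _ = table.shape
    # Fractional sampling coordinates in the table's index space.
    ys = np.linspace(0.0, h_max - 1, out_h)
    xs = np.linspace(0.0, w_max - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h_max - 1)
    x1 = np.minimum(x0 + 1, w_max - 1)
    wy = (ys - y0)[:, None, None]  # vertical blend weights
    wx = (xs - x0)[None, :, None]  # horizontal blend weights
    top = table[y0][:, x0] * (1 - wx) + table[y0][:, x1] * wx
    bottom = table[y1][:, x0] * (1 - wx) + table[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


rng = np.random.default_rng(0)
pos_table = rng.normal(size=(64, 64, 32))  # stands in for the learned matrix
pos_small = interpolate_pos_embed(pos_table, 16, 24)
print(pos_small.shape)
```

Because the table is resampled rather than indexed directly, the same learned parameters can serve inputs whose token grid differs from the grid seen in training.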
[0078] An optimized attention computation reduces the complexity of self-attention in the sequence length N from O(N²) to O(N), significantly improving the efficiency of high-resolution remote sensing image processing. The basic computational process of multi-head attention is as follows:
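[0079] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
[0080] where $Q, K, V \in \mathbb{R}^{N \times d}$ are the query, key, and value matrices, respectively, $N$ is the sequence length, $d$ is the head dimension, and $1/\sqrt{d}$ is the scaling factor.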
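For reference, single-head scaled dot-product attention with query, key, and value matrices of shape (N, d) can be written in a few lines of NumPy. This is a minimal sketch of the standard computation; the function names softmax and attention are illustrative, and masking and multiple heads are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for q, k, v of shape (N, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)  # (N, N) similarity matrix
    return softmax(scores) @ v     # weighted sum of value rows


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
print(attention(q, k, v).shape)
```

The (N, N) score matrix is what makes the memory cost quadratic in the sequence length when it is materialized in full.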
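[0081] Flash Attention avoids storing the complete attention matrix by splitting the attention computation into multiple blocks and performing a local softmax computation within each block. The core idea is to exploit the shift invariance of softmax:
[0082] $$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{x_i - c}}{\sum_j e^{x_j - c}}$$
[0083] where $c$ can be any constant. By choosing an appropriate value of $c$, the computation can be optimized without affecting the result. Given the output gradient $dO$, the gradient calculation for standard attention follows the chain rule. With $S = QK^{\top}/\sqrt{d}$, $P = \mathrm{softmax}(S)$ applied row-wise, and $O = PV$:
[0084] $$dV = P^{\top} dO, \quad dP = dO\,V^{\top}, \quad dS_i = \left(\mathrm{diag}(P_i) - P_i P_i^{\top}\right) dP_i, \quad dQ = \frac{dS\,K}{\sqrt{d}}, \quad dK = \frac{dS^{\top} Q}{\sqrt{d}}$$
[0085] where $P_i$, $dP_i$, and $dS_i$ denote the i-th rows of $P$, $dP$, and $dS$ written as column vectors. The calculation of $dS$ uses the derivative property of the softmax function: for vectors $x$ and $y = \mathrm{softmax}(x)$, we have $\partial y / \partial x = \mathrm{diag}(y) - y y^{\top}$.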
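The blockwise idea can be illustrated on the CPU with a running maximum and a running normalizer, so that the full N × N score matrix is never held at once. The sketch below is only an illustration of the softmax shift property in the forward pass; it does not reproduce the GPU memory tiling of Flash Attention, and the function names and the block size are assumptions.

```python
import numpy as np


def standard_attention(q, k, v):
    """Reference implementation that materializes the full score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def blockwise_attention(q, k, v, block=4):
    """Process keys and values block by block with an online softmax."""
    n, d = q.shape
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)  # running maximum of the scores per query
    row_sum = np.zeros((n, 1))          # running softmax denominator per query
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = q @ kb.T / np.sqrt(d)
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        # Rescale earlier partial results to the new shift constant.
        scale = np.exp(row_max - new_max)
        p = np.exp(s - new_max)
        row_sum = row_sum * scale + p.sum(axis=-1, keepdims=True)
        out = out * scale + p @ vb
        row_max = new_max
    return out / row_sum


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(blockwise_attention(q, k, v), standard_attention(q, k, v)))
```

Only one block of scores of shape (N, block) exists at any time, and the rescaling step is exactly the change of the constant in the shifted softmax.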
[0086] In summary, this embodiment extracts multi-scale features from remote sensing images using a hierarchical Vision Transformer encoder and effectively fuses visual features at different scales using a cross-level attention mechanism, ultimately generating a unified visual feature representation. This method can capture local detail features and global semantic features in remote sensing images, providing accurate image representations for subsequent vision-language alignment and inference processes. Through this multi-scale feature extraction and fusion mechanism, the model can better understand and process complex targets and scenes in remote sensing images.
[0087] In one feasible implementation, the visual features are mapped to the feature representation space of the language model through the visual-language normalization projection mechanism to achieve alignment between the visual modality and the language modality, and to obtain the aligned visual features, including:
[0088] The above visual features are normalized to reduce the difference in feature distribution between the visual modality and the language modality;
[0089] The normalized visual features are converted into feature representations consistent with the hidden feature dimensions of the language model through a nonlinear mapping structure.
[0090] The mapped visual features are enhanced by applying a hybrid activation function to obtain the aligned visual features.
[0091] For example, Figure 4 is a flowchart of the vision-language projection mechanism provided in an embodiment of this application. Layer normalization is applied to the visual features to reduce the distribution difference between modalities. For an input feature vector $x \in \mathbb{R}^{D}$, layer normalization is calculated as follows:
[0092] $$\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$
[0093] where $\mu$ and $\sigma^{2}$ are the mean and variance of $x$, respectively, $\epsilon$ is a small constant that prevents division by zero, and $\gamma$ and $\beta$ are learnable scaling and translation parameters. During backpropagation, gradient calculation follows the chain rule.
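Layer normalization over the feature dimension can be sketched as follows. The function name layer_norm and the default value of eps are assumptions made for illustration; gamma and beta stand for the learnable scaling and translation parameters.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row of x to zero mean and unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
visual = rng.normal(loc=2.0, scale=5.0, size=(3, 16))  # stand-in visual features
normed = layer_norm(visual, gamma=np.ones(16), beta=np.zeros(16))
print(normed.mean(axis=-1), normed.std(axis=-1))
```

With gamma set to one and beta set to zero, every feature vector leaves this step with mean close to 0 and standard deviation close to 1, regardless of the scale of the visual encoder's output.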
[0094] A two-layer MLP structure is used to expand the visual feature dimension and then compress it to the hidden dimension of the language model. The specific implementation is as follows:
[0095] $$h = W_{2}\,\phi(W_{1} x)$$
[0096] where $W_{1} \in \mathbb{R}^{d_h \times D}$ and $W_{2} \in \mathbb{R}^{d_{\mathrm{LM}} \times d_h}$ are weight matrices, $d_h$ is the hidden layer dimension, $d_{\mathrm{LM}}$ is the hidden dimension of the language model, and $\phi$ is the non-linear activation function added in the middle to enhance feature representation. For small and dense targets in remote sensing images, the non-linear characteristics of the MLP can better capture their complex spatial relationships and semantic features.
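A two-layer projection of this kind can be sketched as below. All dimensions, the omission of bias terms, and the choice of GELU as the intermediate activation are assumptions made for the sketch; the function names gelu and project are illustrative.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))


def project(visual, w1, w2):
    """Expand to the hidden width with w1, apply the activation, compress with w2.

    visual: (num_tokens, d_visual), w1: (d_hidden, d_visual), w2: (d_lm, d_hidden).
    """
    return gelu(visual @ w1.T) @ w2.T


rng = np.random.default_rng(0)
d_visual, d_hidden, d_lm = 8, 32, 16  # illustrative sizes only
w1 = rng.normal(scale=0.1, size=(d_hidden, d_visual))
w2 = rng.normal(scale=0.1, size=(d_lm, d_hidden))
tokens = rng.normal(size=(5, d_visual))
print(project(tokens, w1, w2).shape)
```

The output has one row per visual token and as many columns as the language model's hidden dimension, which is what allows those rows to be placed directly into the language sequence.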
[0097] The spatial modeling capability is dynamically adjusted for different scenarios using the GELU and FReLU activation functions. GELU (Gaussian Error Linear Unit) is the primary activation function, and its formula is as follows:
[0098] $$\mathrm{GELU}(x) = x \cdot \Phi(x)$$
[0099] where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. FReLU models the spatial dependence of features using a two-dimensional funnel-shaped condition $T(x)$, as shown in the formula:
[0100] $$\mathrm{FReLU}(x) = \max\left(x, T(x)\right)$$
[0101] where $T(x)$ is the two-dimensional spatial condition, which can extract the fine spatial layout of objects. The contribution of the different activation functions is dynamically adjusted through a gating mechanism. For example, GELU is mainly used when processing natural scenes; when processing structured scenes such as man-made buildings, the weight of FReLU is increased to better capture their geometric features.
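One way to picture the gated combination is the sketch below. The text does not give the gating formula, so the convex combination with a scalar gate is an assumption, as are the function names; the funnel condition is modeled as a 3×3 depthwise convolution without the normalization layer that FReLU implementations commonly add.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def gelu(x):
    """Exact GELU: x times the standard normal CDF of x."""
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))


def depthwise_conv3x3(x, kernels):
    """x: (C, H, W), kernels: (C, 3, 3). Zero padding keeps H and W unchanged."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros(x.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out


def frelu(x, kernels):
    """Funnel activation: elementwise max of x and its spatial condition T(x)."""
    return np.maximum(x, depthwise_conv3x3(x, kernels))


def hybrid_activation(x, kernels, gate):
    """gate in [0, 1] is the weight of FReLU; 1 - gate is the weight of GELU."""
    return (1.0 - gate) * gelu(x) + gate * frelu(x, kernels)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6, 6))            # (channels, height, width)
kernels = rng.normal(scale=0.1, size=(4, 3, 3))  # one 3x3 kernel per channel
print(hybrid_activation(features, kernels, gate=0.7).shape)
```

A gate near 0 reproduces GELU alone, while a gate near 1 lets the spatially conditioned FReLU term dominate, matching the described shift toward FReLU for structured scenes.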
[0102] This embodiment effectively enhances the expressive power of visual features by combining normalization processing, nonlinear mapping, and hybrid activation functions, ensuring accurate alignment between visual and linguistic modalities. Furthermore, it improves the processing effect of remote sensing images in different scenarios by dynamically adjusting spatial modeling capabilities.
[0103] In one feasible implementation, based on preset remote sensing-specific labeling rules, the aligned visual features are embedded into the language sequence via label injection and feature replacement to construct a multimodal joint representation, including:
[0104] Define basic tags for representing the overall semantics of remote sensing images, and extended tags for representing local patches, temporal context, and spatial context of remote sensing images;
[0105] Based on the semantic complexity of the text instructions, the usage type and quantity of the above-mentioned basic and extended tags are dynamically determined, and the determined remote sensing-specific tags are injected into the corresponding positions in the language sequence.
[0106] During the forward propagation of the model, the position of the aforementioned remote sensing-specific markers in the language sequence is identified, and the corresponding visual feature vectors are extracted from the visual features based on the image region, temporal information, or spatial context corresponding to the markers.
[0107] By performing a feature replacement operation, the language features at the corresponding marked positions in the above language sequence are replaced with the extracted visual feature vectors, so as to achieve the fusion of visual features and language sequence and form the above multimodal joint representation.
[0108] For example, Figure 5 shows an example diagram of remote sensing-specific marker injection and feature replacement. This embodiment details the specific implementation process of the remote sensing-specific marker system:
[0109] 1. Basic tag definition: Define a basic tag that represents the global features of a remote sensing image. For example, a user instruction might be: "Describe the urban sprawl."
[0110] 2. Extended tagging system: Design an extended tagging system comprising three tag types: a patch tag, which represents the i-th patch in a remote sensing image; a temporal tag, which indicates the temporal context of a remote sensing image; and a spatial tag, which represents the spatial context of a remote sensing image.
[0111] 3. Dynamic injection strategy: The marker density is adjusted automatically according to instruction complexity. For a simple instruction such as "identify buildings in an image," a single basic tag suffices; for a complex instruction such as "compare vegetation cover changes in the same area in 2025 and 2024," several extended tags must be used together to mark the same area at different points in time, and the marked features are associated through positional encoding.
[0112] 4. Feature replacement operation: During forward propagation, the marker positions are identified and replaced with the corresponding visual features, integrating visual information into the language sequence. Mathematically, given the original image features $F$, a region mask $M$ (where 1 marks a region to be replaced), and the new features $F_{\text{new}}$, the feature replacement operation is defined as follows:
[0113] $F' = (1 - M) \odot F + M \odot \mathrm{resize}(F_{\text{new}})$
[0114] where $\odot$ indicates element-wise multiplication and $\mathrm{resize}$ indicates resizing. For example, for the instruction "compare vegetation cover changes", the positions of the two markers are identified, and the feature vectors of the third patch of the 2025 remote sensing image and the third patch of the 2024 remote sensing image are extracted from the improved ViT. The extracted visual feature vectors then replace the embeddings at the corresponding marker positions in the language model input sequence to form the final input representation.
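The marker injection and feature replacement steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of this embodiment: the marker spellings (`<patch_3|2025>`, `<patch_3|2024>`), the whitespace tokenizer, the toy two-dimensional embedding, and every function name are hypothetical, and `masked_replace` shows one plausible reading of a mask-based replacement in which 1 marks positions to overwrite (resizing is omitted because the toy vectors already share a length).

```python
from typing import Callable, Dict, List

Vector = List[float]

def inject_markers(instruction: str, markers: List[str]) -> List[str]:
    """Place the selected markers ahead of the whitespace-tokenized instruction."""
    return list(markers) + instruction.split()

def replace_features(tokens: List[str],
                     embed: Callable[[str], Vector],
                     visual: Dict[str, Vector]) -> List[Vector]:
    """Use the visual vector at marker positions and the text embedding elsewhere."""
    return [list(visual[t]) if t in visual else embed(t) for t in tokens]

def masked_replace(old: Vector, mask: List[int], new: Vector) -> Vector:
    """Element-wise blend (1 - m) * old + m * new; m = 1 selects the new feature."""
    return [(1 - m) * o + m * n for o, m, n in zip(old, mask, new)]

def toy_embed(token: str) -> Vector:
    """Toy 2-d text embedding, for illustration only."""
    return [float(len(token)), 0.0]

# Hypothetical markers: the third patch at two acquisition dates.
visual = {"<patch_3|2025>": [0.9, 0.1], "<patch_3|2024>": [0.4, 0.6]}
tokens = inject_markers("compare vegetation cover changes", list(visual))
joint = replace_features(tokens, toy_embed, visual)
```

Here `joint` has one vector per token: the first two come from the visual encoder and the remaining four from the text embedding, so the language model sees a single mixed sequence.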
[0115] This embodiment improves the efficiency and accuracy of remote sensing image processing through refined training strategies and optimization techniques, providing strong support for image understanding tasks in practical applications.
[0116] In one feasible implementation, inputting the above multimodal joint representation into the language model for inference, to generate the remote sensing image semantic understanding result corresponding to the above text instruction, includes:
[0117] The above multimodal joint representation is used as the input sequence of the language model;
[0118] Semantic reasoning and context modeling are performed on the above multimodal joint representation based on the language model;
[0119] Output remote sensing image understanding results that are semantically consistent with the above text instructions.
[0120] For example, visual features of remote sensing images are first extracted using a hierarchical Vision Transformer (ViT), and then combined with linguistic features from text commands. These two features are then fused together using a pre-defined mechanism to form a multimodal joint representation. This joint representation includes detailed visual information from the image and semantic content from the text commands, constituting a multimodal input sequence.
[0121] After receiving the multimodal joint representation, the language model performs semantic reasoning, that is, understanding the deep semantic relationships between images and text instructions. This process includes identifying the task or objective described in the text instructions and combining this information with the image content to generate a reasonable inference result. Simultaneously, the model performs context modeling to ensure that the input text instructions match the content in the image and that it can deduce the correct semantics of the remote sensing image based on contextual information. Context modeling helps handle the contextual differences between images and text instructions, ensuring the accuracy and consistency of the inference results.
[0122] After performing inference and contextual modeling, the language model ultimately outputs remote sensing image understanding results that are semantically consistent with the text instructions. This means that the system can automatically analyze remote sensing images based on given text instructions (such as "describe the urban sprawl in this area"), extract key information related to the instructions, and generate corresponding semantic results, such as urban sprawl areas and changes in building density in the image.
[0123] In one feasible implementation, the training steps of the hierarchical Vision Transformer visual encoder described above include:
[0124] A hierarchical Vision Transformer architecture is pre-trained on unlabeled remote sensing images, employing a masked autoencoder objective and a rotated variable window attention mechanism to enhance the model's ability to capture multi-scale features of remote sensing images.
[0125] Contrastive learning is performed on large-scale remote sensing image-text pairs. The visual encoder is frozen, and only the visual-linguistic projection layer is trained. The alignment quality of multimodal representations is gradually improved through a progressive alignment strategy.
[0126] Supervised fine-tuning is performed on remote sensing instruction-response pairs, combined with LoRA low-rank adaptation and DeepSpeed ZeRO-3 optimization techniques to reduce trainable parameters and improve training efficiency.
[0127] For example, Figure 6 is a schematic diagram of the three-stage training strategy proposed in this invention. This embodiment also provides a three-stage efficient training strategy, the specific implementation process of which includes:
[0128] Phase 1: Visual Encoder Pre-training. A hierarchical ViT architecture is pre-trained on unlabeled remote sensing images to learn general feature representations of the images. A masked autoencoder (MAE) objective and a rotated variable window attention mechanism are employed to enhance the model's ability to capture multi-scale features from the remote sensing images. The pre-trained visual encoder weights are output.
[0129] Phase 2: Vision-Language Joint Training. Contrastive learning is performed on 828,700 remote sensing image-text pairs. The visual encoder is frozen, and only the vision-language projection layer is trained. A noise-robust progressive alignment strategy and curriculum learning are employed to gradually improve alignment quality. The output is an aligned multimodal representation space.
[0130] Phase 3: Instruction Fine-tuning. Supervised fine-tuning is performed on over 100,000 remote sensing instruction-response pairs. The LoRA low-rank adaptation technique is used to reduce trainable parameters, and DeepSpeed ZeRO-3 is integrated to optimize distributed training, improving training efficiency and model performance. The complete RSchat-VL model is output.
[0131] This embodiment improves the efficiency and accuracy of remote sensing image processing through refined training strategies and optimization techniques, providing strong support for image understanding tasks in practical applications.
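To make the parameter saving from LoRA in the third phase concrete, the following sketch counts trainable weights for one linear layer and applies the standard low-rank update y = Wx + (alpha / r) · B(Ax). The layer size (4096 × 4096), the rank (8), and all function names are hypothetical values chosen for illustration; the specification does not state which layers are adapted or at what rank.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]

def full_param_count(d_out: int, d_in: int) -> int:
    """Weights updated when the whole d_out x d_in matrix is trained."""
    return d_out * d_in

def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Weights updated when only A (rank x d_in) and B (d_out x rank) are trained."""
    return rank * (d_in + d_out)

def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, alpha: float = 1.0) -> Vector:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B receive gradients."""
    r = len(a)
    base = matvec(w, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / r) * d for y, d in zip(base, delta)]

# Hypothetical layer size and rank, used only to make the arithmetic concrete.
d, r = 4096, 8
ratio = lora_param_count(d, d, r) / full_param_count(d, d)  # 65,536 / 16,777,216 = 1/256
```

With B initialised to zero, as is conventional for LoRA, the adapted layer initially reproduces the frozen layer exactly, so fine-tuning starts from the aligned model produced by the second phase.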
[0132] In one feasible implementation, the above method further includes:
[0133] The processing size is automatically adjusted according to the characteristics of the input image to ensure that the multi-scale feature extraction of the hierarchical Vision Transformer is adapted without losing key information.
[0134] The computational resources are dynamically adjusted based on the complexity of the image content to optimize the visual feature extraction process.
[0135] By employing a layered processing strategy, thumbnails are generated for large images, with key regions being processed at high resolution first, while non-key regions are processed at low resolution.
[0136] For example, this embodiment also provides a dynamic adaptive image processing scheme, specifically including:
[0137] 1. Dynamic image resizing: Automatically adjusts the processing strategy based on the characteristics of the input image. For example, for high-resolution remote sensing images (such as 2048×2048), it automatically adjusts them to a size suitable for model processing (such as 1024×1024 or 512×512) while maintaining the integrity of key information.
[0138] 2. Dynamic patch quantity allocation: Computational resources are adjusted within the range of `min_dynamicPatch` and `max_dynamicPatch` based on the complexity of the image content. Specifically, when the image content is complex, the number of patches is increased to improve processing accuracy. When the image content is simple, the number of patches is decreased to improve processing efficiency. For example, in an urban expansion analysis scenario, `min_dynamicPatch=32` and `max_dynamicPatch=128` can be set to automatically adjust the number of patches based on the building density in the image.
[0139] 3. Thumbnail Acceleration Mode: This mode employs a layered processing strategy for large images, prioritizing the processing of key regions. Specifically, it first generates thumbnails for large images, identifies key regions using object detection or semantic segmentation models, then processes the key regions at high resolution and the non-key regions at low resolution, thereby improving overall processing efficiency.
[0140] This embodiment significantly improves the efficiency and accuracy of remote sensing image processing through dynamic adaptive image size adjustment, computing resource allocation, and thumbnail acceleration mode, providing strong support for efficient and real-time remote sensing image analysis.
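The resizing and patch allocation rules above can be sketched as two small functions. The linear mapping from a complexity score to a patch count, the definition of that score as a value in [0, 1] (for instance a normalised building density), and the rule for choosing a target size are all assumptions made for illustration; the parameters `min_dynamic_patch` and `max_dynamic_patch` correspond to `min_dynamicPatch` and `max_dynamicPatch` in the text.

```python
from typing import Sequence

def choose_patch_count(complexity: float,
                       min_dynamic_patch: int = 32,
                       max_dynamic_patch: int = 128) -> int:
    """Map a complexity score in [0, 1] linearly onto the allowed patch range."""
    c = min(1.0, max(0.0, complexity))  # clamp out-of-range scores
    return round(min_dynamic_patch + c * (max_dynamic_patch - min_dynamic_patch))

def choose_target_size(width: int, height: int,
                       candidates: Sequence[int] = (512, 1024)) -> int:
    """Pick the largest candidate side that does not exceed the shorter image side.

    Falls back to the smallest candidate when the image is smaller than all of them.
    """
    short_side = min(width, height)
    fitting = [s for s in candidates if s <= short_side]
    return max(fitting) if fitting else min(candidates)

# A dense urban scene receives more patches than open farmland.
dense_city = choose_patch_count(0.9)
farmland = choose_patch_count(0.1)
target = choose_target_size(2048, 2048)
```

A linear map is the simplest monotone choice; any rule that increases the patch count with content complexity and stays inside the configured bounds would fit the description equally well.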
[0141] In one feasible implementation, the method proposed in this invention further provides a multi-turn dialogue support mechanism, implemented as follows:
[0142] Conversation template management: Design a conversation template structure to store historical information from multi-turn conversations.
[0143] Session = {
[0144] "image_features": [...], # Historical image features
[0145] "textPrompts": [...], # Historical text prompts
[0146] "contextVector": c # Context vector
[0147] }
[0148] Feature fusion mechanism: In each round of dialogue, the features of the currently input remote sensing image are fused with historical features through a gated attention mechanism: fused feature = GatedAttention(current feature, historical feature, contextVector) where GatedAttention is a gated attention function that can selectively fuse relevant historical information according to the context.
[0149] Extended-round masking strategy: When the number of dialogue rounds exceeds a preset value, the model automatically enters extended-round mode, supporting dozens of rounds of continuous dialogue by dynamically adjusting the attention mechanism and feature fusion strategy. For example, in disaster emergency scenarios, users can gradually clarify their needs through continuous dialogue, and the system can iterate on its responses autonomously and push the results.
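The session structure and gated fusion described above might be sketched as follows. This is only one possible reading: the specification does not define GatedAttention, so the version here (softmax attention over history using the context vector as the query, followed by a sigmoid gate) is an assumption, it has no learned parameters, and the context update in `add_turn` is a toy rule. Field names follow the session template but use Python naming.

```python
import math
from dataclasses import dataclass, field
from typing import List

Vector = List[float]

@dataclass
class Session:
    image_features: List[Vector] = field(default_factory=list)  # historical image features
    text_prompts: List[str] = field(default_factory=list)       # historical text prompts
    context_vector: Vector = field(default_factory=list)        # running context vector

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def gated_attention(current: Vector, history: List[Vector], context: Vector) -> Vector:
    """Summarise history by attention, then gate the summary against the current feature."""
    if not history:
        return list(current)
    scores = [dot(context, h) for h in history]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    summary = [sum(w * h[i] for w, h in zip(weights, history)) / total
               for i in range(len(current))]
    gate = 1.0 / (1.0 + math.exp(-dot(current, summary)))  # sigmoid gate in (0, 1)
    return [gate * s + (1.0 - gate) * c for s, c in zip(summary, current)]

def add_turn(session: Session, feature: Vector, prompt: str) -> Vector:
    """Fuse the new feature with history, then record the turn in the session."""
    fused = gated_attention(feature, session.image_features, session.context_vector)
    session.image_features.append(feature)
    session.text_prompts.append(prompt)
    session.context_vector = fused  # toy update rule, an assumption
    return fused

session = Session()
first = add_turn(session, [1.0, 0.0], "Describe the urban sprawl.")
second = add_turn(session, [0.0, 1.0], "How did it change since last year?")
```

The gate lets a turn that is unrelated to earlier images rely mostly on its own feature, while a follow-up question about the same scene draws more heavily on the stored history.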
[0150] Secondly, this invention also proposes a visual language processing system for remote sensing images which, as shown in Figure 7, includes:
[0151] Acquisition unit 21 is used to acquire input remote sensing image data and text commands;
[0152] Extraction unit 22 is used to perform multi-scale feature extraction on the above remote sensing image through a hierarchical Vision Transformer visual encoder to obtain visual features containing local detail features and global semantic features.
[0153] The mapping unit 23 is used to map the above visual features to the feature representation space of the language model through a visual-language normalized projection mechanism, so as to achieve the alignment of visual modality and language modality and obtain the aligned visual features.
[0154] Embedding unit 24 is used to embed the aligned visual features into the language sequence by means of marker injection and feature replacement based on preset remote sensing-specific labeling rules to construct a multimodal joint representation;
[0155] The reasoning unit 25 is used to input the above multimodal joint representation into the language model for inference, generating the remote sensing image semantic understanding result corresponding to the above text instruction.
[0156] In one feasible implementation, a visual language processing system for remotely sensed images can also perform any step of the method proposed in the first aspect.
[0157] Thirdly, the present invention also proposes an electronic device 300 which, as shown in Figure 8, includes a memory 310, a processor 320, and a computer program 311 stored in the memory 310 and executable on the processor 320. When the processor 320 executes the computer program 311, it implements the steps of the visual language processing method for remote sensing images as described in any implementation of the first aspect.
[0158] Fourthly, the present invention also proposes a computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the visual language processing method for remote sensing images as described in any one of the first aspects.
[0159] It should be noted that the descriptions of each embodiment in the above embodiments have different focuses. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0160] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A visual language processing method for remote sensing images, characterized in that it comprises: acquiring input remote sensing image data and a text instruction; performing multi-scale feature extraction on the remote sensing image using a hierarchical Vision Transformer visual encoder to obtain visual features that include local detail features and global semantic features; mapping the visual features to the feature representation space of a language model through a vision-language normalization projection mechanism to align the visual modality with the language modality and obtain aligned visual features; based on preset remote sensing-specific labeling rules, embedding the aligned visual features into a language sequence through label injection and feature replacement to construct a multimodal joint representation; and inputting the multimodal joint representation into the language model for inference to generate a remote sensing image semantic understanding result corresponding to the text instruction.
2. The visual language processing method for remote sensing images according to claim 1, characterized in that, The process involves performing multi-scale feature extraction on the remote sensing image using a hierarchical Vision Transformer visual encoder to obtain visual features that include local detail features and global semantic features, including: The remote sensing image is divided into image region blocks of different scales according to a preset resolution rule; Feature encoding is performed on image regions of different scales to obtain local and global visual features at the corresponding scales; By fusing visual features at different scales through a cross-level attention mechanism, a unified visual feature representation is generated.
3. The visual language processing method for remote sensing images according to claim 1, characterized in that mapping the visual features to the feature representation space of the language model through the vision-language normalization projection mechanism, to align the visual modality with the language modality and obtain the aligned visual features, includes: normalizing the visual features to reduce the difference in feature distribution between the visual modality and the language modality; converting the normalized visual features, through a nonlinear mapping structure, into feature representations consistent with the hidden feature dimension of the language model; and enhancing the mapped visual features by applying a hybrid activation function to obtain the aligned visual features.
4. The visual language processing method for remote sensing images according to claim 1, characterized in that embedding the aligned visual features into the language sequence through label injection and feature replacement, based on the preset remote sensing-specific labeling rules, to construct the multimodal joint representation includes: defining basic tags for representing the overall semantics of remote sensing images, and extended tags for representing local patches, temporal context, and spatial context of remote sensing images; dynamically determining, based on the semantic complexity of the text instruction, the type and quantity of basic and extended tags to use, and injecting the determined remote sensing-specific tags into the corresponding positions in the language sequence; during the forward propagation of the model, identifying the positions of the remote sensing-specific tags in the language sequence, and extracting the corresponding visual feature vectors from the visual features based on the image region, temporal information, or spatial context corresponding to each tag; and, by performing a feature replacement operation, replacing the language features at the corresponding tag positions in the language sequence with the extracted visual feature vectors, so as to fuse the visual features with the language sequence and form the multimodal joint representation.
5. The visual language processing method for remote sensing images according to claim 1, characterized in that, The step of inputting the multimodal joint representation into a language model for inference to generate a remote sensing image semantic understanding result corresponding to the text instruction includes: The multimodal joint representation is used as the input sequence of the language model; Semantic reasoning and context modeling are performed on the multimodal joint representation based on the language model; Output remote sensing image understanding results that are consistent with the semantics of the text instructions.
6. The visual language processing method for remote sensing images according to claim 1, characterized in that the training steps for the hierarchical Vision Transformer visual encoder include: pre-training a hierarchical Vision Transformer architecture on unlabeled remote sensing images, employing a masked autoencoder objective and a rotated variable window attention mechanism to enhance the model's ability to capture multi-scale features of remote sensing images; performing contrastive learning on large-scale remote sensing image-text pairs, with the visual encoder frozen and only the vision-language projection layer trained, the alignment quality of multimodal representations being gradually improved through a progressive alignment strategy; and performing supervised fine-tuning on remote sensing instruction-response pairs, combined with LoRA low-rank adaptation and DeepSpeed ZeRO-3 optimization techniques, to reduce trainable parameters and improve training efficiency.
7. The visual language processing method for remote sensing images according to claim 1, further comprising: The processing size is automatically adjusted according to the characteristics of the input image to ensure that the multi-scale feature extraction of the hierarchical Vision Transformer is adapted without losing key information. The computational resources are dynamically adjusted based on the complexity of the image content to optimize the visual feature extraction process. By employing a layered processing strategy, thumbnails are generated for large images, with key regions being processed at high resolution first, while non-key regions are processed at low resolution.
8. A visual language processing system for remote sensing images, characterized in that it comprises: an acquisition unit, used to acquire input remote sensing image data and a text instruction; an extraction unit, used to perform multi-scale feature extraction on the remote sensing image through a hierarchical Vision Transformer visual encoder to obtain visual features that include local detail features and global semantic features; a mapping unit, used to map the visual features to the feature representation space of a language model through a vision-language normalization projection mechanism, so as to align the visual modality with the language modality and obtain aligned visual features; an embedding unit, used to embed the aligned visual features into a language sequence by means of label injection and feature replacement based on preset remote sensing-specific labeling rules, so as to construct a multimodal joint representation; and an inference unit, used to input the multimodal joint representation into the language model for inference and generate a remote sensing image semantic understanding result corresponding to the text instruction.
9. An electronic device, comprising a memory and a processor, characterized in that the processor, when executing a computer program stored in the memory, implements the steps of the visual language processing method for remote sensing images as described in any one of claims 1-7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the steps of the visual language processing method for remote sensing images as described in any one of claims 1-7.