A remote sensing image interpretation method of multi-modal information synchronization and model parameter optimization

By employing a Transformer baseline network and class label synchronization operation in the remote sensing image classification model, and optimizing weight initialization, the problems of insufficient multimodal information fusion depth and gradient instability are solved, achieving high-precision multi-label classification.

CN121600409B · Active · Publication Date: 2026-06-16 · UNIV OF SHANGHAI FOR SCI & TECH +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: UNIV OF SHANGHAI FOR SCI & TECH
Filing Date: 2026-01-21
Publication Date: 2026-06-16

IPC: G06V20/10; G06V10/80; G06V10/764; G06V10/82; G06N3/0455; G06N3/084

CPC: G06V20/10; G06V10/811; G06V10/764; G06V10/82; G06N3/0455; G06N3/084

Application Domain: Character and pattern recognition; Neural learning methods

Abstract

The application provides a remote sensing image interpretation method for multimodal information synchronization and model parameter optimization, comprising the following steps: constructing a multimodal fusion model based on a Transformer baseline network, wherein the initial weight values of the multimodal fusion model are calculated from the number of input connections and the number of output connections of the trainable layers; acquiring original remote sensing image data and processing it into N modal input token sequences; inputting the N modal input token sequences into N parallel ViT encoders of the multimodal fusion model, each of which iteratively processes the input token sequence of one modality, and performing a class label synchronization operation on the N output token sequences of the N ViT encoders after each iteration so as to fuse them into a target class token; and inputting the target class token into a classifier to output a multi-label classification result corresponding to the multimodal remote sensing image data, thereby constructing a remote sensing image interpretation model.

Description

Technical Field

[0001] This application relates to the field of multimodal remote sensing image data processing technology, and in particular to a remote sensing image interpretation method for multimodal information synchronization and model parameter optimization.

Background Technology

[0002] With the development of Earth observation technology, the scale of remote sensing image archives continues to grow, providing a valuable source of information for monitoring the Earth's surface. Among these, multi-label classification is a core task, aiming to automatically assign multiple land cover category labels to each remote sensing image scene. In recent years, deep learning (DL) technology, especially the Transformer architecture based on self-attention mechanisms, has shown great potential in remote sensing image processing tasks due to its ability to effectively learn long-range context and spatial relationships between components within an image.

[0003] Remote sensing data often exhibits multimodal characteristics, meaning that data for the same geographic area can be acquired by different types of sensors (such as the Sentinel-1 synthetic aperture radar and the Sentinel-2 multispectral sensor). Jointly utilizing these multimodal images can provide a richer representation of the Earth's surface, thereby improving the effectiveness of image analysis tasks. In existing multimodal learning methods, a common technique is "early fusion," which stacks input images from different modalities along their channel dimensions and then feeds them into a single neural network for processing. However, while the Transformer architecture is increasingly popular in multimodal learning, its application in multimodal, multi-label remote sensing image classification scenarios remains very limited and requires further exploration. On the other hand, all deep learning methods face a common, fundamental challenge: the model training process. Deep neural networks are highly susceptible to gradient vanishing or exploding problems during training, which can lead to extremely slow convergence or even complete failure. Currently, various weight initialization techniques exist, among which Xavier initialization and He initialization are the most widely used advanced methods. However, existing literature and research on weight initialization techniques specifically optimized for the characteristics of satellite image data are very limited. Most research on remote sensing image classification focuses on developing new deep learning architectures, often neglecting the weight initialization process.

[0004] Therefore, the existing technology lacks a comprehensive solution that combines an advanced multimodal fusion architecture with a specially optimized weight initialization strategy.

Summary of the Invention

[0005] This application provides a remote sensing image interpretation method for multimodal information synchronization and model parameter optimization, which deepens multimodal information fusion and calculates the initial weight values of the model, thereby improving the stability and convergence speed of model training.

[0006] This application provides a method for remote sensing image interpretation that synchronizes multimodal information and optimizes model parameters. The method includes:

[0007] A multimodal fusion model is constructed based on the Transformer baseline network. The initial weight values of the multimodal fusion model are calculated based on the number of input connections and output connections of the trainable layers. The multimodal fusion model includes a classifier and N parallel ViT encoders.

[0008] Acquire raw remote sensing image data and process it into N modal input token sequences;

[0009] The N modal input token sequences are input into the N parallel ViT encoders. Each ViT encoder iteratively processes the input token sequence of one modality. After each iteration, a class label synchronization operation is performed on the N output token sequences of the N ViT encoders to fuse them into the target class token.

[0010] Input the target class token into the classifier, and output the multi-label classification result corresponding to the multimodal remote sensing image data. Repeat the steps of acquiring the original remote sensing image data to outputting the multi-label classification result until the model convergence condition is met. Based on the multimodal fusion model, a remote sensing image interpretation model is constructed.

[0011] In some embodiments, each ViT encoder includes multiple Transformer encoder blocks arranged in processing order. The step of inputting the N modal input token sequences into the N parallel ViT encoders, in which each ViT encoder iteratively processes the input token sequence of one modality and a class label synchronization operation is performed on the N output token sequences of the N ViT encoders after each iteration to fuse them into a target class token, includes:

[0012] E1: Input the N modal input token sequences into the Transformer encoder blocks corresponding to the N ViT encoders, perform attention calculation, and obtain the N output token sequences corresponding to the N Transformer encoder blocks. The output token sequences include class tokens.

[0013] E2: Concatenate the N class tokens output by the N Transformer encoder blocks along the feature dimension to form a combined class token;

[0014] E3: Fuse the combined class token into a synchronization class token;

[0015] E4: Replace the class token of the output token sequence with the synchronization class token as the new class token to obtain a new input token sequence. The new input token sequence is used to input the Transformer encoder block corresponding to the next processing order. Repeat steps E1 to E4 until the synchronization class token is calculated based on the class token of the Transformer encoder block of the last processing order. Then, determine the synchronization class token as the target class token.

[0016] In some embodiments, fusing the combined class token into a synchronization class token includes:

[0017] Using a trainable fusion transformation function, the combined class token is mapped back to the original embedding dimension to obtain the synchronization class token.

[0018] In some embodiments, the expression for the trainable fusion transformation function is as follows:

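[0019] $$\mathbf{c}_{\mathrm{sync}} = \mathbf{W}_f\,\mathbf{c}_{\mathrm{cat}} + \mathbf{b}_f$$

[0020] where $\mathbf{c}_{\mathrm{sync}}$ represents the synchronization class token, $\mathbf{W}_f$ represents the token fusion weight matrix, $\mathbf{c}_{\mathrm{cat}}$ represents the combined class token, and $\mathbf{b}_f$ represents the token fusion bias vector, which is a preset value. The token fusion weight matrix is one of the initial weight values, calculated based on the number of input and output connections of the cross-modal fusion layer among the trainable layers.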

[0021] In some embodiments, acquiring raw remote sensing image data and processing it into N modal input token sequences includes:

[0022] Acquire raw remote sensing image data, which includes remote sensing image data of N modalities;

[0023] Preprocess the remote sensing image data of N modalities respectively;

[0024] The preprocessed remote sensing image data of each modality is divided into multiple non-overlapping image blocks, and the multiple non-overlapping image blocks are converted into a block embedding token sequence corresponding to the modality. The class token corresponding to the modality is prepended to the block embedding token sequence to form the input token sequence corresponding to the modality, so as to obtain the N modal input token sequences corresponding to the N modalities of remote sensing images.

[0025] In some embodiments, the initial weight values of the multimodal fusion model are calculated as follows:

[0026] Obtain the number of input and output connections of the trainable layers in the multimodal fusion model;

[0027] Calculate the initial upper and lower boundaries of the weight parameters of the trainable layers based on the number of input and output connections;

[0028] The initial weight values of the trainable layers of the multimodal fusion model are obtained by sampling from a uniform distribution that satisfies the initial upper and lower boundaries.

[0029] In some embodiments, repeating the steps from acquiring the original remote sensing image data to outputting the multi-label classification result until the model convergence condition is met, and constructing a remote sensing image interpretation model based on the multimodal fusion model, includes:

[0030] During the iterative training process of repeating the steps from acquiring the original remote sensing image data to outputting the multi-label classification result, the macro-average accuracy is calculated for the multi-label classification results obtained in each training iteration. The macro-average accuracy serves as a reward signal that guides the adjustment of the multimodal fusion model in the next training iteration.

[0031] When the macro-average accuracy reaches the preset performance threshold, the optimized multimodal fusion model is obtained;

[0032] A remote sensing image interpretation model is constructed based on the optimized multimodal fusion model.

[0033] In some embodiments, constructing a remote sensing image interpretation model based on the optimized multimodal fusion model includes:

[0034] Acquire new raw remote sensing image data and input it into the optimized multimodal fusion model to obtain new multi-label classification results.

[0035] Validate the new multi-label classification results using micro-average accuracy, macro-average accuracy, and F1-score to determine whether the multi-label classification results meet the standards.

[0036] If the standards are not met, the loss value of the optimized multimodal fusion model is calculated based on the substandard multi-label classification results. During the backpropagation stage of the optimized multimodal fusion model, the initial weight values of the trainable layers of the optimized multimodal fusion model are adjusted according to the loss value until the multi-label classification results meet the standards and the remote sensing image interpretation model is constructed.
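The validation step above can be sketched as follows. This is only an illustration under stated assumptions: the text does not define its metrics, so the sketch reads "accuracy" as precision (TP / (TP + FP)), uses the micro-averaged F1-score, and applies a hypothetical threshold of 0.75. The function names and the toy label matrices are invented for the example.

```python
def label_counts(y_true, y_pred):
    """Return one (tp, fp, fn) tuple per label column."""
    counts = []
    for j in range(len(y_true[0])):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    return counts


def precision(tp, fp):
    """Assumed reading of 'accuracy': TP / (TP + FP), 0.0 when undefined."""
    return tp / (tp + fp) if tp + fp else 0.0


def macro_precision(counts):
    """Macro average: score each label separately, then take the mean."""
    return sum(precision(tp, fp) for tp, fp, _ in counts) / len(counts)


def micro_precision(counts):
    """Micro average: pool the counts of all labels, then score once."""
    return precision(sum(c[0] for c in counts), sum(c[1] for c in counts))


def micro_f1(counts):
    """F1-score computed on the pooled counts."""
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 4 scenes, 3 land-cover labels (hypothetical values).
y_true = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 0, 1]]

counts = label_counts(y_true, y_pred)
scores = {
    "macro": macro_precision(counts),
    "micro": micro_precision(counts),
    "f1": micro_f1(counts),
}
THRESHOLD = 0.75  # hypothetical preset standard
meets_standard = all(value >= THRESHOLD for value in scores.values())
```

Macro averaging weights every label equally, so rare land-cover classes influence the score as much as common ones, whereas micro averaging is dominated by frequent labels.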

[0037] In some embodiments, the multimodal fusion model further includes a class label mapping layer and a cross-modal fusion layer, and the trainable layers include the tile embedding layer, query projection layer, key projection layer, and value projection layer in each ViT encoder, the class label mapping layer, the cross-modal fusion layer, and the classification head layer of the classifier.

[0038] In some embodiments, multi-label classification results are used to indicate whether land use / land cover has changed, or to indicate whether floods, earthquakes, landslides, or eutrophication have occurred.

[0039] It can be understood that the remote sensing image interpretation method for multimodal information synchronization and model parameter optimization provided in this application inputs the N modal input token sequences obtained from processing the original remote sensing image data into the N parallel ViT encoders of a multimodal fusion model. Each ViT encoder iteratively processes the input token sequence of one modality. After each iteration, a class label synchronization operation is performed on the N output token sequences of the N ViT encoders to fuse them into a target class token, which is input to the classifier. This improves the effectiveness of feature interaction between remote sensing image data of different modalities, provides a solid feature foundation for high-precision multi-label classification results, and solves the problem of insufficient depth in multimodal information fusion in existing methods.

[0040] Meanwhile, the initial weight values of the multimodal fusion model are calculated from the number of input and output connections of the trainable layers, ensuring that the multimodal fusion model has an appropriate weight parameter distribution. This ensures the stable flow of signals and gradients in the deep fusion network, greatly improving the stability and convergence speed of model training. It effectively addresses the gradient vanishing or exploding problem caused in existing studies by neglecting the characteristics of remote sensing data and misapplying general-purpose initializations (such as Xavier or He).

Attached Figure Description

[0041] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0042] Figure 1 is a schematic diagram of the system framework of the multimodal fusion model provided in the embodiments of this application;

[0043] Figure 2 is a flowchart of a remote sensing image interpretation method for multimodal information synchronization and model parameter optimization provided in an embodiment of this application;

[0044] Figure 3 is a detailed schematic diagram of some modules in the system framework shown in Figure 1;

[0045] Figure 4 is another flowchart of the remote sensing image interpretation method for multimodal information synchronization and model parameter optimization provided in the embodiments of this application.

[0046] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.

Detailed Implementation

[0047] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0048] The terms “first”, “second”, etc. used in this application are for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated.

[0049] Please refer to Figure 1, which illustrates the system framework of the multimodal fusion model used by the remote sensing image interpretation method for multimodal information synchronization and model parameter optimization provided in this application. The multimodal fusion model is constructed based on the Transformer baseline network structure and includes at least an original image processing layer, a parallel ViT encoder group, a synchronous class label fusion transformation layer, and a classifier. This multimodal fusion model ultimately forms the remote sensing image interpretation model. In application, multimodal remote sensing image data is input into the remote sensing image interpretation model, which outputs multi-label classification results corresponding to the multimodal remote sensing image data. These multi-label classification results are used to indicate whether land use / land cover has changed, or to indicate whether floods, earthquakes, landslides, or eutrophication have occurred.

[0050] The technical solution of this application and how the technical solution of this application solves the above-mentioned technical problems will be described in detail below with specific embodiments. These specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments. The embodiments of this application will be described below with reference to the accompanying drawings.

[0051] Please refer to Figures 1 to 3. Figure 2 is a flowchart of the remote sensing image interpretation method for multimodal information synchronization and model parameter optimization provided in this application. As shown in Figure 2, this remote sensing image interpretation method may include the following steps:

[0052] Step S110: Construct a multimodal fusion model based on the Transformer baseline network. The initial weight values of the multimodal fusion model are calculated based on the number of input connections and the number of output connections of the trainable layers. The multimodal fusion model includes a classifier and N parallel ViT encoders.

[0053] In one implementation, the initial weight values of the multimodal fusion model are calculated as follows: First, the number of input connections and the number of output connections of the trainable layers of the multimodal fusion model are obtained. Next, the initial upper and lower boundaries of the weight parameters of the trainable layers are calculated based on the number of input and output connections. Finally, the initial weight values of the trainable layers of the multimodal fusion model are sampled from a uniform distribution that satisfies the initial upper and lower boundaries.
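The three steps above (read the connection counts, derive bounds, sample uniformly) can be sketched as follows. The text here does not state how the bounds are computed, so the sketch borrows the Glorot-style limit sqrt(6 / (fan_in + fan_out)) purely as a placeholder assumption; it is not presented as the formula used by this application. The function names and layer sizes are hypothetical.

```python
import math
import random


def uniform_bounds(fan_in: int, fan_out: int) -> tuple[float, float]:
    """Lower and upper bounds derived from input and output connection counts.

    Assumption: Glorot-style limit sqrt(6 / (fan_in + fan_out)), used only as
    a stand-in because the text does not give its own bound formula.
    """
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return -limit, limit


def init_weights(fan_in: int, fan_out: int, rng: random.Random) -> list[list[float]]:
    """Sample a fan_out x fan_in weight matrix uniformly within the bounds."""
    low, high = uniform_bounds(fan_in, fan_out)
    return [[rng.uniform(low, high) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical cross-modal fusion layer: 2 modalities x 4-dim class tokens -> 4 dims.
low, high = uniform_bounds(fan_in=8, fan_out=4)
w_fuse = init_weights(fan_in=8, fan_out=4, rng=rng)
```

Applying the same routine to every trainable layer, each with its own connection counts, corresponds to the unified initialization strategy the description emphasizes.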

[0054] It is understood that by calculating the initial weight values through this implementation method, the multimodal fusion model can be ensured to have a suitable weight parameter distribution, thereby reducing training instability caused by gradient vanishing, gradient exploding, and cross-modal feature scale differences.

[0055] As shown in Figure 1, the trainable layers may include the tile embedding layer, query projection layer, key projection layer, and value projection layer (not shown) in each ViT encoder of the parallel ViT encoder group (which includes N parallel ViT encoders), the class label mapping layer and cross-modal fusion layer in the synchronous class label fusion transformation layer, and the classification head layer of the classifier.

[0056] Understandably, the tile embedding layer is responsible for segmenting and mapping the raw remote sensing image data of different modalities to a unified feature dimension. Therefore, the accuracy of the initial weights directly affects the stability of the subsequent Transformer. By using a unified initialization method, it is ensured that all modalities have a consistent feature scale before entering the backbone network.

[0057] It can be understood that calculating the initial weight values of the query (Q) projection layer, key (K) projection layer, and value (V) projection layer of each ViT encoder amounts to calculating the initial weight values of the multi-head self-attention mechanism of each ViT encoder. This ensures that the multimodal data input has a consistent learning rate and gradient scale in the attention calculation, thereby ensuring that the class label synchronization mechanism can play a stable role in the early stage of training.
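For orientation, the role of the query, key, and value projections can be illustrated with a minimal single-head scaled dot-product self-attention. This is a simplified sketch, not the encoder of this application: it omits the multi-head split, residual connections, normalization, and the feed-forward sublayer, and the identity projection matrices and toy tokens are hypothetical.

```python
import math


def matvec(weights, vector):
    """Multiply a (d_out x d_in) matrix by a d_in vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Attention weights of this query over every key in the sequence.
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        # Output is the attention-weighted sum of the value vectors.
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs


identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sequence: class token + 2 patch tokens
attended = self_attention(tokens, identity, identity, identity)
```

Because the projection outputs feed directly into the dot products and the softmax, badly scaled initial weights in these three layers would saturate the attention weights, which is why the description treats their initialization as part of the attention mechanism.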

[0058] It can be understood that calculating the initial weight values of the class label mapping layer and the cross-modal fusion layer ensures that the cross-modal class tokens have balanced and unbiased mapped feature values in the initial training stage, making cross-modal semantic fusion more stable and faster to converge.

[0059] It can be understood that by calculating the initial weight values of the classification head layer, the parameter distribution of the output classification result is kept consistent with that of the aforementioned layers, thus forming a complete closed loop of model parameter initialization. By keeping all layers based on the same initialization strategy, error propagation caused by inconsistent gradient scales between different modules can be avoided, significantly improving the stability and optimizability of the multimodal fusion model in the early stages of training.

[0060] Step S120: Acquire the raw remote sensing image data and process it into N modal input token sequences.

[0061] Specifically, the original image processing layer shown in Figure 1 acquires the original remote sensing image data and processes it into N modal input token sequences.

[0062] In one embodiment, step S120 includes the following steps:

[0063] Step S121: Acquire raw remote sensing image data, which includes remote sensing image data of N modalities. For example, the raw remote sensing image data may include two modalities: Sentinel-1 synthetic aperture radar (SAR) images and Sentinel-2 multispectral images.

[0064] Step S122: Preprocess the remote sensing image data of N modalities respectively.

[0065] Specifically, the image preprocessing layer of the original image processing layer performs preprocessing such as radiometric calibration, geometric fine correction, atmospheric correction, denoising and registration on remote sensing image data of N modalities to ensure the consistency of cross-modal data in spatial, temporal and radiometric scales.

[0066] Step S123: Divide the preprocessed remote sensing image data of each modality into multiple non-overlapping image blocks, and convert the multiple non-overlapping image blocks into a block embedding token sequence corresponding to the modality. Prepend the class token corresponding to the modality to the block embedding token sequence to form the input token sequence corresponding to the modality, so as to obtain the N modal input token sequences corresponding to the N modalities of remote sensing images.

[0067] Specifically, step S123 is performed by the tile embedding layer of the original image processing layer.
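Step S123 can be sketched as follows for two toy modalities. The image sizes, patch size, embedding dimension, random embedding weights, and zero-valued class tokens are all hypothetical choices for illustration; positional embeddings, which a ViT would normally add, are omitted because the text does not mention them.

```python
import random


def split_into_patches(image, patch):
    """Split an H x W x C image (nested lists) into non-overlapping, flattened patches."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile exactly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            patches.append(flat)
    return patches


def linear(weights, vector):
    """Multiply a (d_out x d_in) matrix by a d_in vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def tokenize(image, patch, embed_weights, class_token):
    """Return [class_token, patch_embedding_1, ...] for one modality."""
    embeddings = [linear(embed_weights, p) for p in split_into_patches(image, patch)]
    return [list(class_token)] + embeddings


rng = random.Random(0)
embed_dim, patch = 4, 2
# Hypothetical toy inputs: a 4x4 SAR image with 2 channels, a 4x4 optical image with 3.
sar = [[[rng.random() for _ in range(2)] for _ in range(4)] for _ in range(4)]
optical = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]

sequences = []
for image in (sar, optical):
    in_dim = patch * patch * len(image[0][0])  # pixels per patch x channels
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(embed_dim)]
    sequences.append(tokenize(image, patch, weights, [0.0] * embed_dim))
```

Each modality gets its own embedding weights because the channel counts differ, yet both sequences end up in the same embedding dimension, which is what later allows their class tokens to be concatenated and fused.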

[0068] Step S130: Input the N modal input token sequences into the N parallel ViT encoders. Each ViT encoder iteratively processes the input token sequence of one modality. Perform a class label synchronization operation on the N output token sequences of the N ViT encoders after each iteration to fuse them into the target class token.

[0069] Specifically, as shown in Figure 1, the tile embedding layer inputs the obtained N modal input token sequences into the parallel ViT encoder group. Figure 3 is a detailed schematic diagram of part of the system framework in Figure 1. As shown in Figure 3, the parallel ViT encoder group consists of N parallel ViT encoders. Each ViT encoder iteratively processes the input token sequence of one modality to obtain the output token sequence corresponding to each iteration. Next, the synchronous class label fusion transformation layer in Figure 1 performs the class label synchronization operation on the output token sequences of the N ViT encoders after each iteration, so as to fuse them into the target class token shown in Figure 3.

[0070] Understandably, compared to traditional "early fusion" methods, performing class label synchronization on the N output token sequences of the N ViT encoders after each iteration improves the effectiveness of feature interactions between different modalities, providing a solid feature foundation for high-precision multi-label classification results. This addresses the problem of insufficient depth in multi-modal information fusion in existing methods.

[0071] Step S140: Input the target class token into the classifier and output the multi-label classification result corresponding to the multimodal remote sensing image data. Repeat the steps of acquiring the original remote sensing image data to outputting the multi-label classification result until the model convergence condition is met. Based on the multimodal fusion model, a remote sensing image interpretation model is constructed.

[0072] Specifically, as shown in Figure 1, the synchronous class label fusion transformation layer inputs the target class token into the classification head layer of the classifier, and the classification head layer outputs the multi-label classification result corresponding to the multimodal remote sensing image data. The steps from acquiring the original remote sensing image data to outputting the multi-label classification result are repeated until the model convergence condition is met. Based on the multimodal fusion model, the remote sensing image interpretation model is constructed.
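A classification head for multi-label output is commonly a linear layer followed by an independent sigmoid per label. The text does not specify the head's activation or decision rule, so the sigmoid and the 0.5 threshold below are assumptions, and the weights, bias, and input token are hypothetical toy values.

```python
import math


def classify(target_class_token, head_weights, head_bias, threshold=0.5):
    """Linear head plus per-label sigmoid; returns (probabilities, 0/1 labels)."""
    logits = [
        sum(w * x for w, x in zip(row, target_class_token)) + b
        for row, b in zip(head_weights, head_bias)
    ]
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Each label is decided independently, so several can be active at once.
    labels = [1 if p >= threshold else 0 for p in probs]
    return probs, labels


# Hypothetical 3-label head over a 2-dimensional target class token.
head_weights = [[2.0, 0.0], [0.0, -2.0], [1.0, 1.0]]
head_bias = [0.0, 0.0, -3.0]
probs, labels = classify([1.0, 1.0], head_weights, head_bias)
```

Unlike a softmax, the per-label sigmoid does not force the labels to compete, which matches the multi-label setting where one scene can carry several land cover categories.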

[0073] In application, the remote sensing image interpretation model takes raw remote sensing image data as input and outputs multi-label classification results. The multi-label classification results are used to indicate whether there has been a change in land use / land cover, or to indicate whether floods, earthquakes, landslides, or eutrophication have occurred.

[0074] Understandably, multi-label classification results are used to support Earth surface monitoring applications, specifically: 1. Land use / land cover (LULC) change detection; 2. Trend prediction, flood, earthquake, and landslide disaster monitoring and rapid assessment; 3. Intelligent decision support for environmental monitoring such as air quality and water eutrophication. Simultaneously, it can provide more valuable decision support in actual Earth observation tasks: In fine-grained land cover mapping, it ensures reliable alignment of optical and radar data, providing a scientific basis for resource planning. In agricultural monitoring, it improves the fusion accuracy of multispectral and SAR imagery, enhancing the accuracy of crop type identification and growth assessment; in disaster emergency response, it ensures consistency of data from different sensors, providing decision support for rapid and accurate assessment of the extent of disasters such as floods and fires. In urban functional zone identification, it provides reliable multi-source data fusion results, offering a quantitative basis for urban development and management.

[0075] It can be understood that, in the above technical solution, the N modal input token sequences obtained from processing the raw remote sensing image data are input into the N parallel ViT encoders of the multimodal fusion model. Each ViT encoder iteratively processes the input token sequence of one modality. After each iteration, a class label synchronization operation is performed on the N output token sequences of the N ViT encoders to fuse them into a target class token, which is input to the classifier. This improves the effectiveness of feature interactions between remote sensing image data of different modalities, provides a solid feature foundation for high-precision multi-label classification results, and solves the problem of insufficient depth in multimodal information fusion in existing methods.

[0076] Meanwhile, the initial weight values of the multimodal fusion model are calculated from the number of input and output connections of the trainable layers, ensuring that the multimodal fusion model has an appropriate weight parameter distribution. This ensures the stable flow of signals and gradients in the deep fusion network, greatly improving the stability and convergence speed of model training. It effectively addresses the gradient vanishing or exploding problem caused in existing studies by neglecting the characteristics of remote sensing data and misapplying general-purpose initializations (such as Xavier or He).

[0077] In some embodiments, as shown in Figure 3, each ViT encoder includes multiple Transformer encoder blocks arranged in processing order. Step S130, in which the N modal input token sequences are input into the N parallel ViT encoders, each ViT encoder iteratively processes the input token sequence of one modality, and a class label synchronization operation is performed on the N output token sequences of the N ViT encoders after each iteration to fuse them into a target class token, includes the following steps:

[0078] E1: Input the N modal input token sequences into the Transformer encoder blocks corresponding to the N ViT encoders, perform attention calculation, and obtain the N output token sequences corresponding to the N Transformer encoder blocks. The output token sequences include class tokens.

[0079] Specifically, the tile embedding layer inputs the input token sequence of each modality into the corresponding Transformer encoder block. Through the query projection layer, key projection layer, and value projection layer within the Transformer encoder block, attention calculations are performed on the input token sequence of that modality, resulting in N output token sequences corresponding to the N Transformer encoder blocks. As shown in Figure 3, each output token sequence includes feature tokens and a class token.

[0080] E2: Concatenate the N class tokens output by the N Transformer encoder blocks along the feature dimension to form a combined class token.

[0081] Specifically, the class token mapping layer maps various tokens to feature dimensions and concatenates them along the feature dimensions to form composite class tokens.

[0082] E3: Fuse the combined class token into a synchronized class token.

[0083] Specifically, the cross-modal fusion layer, which includes a trainable fusion transformation function, transforms the combined class token into the synchronized class token; that is, it performs the class token synchronization operation. For example, one class token synchronization operation (equivalently, the trainable fusion transformation function) can be expressed as $c_{\mathrm{sync}} = W_f\, c_{\mathrm{cat}} + b_f$, where $c_{\mathrm{sync}}$ represents the synchronized class token, $W_f$ represents the token fusion weight matrix, $c_{\mathrm{cat}}$ represents the combined class token, and $b_f$ represents the token fusion bias vector, which is preset. It can be understood that the token fusion weight matrix is one of the initial weight values and is calculated from the numbers of input and output connections of the cross-modal fusion layer among the trainable layers.
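Steps E2 and E3 can be illustrated with the following minimal sketch. It assumes that the trainable fusion transformation is an affine map (token fusion weight matrix times combined class token, plus token fusion bias vector), which is how the terms are described above; the names `W_f` and `b_f` and all numeric values are hypothetical and chosen so that the result is easy to check by hand.

```python
def concat_class_tokens(class_tokens):
    """E2: concatenate the N class tokens along the feature dimension."""
    combined = []
    for token in class_tokens:
        combined.extend(token)
    return combined


def fuse(combined, W_f, b_f):
    """E3: affine map from the combined class token to the synchronized class token."""
    return [sum(w * c for w, c in zip(row, combined)) + b
            for row, b in zip(W_f, b_f)]


# Two modalities (N = 2), embedding dimension D = 2.
class_tokens = [[1.0, 2.0], [3.0, 4.0]]

# Illustrative weights that average the two modalities; in the method these
# would be trainable and initialised from the layer's connection counts.
W_f = [[0.5, 0.0, 0.5, 0.0],
       [0.0, 0.5, 0.0, 0.5]]
b_f = [0.0, 0.0]

combined = concat_class_tokens(class_tokens)   # length N * D = 4
sync_token = fuse(combined, W_f, b_f)          # back to length D = 2
print(combined, sync_token)                    # [1.0, 2.0, 3.0, 4.0] [2.0, 3.0]
```

The weight matrix has D rows and N·D columns, so the synchronized class token has the same dimension as each original class token and can take its place in every modality's sequence.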

[0084] E4: Replace the class token in each output token sequence with the synchronized class token to obtain a new input token sequence, which is fed into the Transformer encoder block next in the processing order. Repeat steps E1 to E4 until the synchronized class token has been calculated from the class tokens output by the last Transformer encoder blocks in the processing order, and then take that synchronized class token as the target class token.
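The iteration over steps E1 to E4 can be sketched as the loop below. The encoder blocks and the fusion function are passed in as plain callables, and the toy stand-ins used in the example (`add_one`, `mean_fuse`) are hypothetical placeholders that only make the control flow checkable; they are not the attention blocks or the trainable fusion layer of the method.

```python
def synchronize_encoders(sequences, encoders, fuse):
    """Run N parallel encoders block by block, synchronizing class tokens.

    sequences: N token sequences, each with its class token at index 0.
    encoders:  N lists of blocks; each block maps a token sequence to a token sequence.
    fuse:      maps the list of N class tokens to one synchronized class token.
    """
    depth = len(encoders[0])
    sync = None
    for layer in range(depth):
        # E1: each modality passes through its own block at this depth.
        outputs = [enc[layer](seq) for enc, seq in zip(encoders, sequences)]
        # E2 + E3: collect the N class tokens and fuse them.
        sync = fuse([out[0] for out in outputs])
        # E4: the synchronized class token replaces every modality's class token.
        sequences = [[list(sync)] + out[1:] for out in outputs]
    # After the last block, the synchronized class token is the target class token.
    return sync


def add_one(seq):
    """Toy stand-in for a Transformer encoder block."""
    return [[v + 1.0 for v in token] for token in seq]


def mean_fuse(class_tokens):
    """Toy stand-in for the fusion step: element-wise mean."""
    return [sum(vals) / len(class_tokens) for vals in zip(*class_tokens)]


encoders = [[add_one, add_one], [add_one, add_one]]   # N = 2 encoders, 2 blocks each
sequences = [[[0.0], [1.0]], [[2.0], [3.0]]]          # class token first
target = synchronize_encoders(sequences, encoders, mean_fuse)
print(target)  # [3.0]
```

Because the replacement happens after every block rather than once at the end, each modality's later blocks attend to a class token that already carries information from the other modalities.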

[0085] This example shows how steps E1 to E4 are iterated over the N modal input token sequences. The synchronized token fusion mechanism aligns and shares semantics across modalities during the encoding stage, which enhances cross-modal semantic consistency and establishes a unified semantic feature base. The mechanism can also capture key semantic relationships among the multimodal remote sensing image data. The resulting shared feature is the core representation from which the multimodal fusion model generates its multi-label classification results, and its quality directly affects both the accuracy of those results and the proportion of results that pass validation.

[0086] In some embodiments, step S130 (inputting the target class token into the classifier, outputting the multi-label classification result corresponding to the multimodal remote sensing image data, repeating the steps from acquiring the original remote sensing image data to outputting the multi-label classification result until the model convergence condition is met, and constructing the remote sensing image interpretation model based on the multimodal fusion model) includes the following steps:

[0087] Step S131: During iterative training, from the step of acquiring the original remote sensing image data to the step of outputting the multi-label classification result, calculate the macro-average accuracy of the multi-label classification result obtained in each training iteration. The macro-average accuracy serves as a reward signal that guides how the multimodal fusion model is adjusted in the next training iteration.

[0088] Step S132: When the macro-average accuracy reaches the preset performance threshold, the optimized multimodal fusion model is obtained.

[0089] Step S133: Input the new original remote sensing image data into the optimized multimodal fusion model to obtain new multi-label classification results.

[0090] Step S134: Use micro-average accuracy, macro-average accuracy and F1-score to verify the new multi-label classification results and determine whether the multi-label classification results meet the standards.

[0091] Step S135: If the standards are not met, calculate the loss value of the optimized multimodal fusion model based on the multi-label classification results that fail them. In the backpropagation stage of the optimized multimodal fusion model, adjust the initial weight values of the trainable layers of the optimized multimodal fusion model according to the loss value until the multi-label classification results meet the standards, at which point the remote sensing image interpretation model is constructed.

[0092] Understandably, using the macro-average accuracy as both the reward signal and the indicator aims to maximize the classification results that pass validation and minimize those that fail, ultimately achieving stable multi-label classification results for remote sensing images.
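The macro-average and micro-average indicators are not defined in this passage, so the following sketch is an interpretation rather than the method's own definition. It assumes that the per-label score is precision, tp / (tp + fp); this choice is made because, with plain per-label binary accuracy, the macro and micro averages coincide whenever every sample carries a decision for every label. The function names and the toy labels are hypothetical.

```python
def label_counts(y_true, y_pred):
    """Return (tp, fp, fn) for each label from 0/1 multi-label matrices."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if p and t:
                counts[j][0] += 1      # true positive
            elif p and not t:
                counts[j][1] += 1      # false positive
            elif t and not p:
                counts[j][2] += 1      # false negative
    return counts


def macro_average(counts):
    """Score each label separately (ASSUMED: precision), then average the scores."""
    scores = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _ in counts]
    return sum(scores) / len(scores)


def micro_average(counts):
    """Pool the counts over all labels, then score once (ASSUMED: precision)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    return tp / (tp + fp) if tp + fp else 0.0


def micro_f1(counts):
    """F1-score computed from the pooled counts."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Four samples, three labels (toy data).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]
counts = label_counts(y_true, y_pred)
macro, micro, f1 = macro_average(counts), micro_average(counts), micro_f1(counts)
```

For this toy data the per-label counts are (2, 1, 0), (1, 0, 1), and (1, 1, 1), giving a macro average of about 0.722 and a micro average and F1-score of about 0.667. The macro average weights every label equally, so rare labels influence it as much as frequent ones.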

[0093] Referring to Figure 4, the above embodiments are further illustrated below through an application scenario.

[0094] A remote sensing image interpretation method for multimodal information synchronization and model parameter optimization includes the following steps:

[0095] Step S1: Acquire the raw remote sensing image data and process it into N modal input token sequences.

[0096] Step S2: Input the N modal input token sequences into the N parallel ViT encoders of the multimodal fusion model. Each ViT encoder iteratively processes the input token sequence of one modality. Perform the class token synchronization operation on the N output token sequences of the N ViT encoders after each iteration to fuse them into the target class token.

[0097] Step S3: Input the target class token into the classifier and output the multi-label classification results corresponding to the multimodal remote sensing image data.

[0098] Step S4: Calculate the macro-average accuracy of the multi-label classification results obtained in each training cycle.

[0099] Step S5: Determine whether the macro-average accuracy has reached the preset performance threshold. If not, proceed to step S6; if yes, proceed to step S7.

[0100] Step S6: Use the macro-average accuracy as a reward signal to guide how the multimodal fusion model is adjusted in the next training cycle.

[0101] Step S7: Take the multimodal fusion model obtained in the current training cycle as the optimized multimodal fusion model.

[0102] Step S8: Input the new original remote sensing image data into the optimized multimodal fusion model to obtain new multi-label classification results.

[0103] Step S9: Validate the new multi-label classification results using micro-average accuracy, macro-average accuracy, and F1-score to determine if the multi-label classification results meet the standards. If not, proceed to step S10; if they meet the standards, proceed to step S11.

[0104] Step S10: Calculate the loss value of the optimized multimodal fusion model based on the substandard multi-label classification results. During the backpropagation stage of the optimized multimodal fusion model, adjust the initial weight values of the trainable layers of the optimized multimodal fusion model according to the loss value.

[0105] Step S11: Determine the optimized multimodal fusion model as the remote sensing image interpretation model.
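The control flow of steps S1 to S11 can be summarized by the sketch below. All callables (`train_round`, `adjust_with_reward`, `validate`, `backprop_update`), the `max_rounds` guard, and the toy model are hypothetical stand-ins introduced only to make the two loops explicit; how the reward signal actually adjusts the model is not specified here and is left to the supplied callable.

```python
def build_interpretation_model(model, threshold, train_round, adjust_with_reward,
                               validate, backprop_update, max_rounds=50):
    """Two-phase loop: reward-guided training, then validation with backpropagation."""
    # Phase 1 (S1-S7): train until the macro-average indicator reaches the threshold.
    for _ in range(max_rounds):
        macro = train_round(model)            # S1-S4: tokenize, encode, classify, score
        if macro >= threshold:                # S5: threshold check
            break                             # S7: current model is the optimized model
        adjust_with_reward(model, macro)      # S6: reward-guided adjustment
    else:
        raise RuntimeError("performance threshold not reached")

    # Phase 2 (S8-S11): validate on new data and update until the standards are met.
    for _ in range(max_rounds):
        if validate(model):                   # S8-S9: new data, metric-based validation
            return model                      # S11: interpretation model
        backprop_update(model)                # S10: loss-driven weight adjustment
    raise RuntimeError("validation standards not met")


# Toy model whose 'skill' (in percent) rises by 10 with every adjustment.
toy = {"skill": 50}


def bump(model, *_):
    model["skill"] += 10


final = build_interpretation_model(
    toy,
    threshold=80,
    train_round=lambda m: m["skill"],
    adjust_with_reward=bump,
    validate=lambda m: m["skill"] >= 90,
    backprop_update=bump,
)
print(final["skill"])  # 90
```

The explicit round limit is an addition for safety in the sketch, so that a model which never reaches the threshold or the validation standards cannot loop indefinitely.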

[0106] Those skilled in the art will understand that all or part of the steps of the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When executed, the program performs the steps of the above-described method embodiments; and the aforementioned storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical disks.

[0107] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

Claims

1. A method for interpreting remote sensing images with multimodal information synchronization and model parameter optimization, characterized in that the method comprises: constructing a multimodal fusion model based on a Transformer baseline network, wherein the initial weight values of the multimodal fusion model are calculated based on the number of input connections and the number of output connections of the trainable layers, and the multimodal fusion model includes a classifier and N parallel ViT encoders; acquiring raw remote sensing image data and processing it into N modal input token sequences; E1: inputting the N modal input token sequences into the Transformer encoder blocks corresponding to the N ViT encoders, performing attention calculation, and obtaining the N output token sequences corresponding to the N Transformer encoder blocks, wherein the output token sequences include class tokens; E2: concatenating the N class tokens output by the N Transformer encoder blocks along the feature dimension to form a combined class token; E3: fusing the combined class token into a synchronized class token; E4: replacing the class token of each output token sequence with the synchronized class token as the new class token to obtain a new input token sequence, the new input token sequence being input into the Transformer encoder block next in the processing order, repeating steps E1 to E4 until the synchronized class token has been calculated from the class tokens of the last Transformer encoder blocks in the processing order, and then determining that synchronized class token as the target class token; inputting the target class token into the classifier and outputting the multi-label classification result corresponding to the multimodal remote sensing image data; repeating the steps from acquiring the raw remote sensing image data to outputting the multi-label classification result until the model convergence condition is met; and constructing a remote sensing image interpretation model based on the multimodal fusion model.

2. The method according to claim 1, characterized in that fusing the combined class token into the synchronized class token comprises: mapping the combined class token back to the original embedding dimension by means of a trainable fusion transformation function to obtain the synchronized class token.

3. The method according to claim 2, characterized in that the expression for the trainable fusion transformation function is as follows: $c_{\mathrm{sync}} = W_f\, c_{\mathrm{cat}} + b_f$, where $c_{\mathrm{sync}}$ represents the synchronized class token, $W_f$ represents the token fusion weight matrix, $c_{\mathrm{cat}}$ represents the combined class token, and $b_f$ represents the token fusion bias vector, which is a preset value; the token fusion weight matrix is one of the initial weight values and is calculated based on the number of input connections and the number of output connections of the cross-modal fusion layer among the trainable layers.

4. The method according to claim 1, characterized in that acquiring raw remote sensing image data and processing it into N modal input token sequences comprises: acquiring raw remote sensing image data, which includes remote sensing image data of N modalities; preprocessing the remote sensing image data of each of the N modalities; dividing the preprocessed remote sensing image data of each modality into multiple non-overlapping image blocks and converting the multiple non-overlapping image blocks into a block embedding token sequence corresponding to the modality; and prepending the class token corresponding to the modality to the block embedding token sequence to form the input token sequence corresponding to the modality, so as to obtain the N modal input token sequences corresponding to the N modalities of remote sensing image data.

5. The method according to claim 1, characterized in that the initial weight values of the multimodal fusion model are calculated as follows: obtaining the number of input connections and the number of output connections of the trainable layers in the multimodal fusion model; calculating the initial upper and lower boundaries of the weight parameters of the trainable layers based on the number of input connections and the number of output connections; and obtaining the initial weight values of the trainable layers of the multimodal fusion model by sampling from a uniform distribution bounded by the initial upper and lower boundaries.

6. The method according to claim 1, characterized in that repeating the steps from acquiring the raw remote sensing image data to outputting the multi-label classification result until the model convergence condition is met, and constructing the remote sensing image interpretation model based on the multimodal fusion model, comprises: during the iterative training from acquiring the raw remote sensing image data to outputting the multi-label classification result, calculating the macro-average accuracy of the multi-label classification result obtained in each training iteration, the macro-average accuracy being used as a reward signal to guide how the multimodal fusion model is adjusted in the next training iteration; obtaining the optimized multimodal fusion model when the macro-average accuracy reaches the preset performance threshold; and constructing the remote sensing image interpretation model based on the optimized multimodal fusion model.

7. The method according to claim 6, wherein constructing the remote sensing image interpretation model based on the optimized multimodal fusion model comprises: acquiring new raw remote sensing image data and inputting it into the optimized multimodal fusion model to obtain new multi-label classification results; validating the new multi-label classification results using micro-average accuracy, macro-average accuracy, and F1-score to determine whether the multi-label classification results meet the standards; and, if the standards are not met, calculating the loss value of the optimized multimodal fusion model based on the multi-label classification results that fail them, and, during the backpropagation stage of the optimized multimodal fusion model, adjusting the initial weight values of the trainable layers of the optimized multimodal fusion model according to the loss value until the multi-label classification results meet the standards and the remote sensing image interpretation model is constructed.

8. The method according to claim 1, characterized in that the multimodal fusion model further includes a class token mapping layer and a cross-modal fusion layer, and the trainable layers include the tile embedding layer, query projection layer, key projection layer, and value projection layer in each ViT encoder, the class token mapping layer, the cross-modal fusion layer, and the classification head layer of the classifier.

9. The method according to claim 1, characterized in that the multi-label classification results are used to indicate whether land use / land cover has changed, or whether floods, earthquakes, landslides, or eutrophication have occurred.