A Chinese semantic understanding method for remote sensing image-text retrieval

By constructing a language model and a multi-level semantic parsing framework for the Chinese remote sensing domain, the problems of insufficient text understanding and inaccurate entity recognition in Chinese remote sensing image and text retrieval were solved, thereby improving the retrieval recall rate.

CN122309715A · Pending · Publication Date: 2026-06-30 · CHINA UNIV OF MINING & TECH +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: CHINA UNIV OF MINING & TECH
Filing Date: 2026-05-29
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing cross-modal image-text retrieval methods suffer from problems such as insufficient understanding of Chinese remote sensing text descriptions, ambiguous word segmentation, inaccurate entity recognition, insufficient spatial relationship modeling, and neglect of attribute information in Chinese remote sensing image retrieval, resulting in low retrieval recall rates.

Method used

A language model for the Chinese remote sensing domain is constructed, and a multi-level semantic parsing framework is designed. Through preprocessing and word segmentation, remote sensing domain-specific vocabulary and composite entities are identified, deep semantic features are extracted, a hierarchical semantic structure tree is constructed and converted into a graph structure representation, and cross-modal matching is performed.

Benefits of technology

It significantly improves the accuracy of Chinese remote sensing image and text retrieval and achieves deep semantic understanding of Chinese remote sensing query text.


Abstract

This invention discloses a Chinese semantic understanding method for remote sensing image-text retrieval, belonging to the field of remote sensing image processing technology. The method includes: acquiring a Chinese text query; preprocessing and segmenting the text; identifying remote sensing domain-specific vocabulary and composite entities; extracting deep semantic features of the text based on a pre-trained language model in the Chinese remote sensing domain; obtaining multi-granular representations at the word and sentence levels; parsing entity phrases, spatial relationships, and attribute information in the text; constructing a hierarchical semantic structure tree; converting the hierarchical semantic structure tree into a graph structure representation; generating a text semantic graph containing nodes, edges, and attribute constraints; performing cross-modal matching based on the text semantic graph and image features; and outputting the retrieval results. This invention, by constructing a language model in the Chinese remote sensing domain and designing a multi-level semantic parsing framework, achieves deep semantic understanding of Chinese remote sensing query text, significantly improving the accuracy of remote sensing image-text retrieval in Chinese scenarios.

Description

Technical Field

[0001] This invention relates to the field of remote sensing image processing technology, and in particular to a Chinese semantic understanding method for remote sensing image and text retrieval.

Background Technology

[0002] Current mainstream cross-modal image-text retrieval methods are based on deep learning, with CLIP (Contrastive Language-Image Pre-training) and its derivative models being typical examples. These methods generally employ a dual-tower architecture, extracting global image features and semantic text features via a visual encoder and a text encoder respectively. Then, a contrastive learning mechanism maps these dissimilar features to a shared semantic space, ultimately achieving image-text matching using cosine similarity. This approach has demonstrated excellent performance in natural image retrieval tasks.

[0003] However, directly applying the above methods to Chinese text remote sensing image retrieval has the following technical drawbacks:

[0004] First, Chinese remote sensing text descriptions possess unique linguistic characteristics, including a rich vocabulary of locative terms (such as "east side," "adjacent," and "surrounding"), complex expressions of spatial relationships (such as "located between..." and "distributed in a strip"), and a large number of domain-specific terms (such as "alluvial fan," "terraced fields," and "viaduct"). Existing methods mostly employ general pre-trained language models, lacking targeted modeling for the linguistic characteristics of the Chinese remote sensing domain, resulting in insufficient semantic understanding of the text and difficulty in accurately capturing user search intent.

[0005] Second, Chinese text suffers from problems such as ambiguous word segmentation and blurred entity boundaries. Especially in the field of remote sensing, entities such as place names, land feature types, and spatial orientations are intertwined. Existing general named entity recognition methods are unable to accurately identify complex entities in the field of remote sensing, leading to deviations in subsequent semantic parsing.

[0006] Third, Chinese spatial relationship expressions are highly flexible; the same spatial layout can be described using various sentence structures, such as "the building is located on the left side of the road," "there is a building on the right side of the road," and "the building is on the left side of the road," expressing the same layout relationship. Existing methods lack the ability to model synonym transformations of Chinese spatial relationships, failing to establish a unified semantic representation and reducing retrieval recall.

[0007] Fourth, Chinese remote sensing texts often contain implicit spatial quantification and temporal information. For example, descriptions such as "large areas of farmland," "newly built airport," and "dried-up lake" involve attributes such as area, time, and state. Existing methods only focus on entities and spatial relationships, neglecting the extraction and inference of attribute information, resulting in incomplete semantic understanding of complex descriptions.

Summary of the Invention

[0008] The purpose of this invention is to provide a Chinese semantic understanding method for remote sensing image and text retrieval. By constructing a Chinese remote sensing language model and designing a multi-level semantic parsing framework, it achieves deep semantic understanding of Chinese remote sensing query text, significantly improving the performance of remote sensing image and text retrieval in Chinese scenarios.

[0009] To achieve the above objectives, this invention provides a Chinese semantic understanding method for remote sensing image and text retrieval, comprising:

[0010] S1. Obtain Chinese text query, preprocess and segment the text, and identify remote sensing domain-specific terms and compound entities;

[0011] S2. Based on a pre-trained language model in the field of Chinese remote sensing, extract deep semantic features of the text and obtain multi-granularity representations at the word and sentence levels;

[0012] S3. Analyze the entity phrases, spatial relationships and attribute information in the text to construct a hierarchical semantic structure tree;

[0013] S4. Convert the hierarchical semantic structure tree into a graph structure representation to generate a text semantic graph containing nodes, edges, and attribute constraints;

[0014] S5. Perform cross-modal matching based on text semantic graphs and image features, and output retrieval results.

[0015] Furthermore, the specific steps in S1 for obtaining Chinese text queries, preprocessing and segmenting the text, and identifying remote sensing domain-specific terms and compound entities include:

[0016] S1.1. Input the Chinese query text $T = \{c_1, c_2, \ldots, c_L\}$, where $c_i$ is the i-th character and $L$ is the text length. Initial word segmentation is performed using a maximum matching algorithm assisted by a domain dictionary $D$, which includes remote sensing feature types, geographical names, spatial location terms, and attribute description terms;

[0017] S1.2. For candidate entities in the word segmentation results, conditional random fields or bidirectional long short-term memory networks are used for entity recognition in the remote sensing domain. The recognition labels include: place name, land cover type, spatial location, attribute features, quantifiers, and time words. The entity recognition formula is as follows:

[0018] $P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{L} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$;

[0019] where $x$ is the given input sequence, $y$ is the label sequence, $f_k$ is the $k$-th feature function, $\lambda_k$ is its weight parameter, $Z(x)$ is the normalization factor, $K$ is the total number of feature functions, $y_i$ is the label at the current position, and $y_{i-1}$ is the label at the previous position;

[0020] S1.3. For the identified composite entities, an attention-based entity linking model aligns the entities with standard concepts in the remote sensing knowledge graph. The entity linking formula is as follows:

[0021] $s(m, c) = \mathrm{MLP}_{link}\left([z_m; z_c; h_{ctx}]\right)$;

[0022] where $m$ is an entity mention in the text, $c$ is a candidate concept in the knowledge graph, $z_m$ and $z_c$ are the embedded representations of the entity and the concept respectively, $h_{ctx}$ is the context representation, $\mathrm{MLP}_{link}$ is the multilayer perceptron used to score the link relationship, and $s(m, c)$ is the link score.

[0023] Furthermore, the specific steps in S2 for extracting deep semantic features of the text based on a pre-trained language model in the Chinese remote sensing domain, and obtaining multi-granular representations at the word and sentence levels, include:

[0024] S2.1 Construct a pre-training corpus for the Chinese remote sensing field, including remote sensing image titles, geographic information descriptions, remote sensing interpretation reports, and domain encyclopedia texts, with a corpus size of no less than 5 million characters;

[0025] S2.2. Starting from a general Chinese pre-trained model, a domain-adaptive pre-training strategy is adopted, comprising a whole-word masking language modeling task and a domain entity prediction task. The loss function is defined as follows:

[0026] $\mathcal{L} = \mathcal{L}_{wwm} + \beta\, \mathcal{L}_{ent}$;

[0027] where $\mathcal{L}_{wwm}$ is the whole-word mask prediction loss, $\mathcal{L}_{ent}$ is the domain entity category prediction loss, and $\beta$ is the task weighting coefficient;

[0028] S2.3. The pre-trained Chinese remote sensing domain language model $\mathrm{LM}_{RS}$ encodes the query text to obtain word-level features $H = \{h_1, h_2, \ldots, h_L\}$ and a sentence-level feature $s$. The sentence-level feature is taken from the output of the [CLS] token or computed by attention pooling:

[0029] $\alpha_i = \frac{\exp(w^{\top} h_i)}{\sum_{j=1}^{L} \exp(w^{\top} h_j)}$;

[0030] $s = \sum_{i=1}^{L} \alpha_i h_i$;

[0031] where $h_i$ is the hidden state of the i-th character, $w$ is a learnable attention vector, $w^{\top}$ is the transpose of the attention vector, and $\alpha_i$ is the attention weight of the i-th element.

[0032] Furthermore, the specific steps in S3 for parsing entity phrases, spatial relationships, and attribute information in the text to construct a hierarchical semantic structure tree include:

[0033] S3.1. A dependency parsing model analyzes the syntactic structure of the text and identifies the core predicates and their argument structures. Dependency parsing is implemented with a graph neural network or a transition system and outputs the set of dependency arcs $A = \{(u_h, u_d, r_{dep})\}$, where $u_h$ is the governing (head) word, $u_d$ is the dependent word, and $r_{dep}$ is the dependency relation type;

[0034] S3.2. A hybrid method combining spatial relation extraction rule templates with a neural network extracts spatial relation triples $(e_1, r_{sp}, e_2)$ from the dependency structure, where $e_1$ and $e_2$ are spatial entities and $r_{sp}$ is the spatial relation type, which covers topological, directional, and distance relations. The spatial relation classification score is calculated as follows:

[0035] $s(r_{sp} \mid e_1, e_2) = \mathrm{MLP}_{sp}\left([h_{e_1}; h_{e_2}; p_{dep}]\right)$;

[0036] where $h_{e_1}$ and $h_{e_2}$ are the contextual representations of entities $e_1$ and $e_2$, $\mathrm{MLP}_{sp}$ is the multilayer perceptron for spatial relation prediction, and $p_{dep}$ is the dependency path representation, obtained by encoding the dependency arc sequence with a bidirectional LSTM:

[0037] $p_{dep} = [\overrightarrow{h}_{path}; \overleftarrow{h}_{path}]$;

[0038] where $[\cdot\,;\cdot]$ denotes feature concatenation, $\overrightarrow{h}_{path}$ is the feature encoded from the beginning of the sequence to the end, and $\overleftarrow{h}_{path}$ is the feature encoded from the end of the sequence to the beginning;

[0039] S3.3. A prompt-learning-based attribute extraction method identifies attribute information in the text and constructs attribute triples $(e, a, v)$, where $e$ is the entity, $a$ is the attribute type, and $v$ is the attribute value. The template filling formula is as follows:

[0040] ;

[0041] The probability distribution over attribute values is predicted with a masked language model:

[0042] $P(v \mid e, a) = \frac{\exp(h_{[MASK]}^{\top} z_v)}{\sum_{v' \in V_a} \exp(h_{[MASK]}^{\top} z_{v'})}$;

[0043] where $h_{[MASK]}$ is the hidden state of the [MASK] token in the last layer of the language model, $z_v$ is the embedding vector of candidate attribute value $v$, and $V_a$ is the set of candidate values for attribute $a$;

[0044] S3.4. A hierarchical semantic structure tree is constructed by integrating the entity, spatial relation, and attribute information:

[0045] $\mathcal{T} = (V, E)$, where the node set $V$ includes the scene root node, entity nodes, and attribute nodes, and the edge set $E$ represents compositional relations, spatial relations, and attribute modification relations.

[0046] Furthermore, in step S4, the hierarchical semantic structure tree is converted into a graph structure representation, generating a text semantic graph containing nodes, edges, and attribute constraints:

[0047] S4.1. Entity nodes in the semantic structure tree are mapped to graph nodes. The node feature $n_i$ is composed of the contextual representation of the entity mention and the entity type embedding:

[0048] $n_i = [h_i^{ctx}; t_i]$;

[0049] where $h_i^{ctx}$ is the context encoding of the entity mention and $t_i$ is the embedding vector of the entity type;

[0050] S4.2. Spatial relations are mapped to graph edges. The edge feature $r_{ij}$ consists of the relation type embedding and the relation strength encoding; the relation strength is calculated from certainty words or modifiers in the text:

[0051] $r_{ij} = [t_{ij}^{rel}; \sigma_{ij}]$;

[0052] where $t_{ij}^{rel}$ is the relation type embedding and $\sigma_{ij}$ is the relation strength.

[0053] S4.3. Attribute information is encoded as node attribute constraints on the entities. The attribute constraint set is $C_i = \{(a_k, v_k, q_k)\}$, where $a_k$ is the attribute type, $v_k$ is the attribute value, and $q_k$ is the confidence level. The attribute constraints serve as filtering conditions in the subsequent graph matching stage;

[0054] S4.4. The final text semantic graph is represented as $G_T = (V_T, E_T, C_T)$, where $V_T = \{n_1, \ldots, n_M\}$ is the set of entity nodes, $E_T$ is the set of spatial relation edges, $C_T$ is the set of node attribute constraints, and $M$ is the number of entities.

[0055] Furthermore, the specific steps in S5 for cross-modal matching based on text semantic graphs and image features to output retrieval results include:

[0056] S5.1. An object detection network extracts candidate object regions from the image and generates a set of region visual features $\{f_i\}$ and a set of region position encodings $\{p_i\}$, which are combined into the image region features $\{o_i\}$;

[0057] S5.2. Based on the spatial relations between regions, geometric features between image regions are calculated and an image semantic graph is constructed. Its node features are the region fusion features, and its edge features encode the relative position, distance, and orientation angle between regions.

[0058] S5.3. A cross-modal graph matching network calculates the similarity between the text semantic graph $G_T$ and the image semantic graph $G_I$, comprising the node-level matching score $S_{node}$, the edge-level matching score $S_{edge}$, and the attribute constraint satisfaction $S_{attr}$:

[0059] $S_{node} = \frac{1}{M} \sum_{j=1}^{M} \max_{1 \le i \le N} \cos(n_j, o_i)$;

[0060] $S_{edge} = \frac{1}{|E_T|} \sum_{(j,l) \in E_T} \max_{(i,k)} \cos(r_{jl}, g_{ik})$;

[0061] $S_{attr} = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{K_j} \sum_{k=1}^{K_j} \mathbb{1}(\cdot)$;

[0062] where $M$ is the number of text nodes, $n_j$ is the node feature of the j-th text node, $N$ is the number of candidate nodes on the image side, $o_i$ is the feature of the i-th candidate entity, $\cos(\cdot,\cdot)$ is the cosine similarity, $|E_T|$ is the number of text edges, $r_{jl}$ is the relation feature between text nodes $j$ and $l$, $g_{ik}$ is the relation feature between image nodes $i$ and $k$, $K_j$ is the number of attributes of the j-th node, and $\mathbb{1}(\cdot)$ is an indicator function;

[0063] S5.4. Combine multi-level matching scores to calculate the final image-text similarity:

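[0064] $S = \omega_1 S_{node} + \omega_2 S_{edge} + \omega_3 S_{attr}$;

[0065] where $S_{node}$, $S_{edge}$, and $S_{attr}$ are the node-level matching score, the edge-level matching score, and the attribute constraint satisfaction from S5.3, and $\omega_1$, $\omega_2$, and $\omega_3$ are learnable fusion weights;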

[0066] S5.5 Sort the image library according to similarity scores and output the Top-K search results.

[0067] Beneficial Effects: This invention provides a Chinese semantic understanding method for remote sensing image-text retrieval. It acquires Chinese text queries, preprocesses and segments the text, and identifies remote sensing-specific vocabulary and complex entities. Based on a pre-trained language model in the Chinese remote sensing domain, it extracts deep semantic features from the text, obtaining multi-granular representations at the word and sentence levels. It analyzes entity phrases, spatial relationships, and attribute information in the text to construct a hierarchical semantic structure tree. This hierarchical semantic structure tree is then converted into a graph structure representation, generating a text semantic graph containing nodes, edges, and attribute constraints. Cross-modal matching is performed based on the text semantic graph and image features to output retrieval results. This invention, by constructing a language model in the Chinese remote sensing domain and designing a multi-level semantic parsing framework, achieves deep semantic understanding of Chinese remote sensing query text, significantly improving the accuracy of image-text retrieval in Chinese scenarios.

Description of the Attached Figures

[0068] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below:

[0069] Figure 1 is a flowchart of the Chinese semantic understanding method for remote sensing image and text retrieval according to the present invention;

[0070] Figure 2 is a flowchart of the Chinese remote sensing entity recognition and word segmentation process of the present invention;

[0071] Figure 3 is a diagram of the architecture of the pre-trained language model for the Chinese remote sensing domain of the present invention;

[0072] Figure 4 is a flowchart of the hierarchical semantic parsing and graph construction process of the present invention;

[0073] Figure 5 is a flowchart of the cross-modal graph matching and retrieval process of the present invention.

Detailed Implementation

[0074] The embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, but should not be construed as limiting the present invention.

[0075] As shown in Figure 1, this invention provides a Chinese semantic understanding method for remote sensing image and text retrieval, including:

[0076] S1. Obtain Chinese text query, preprocess and segment the text, and identify remote sensing domain-specific terms and compound entities;

[0077] S2. Based on a pre-trained language model in the field of Chinese remote sensing, extract deep semantic features of the text and obtain multi-granularity representations at the word and sentence levels;

[0078] S3. Analyze the entity phrases, spatial relationships and attribute information in the text to construct a hierarchical semantic structure tree;

[0079] S4. Convert the hierarchical semantic structure tree into a graph structure representation to generate a text semantic graph containing nodes, edges, and attribute constraints;

[0080] S5. Perform cross-modal matching based on text semantic graphs and image features, and output retrieval results.

[0081] Furthermore, as shown in Figure 2, the specific steps in S1 for obtaining the Chinese text query, preprocessing and segmenting the text, and identifying remote sensing domain-specific terms and compound entities are as follows:

[0082] S1.1. Input the Chinese query text $T = \{c_1, c_2, \ldots, c_L\}$, where $c_i$ is the i-th character and $L$ is the text length. Initial word segmentation is performed using a maximum matching algorithm assisted by a domain dictionary $D$, which includes remote sensing feature types, geographical names, spatial location terms, and attribute description terms;
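The dictionary-assisted maximum matching of S1.1 can be sketched as follows. This is a minimal forward maximum matching illustration, not the patented implementation: the function name `forward_max_match` and the tiny `DOMAIN_DICT` are hypothetical, and characters that match no dictionary entry simply fall back to single-character tokens.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: a single character
        # Try the longest window first, shrinking down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical domain dictionary: land feature terms, a location term, a verb.
DOMAIN_DICT = {"冲积扇", "梯田", "高架桥", "东侧", "位于"}

tokens = forward_max_match("高架桥位于梯田东侧", DOMAIN_DICT)
print(tokens)  # ['高架桥', '位于', '梯田', '东侧']
```

Greedy matching alone cannot resolve every segmentation ambiguity, which is why the method passes the candidates on to the sequence labeling model of S1.2.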

[0083] S1.2. For candidate entities in the word segmentation results, conditional random fields or bidirectional long short-term memory networks are used for entity recognition in the remote sensing domain. The recognition labels include: place name, land cover type, spatial location, attribute features, quantifiers, and time words. The entity recognition formula is as follows:

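[0084] $P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{L} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$;

[0085] where $x$ is the given input sequence, $y$ is the label sequence, $f_k$ is the $k$-th feature function, $\lambda_k$ is its weight parameter, $Z(x)$ is the normalization factor, $K$ is the total number of feature functions, $y_i$ is the label at the current position, and $y_{i-1}$ is the label at the previous position;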
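To make the role of the normalization factor concrete, the sketch below scores label sequences with a linear-chain conditional random field. It assumes the weighted feature-function sum has already been collapsed into per-position emission scores and label-to-label transition scores; the tables `EMIT` and `TRANS` are toy numbers chosen for illustration, not values from the patent.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(emit, trans, labels):
    """Unnormalized log-score of one label sequence."""
    score = emit[0][labels[0]]
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + emit[i][labels[i]]
    return score


def log_partition(emit, trans):
    """Log of the normalization factor, computed with the forward algorithm."""
    n_labels = len(emit[0])
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            emit[i][cur]
            + logsumexp([alpha[prev] + trans[prev][cur] for prev in range(n_labels)])
            for cur in range(n_labels)
        ]
    return logsumexp(alpha)


def sequence_probability(emit, trans, labels):
    """Probability of a label sequence given the input."""
    return math.exp(sequence_score(emit, trans, labels) - log_partition(emit, trans))


# Toy scores: 3 characters, 2 labels (0 = other, 1 = land cover type).
EMIT = [[0.2, 1.5], [0.1, 1.2], [1.0, 0.3]]
TRANS = [[0.5, -0.2], [-0.4, 0.8]]

all_sequences = list(product(range(2), repeat=3))
total_probability = sum(sequence_probability(EMIT, TRANS, s) for s in all_sequences)
best_labels = max(all_sequences, key=lambda s: sequence_score(EMIT, TRANS, s))
```

Enumerating every sequence is only feasible for toy inputs; a practical decoder would use the Viterbi algorithm instead of the brute-force `max`.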

[0086] S1.3 For the identified composite entities, an attention-based entity linking model is used to align the entities with standard concepts in the remote sensing knowledge graph. The entity linking formula is as follows:

[0087] $s(m, c) = \mathrm{MLP}_{link}\left([z_m; z_c; h_{ctx}]\right)$;

[0088] where $m$ is an entity mention in the text, $c$ is a candidate concept in the knowledge graph, $z_m$ and $z_c$ are the embedded representations of the entity and the concept respectively, $h_{ctx}$ is the context representation, $\mathrm{MLP}_{link}$ is the multilayer perceptron used to score the link relationship, and $s(m, c)$ is the link score.

[0089] Furthermore, as shown in Figure 3, the specific steps in S2 for extracting deep semantic features of the text based on a pre-trained language model in the Chinese remote sensing domain, and obtaining multi-granular representations at the word and sentence levels, are as follows:

[0090] S2.1 Construct a pre-training corpus for the Chinese remote sensing field, including remote sensing image titles, geographic information descriptions, remote sensing interpretation reports, and domain encyclopedia texts, with a corpus size of no less than 5 million characters;

[0091] S2.2. Starting from a general Chinese pre-trained model, a domain-adaptive pre-training strategy is adopted, comprising a whole-word masking language modeling task and a domain entity prediction task. The loss function is defined as follows:

[0092] $\mathcal{L} = \mathcal{L}_{wwm} + \beta\, \mathcal{L}_{ent}$;

[0093] where $\mathcal{L}_{wwm}$ is the whole-word mask prediction loss, $\mathcal{L}_{ent}$ is the domain entity category prediction loss, and $\beta$ is the task weighting coefficient;

[0094] S2.3. The pre-trained Chinese remote sensing domain language model $\mathrm{LM}_{RS}$ encodes the query text to obtain word-level features $H = \{h_1, h_2, \ldots, h_L\}$ and a sentence-level feature $s$. The sentence-level feature is taken from the output of the [CLS] token or computed by attention pooling:

[0095] $\alpha_i = \frac{\exp(w^{\top} h_i)}{\sum_{j=1}^{L} \exp(w^{\top} h_j)}$;

[0096] $s = \sum_{i=1}^{L} \alpha_i h_i$;

[0097] where $h_i$ is the hidden state of the i-th character, $w$ is a learnable attention vector, $w^{\top}$ is the transpose of the attention vector, and $\alpha_i$ is the attention weight of the i-th element.
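The attention pooling option of S2.3 can be illustrated with plain Python lists standing in for hidden states. This is a sketch under the assumption that each character's score is the dot product of the attention vector with its hidden state, followed by a softmax; `attention_pool` and the toy `HIDDEN` matrix are illustrative names, and a real model would learn the attention vector during training.

```python
import math


def attention_pool(hidden_states, attention_vector):
    """Return (attention weights, pooled sentence vector)."""
    scores = [
        sum(w * h for w, h in zip(attention_vector, state))
        for state in hidden_states
    ]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [
        sum(weight * state[d] for weight, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, pooled


# Three characters with 2-dimensional hidden states (toy values).
HIDDEN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero attention vector gives uniform weights, so pooling reduces to the mean.
uniform_weights, mean_vector = attention_pool(HIDDEN, [0.0, 0.0])

# A vector aligned with the first character concentrates weight on it.
focused_weights, focused_vector = attention_pool(HIDDEN, [4.0, -4.0])
```

Compared with taking the [CLS] output, pooling lets the sentence feature emphasize the characters that carry entity or spatial information.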

[0098] Furthermore, as shown in Figure 4, the specific steps in S3 for parsing entity phrases, spatial relationships, and attribute information in the text to construct a hierarchical semantic structure tree are as follows:

[0099] S3.1. A dependency parsing model analyzes the syntactic structure of the text and identifies the core predicates and their argument structures. Dependency parsing is implemented with a graph neural network or a transition system and outputs the set of dependency arcs $A = \{(u_h, u_d, r_{dep})\}$, where $u_h$ is the governing (head) word, $u_d$ is the dependent word, and $r_{dep}$ is the dependency relation type;

[0100] S3.2. A hybrid method combining spatial relation extraction rule templates with a neural network extracts spatial relation triples $(e_1, r_{sp}, e_2)$ from the dependency structure, where $e_1$ and $e_2$ are spatial entities and $r_{sp}$ is the spatial relation type, which covers topological, directional, and distance relations. The spatial relation classification score is calculated as follows:

[0101] $s(r_{sp} \mid e_1, e_2) = \mathrm{MLP}_{sp}\left([h_{e_1}; h_{e_2}; p_{dep}]\right)$;

[0102] where $h_{e_1}$ and $h_{e_2}$ are the contextual representations of entities $e_1$ and $e_2$, $\mathrm{MLP}_{sp}$ is the multilayer perceptron for spatial relation prediction, and $p_{dep}$ is the dependency path representation, obtained by encoding the dependency arc sequence with a bidirectional LSTM:

[0103] $p_{dep} = [\overrightarrow{h}_{path}; \overleftarrow{h}_{path}]$;

[0104] where $[\cdot\,;\cdot]$ denotes feature concatenation, $\overrightarrow{h}_{path}$ is the feature encoded from the beginning of the sequence to the end, and $\overleftarrow{h}_{path}$ is the feature encoded from the end of the sequence to the beginning;
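The rule-template half of the hybrid extractor in S3.2 can be sketched with regular expressions that map different sentence patterns onto one canonical triple. The patterns, the `DIRECTIONS` table, and the relation names such as `left_of` are assumptions made for illustration; the patent operates on dependency structures and pairs the templates with a neural classifier, neither of which is modeled here.

```python
import re

# Hypothetical mapping from Chinese direction words to canonical relation names.
DIRECTIONS = {
    "左侧": "left_of", "右侧": "right_of", "东侧": "east_of",
    "西侧": "west_of", "南侧": "south_of", "北侧": "north_of",
}
_DIR = "|".join(DIRECTIONS)

# Each template captures subject `a`, reference `b`, and the direction word.
TEMPLATES = [
    # "A 位于/在 B (的) <direction>"
    re.compile(rf"(?P<a>.+?)(?:位于|在)(?P<b>.+?)的?(?P<dir>{_DIR})"),
    # "B (的) <direction> 有 A"
    re.compile(rf"(?P<b>.+?)的?(?P<dir>{_DIR})有(?P<a>.+)"),
]


def extract_spatial_triple(sentence):
    """Return (subject, relation, reference) or None if no template matches."""
    for template in TEMPLATES:
        match = template.fullmatch(sentence)
        if match:
            return (match["a"], DIRECTIONS[match["dir"]], match["b"])
    return None


variants = ["建筑物位于道路的左侧", "道路左侧有建筑物", "建筑物在道路左侧"]
triples = [extract_spatial_triple(sentence) for sentence in variants]
print(triples)
```

All three phrasings normalize to the same triple, which is the unified representation that paragraph [0006] finds missing in existing methods.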

[0105] S3.3. A prompt-learning-based attribute extraction method identifies attribute information in the text and constructs attribute triples $(e, a, v)$, where $e$ is the entity, $a$ is the attribute type, and $v$ is the attribute value. The template filling formula is as follows:

[0106] ;

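[0107] The probability distribution over attribute values is predicted with a masked language model:

[0108] $P(v \mid e, a) = \frac{\exp(h_{[MASK]}^{\top} z_v)}{\sum_{v' \in V_a} \exp(h_{[MASK]}^{\top} z_{v'})}$;

[0109] where $h_{[MASK]}$ is the hidden state of the [MASK] token in the last layer of the language model, $z_v$ is the embedding vector of candidate attribute value $v$, and $V_a$ is the set of candidate values for attribute $a$;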
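The masked-language-model prediction in S3.3 amounts to a softmax over a closed set of candidate attribute values. The sketch below assumes each candidate is scored by the dot product between the hidden state at the masked position and the candidate's embedding; the vectors in `H_MASK` and `AREA_VALUES` are toy numbers, not model outputs.

```python
import math


def attribute_value_distribution(h_mask, candidate_embeddings):
    """Softmax over dot-product scores of candidate attribute values."""
    scores = {
        value: sum(a * b for a, b in zip(h_mask, embedding))
        for value, embedding in candidate_embeddings.items()
    }
    peak = max(scores.values())  # stabilize the exponentials
    exps = {value: math.exp(score - peak) for value, score in scores.items()}
    total = sum(exps.values())
    return {value: e / total for value, e in exps.items()}


# Toy hidden state at the masked position and candidate values for "area".
H_MASK = [1.0, 0.5]
AREA_VALUES = {
    "large": [2.0, 0.0],
    "medium": [0.5, 0.5],
    "small": [-1.0, 1.0],
}

distribution = attribute_value_distribution(H_MASK, AREA_VALUES)
predicted_value = max(distribution, key=distribution.get)
```

Restricting the softmax to the candidate set of one attribute type keeps the prediction from drifting to unrelated vocabulary items.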

[0110] S3.4. A hierarchical semantic structure tree is constructed by integrating the entity, spatial relation, and attribute information:

[0111] $\mathcal{T} = (V, E)$, where the node set $V$ includes the scene root node, entity nodes, and attribute nodes, and the edge set $E$ represents compositional relations, spatial relations, and attribute modification relations.

[0112] Furthermore, as shown in Figure 4, in step S4 the hierarchical semantic structure tree is converted into a graph structure representation, generating a text semantic graph containing nodes, edges, and attribute constraints. The specific steps are as follows:

[0113] S4.1. Entity nodes in the semantic structure tree are mapped to graph nodes. The node feature $n_i$ is composed of the contextual representation of the entity mention and the entity type embedding:

[0114] $n_i = [h_i^{ctx}; t_i]$;

[0115] where $h_i^{ctx}$ is the context encoding of the entity mention and $t_i$ is the embedding vector of the entity type;

[0116] S4.2. Spatial relations are mapped to graph edges. The edge feature $r_{ij}$ consists of the relation type embedding and the relation strength encoding; the relation strength is calculated from certainty words or modifiers in the text:

[0117] $r_{ij} = [t_{ij}^{rel}; \sigma_{ij}]$;

[0118] where $t_{ij}^{rel}$ is the relation type embedding and $\sigma_{ij}$ is the relation strength.

[0119] S4.3. Attribute information is encoded as node attribute constraints on the entities. The attribute constraint set is $C_i = \{(a_k, v_k, q_k)\}$, where $a_k$ is the attribute type, $v_k$ is the attribute value, and $q_k$ is the confidence level. The attribute constraints serve as filtering conditions in the subsequent graph matching stage;

[0120] S4.4. The final text semantic graph is represented as $G_T = (V_T, E_T, C_T)$, where $V_T = \{n_1, \ldots, n_M\}$ is the set of entity nodes, $E_T$ is the set of spatial relation edges, $C_T$ is the set of node attribute constraints, and $M$ is the number of entities.
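One possible in-memory form of the text semantic graph from S4 is sketched below with dataclasses. The class and field names (`TextSemanticGraph`, `AttributeConstraint`, and so on) are hypothetical, and the sketch stores symbolic labels rather than the embedding-based node and edge features the method describes.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeConstraint:
    attr_type: str      # e.g. "area", "time", "state"
    value: str
    confidence: float


@dataclass
class TextSemanticGraph:
    nodes: dict = field(default_factory=dict)        # entity name -> entity type
    edges: list = field(default_factory=list)        # (head, relation, tail)
    constraints: dict = field(default_factory=dict)  # entity name -> constraints

    def add_entity(self, name, entity_type):
        self.nodes[name] = entity_type
        self.constraints.setdefault(name, [])

    def add_relation(self, head, relation, tail):
        # Edges may only connect entities that are already graph nodes.
        for name in (head, tail):
            if name not in self.nodes:
                raise KeyError(f"unknown entity: {name}")
        self.edges.append((head, relation, tail))

    def add_attribute(self, name, attr_type, value, confidence):
        if name not in self.nodes:
            raise KeyError(f"unknown entity: {name}")
        self.constraints[name].append(AttributeConstraint(attr_type, value, confidence))


# Query: "a large area of farmland on the east side of a newly built airport".
graph = TextSemanticGraph()
graph.add_entity("farmland", "land_cover")
graph.add_entity("airport", "land_cover")
graph.add_relation("farmland", "east_of", "airport")
graph.add_attribute("farmland", "area", "large", 0.9)
graph.add_attribute("airport", "time", "newly_built", 0.8)
```

Keeping the constraints separate from the nodes mirrors their role as filters applied during graph matching rather than as part of the node feature itself.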

[0121] Furthermore, as shown in Figure 5, in step S5 cross-modal matching is performed based on the text semantic graph and image features to output retrieval results. The specific steps are as follows:

[0122] S5.1. An object detection network extracts candidate object regions from the image and generates a set of region visual features $\{f_i\}$ and a set of region position encodings $\{p_i\}$, which are combined into the image region features $\{o_i\}$;

[0123] S5.2. Based on the spatial relations between regions, geometric features between image regions are calculated and an image semantic graph is constructed. Its node features are the region fusion features, and its edge features encode the relative position, distance, and orientation angle between regions.
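The geometric quantities named in S5.2 can be computed directly from detector bounding boxes. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` in pixel coordinates and measures everything between box centers; the function names and the returned dictionary layout are illustrative choices, and how these raw values are then encoded into edge features is not specified here.

```python
import math


def box_center(box):
    """Center point of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def geometric_edge_features(box_a, box_b):
    """Relative position, distance, and orientation angle from box_a to box_b."""
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    dx, dy = bx - ax, by - ay
    return {
        "dx": dx,
        "dy": dy,
        "distance": math.hypot(dx, dy),
        # Angle of the center-to-center vector; in image coordinates the
        # y axis points downward, so positive angles turn toward the bottom.
        "angle_deg": math.degrees(math.atan2(dy, dx)),
    }


features = geometric_edge_features((0, 0, 2, 2), (3, 4, 5, 6))
print(features["dx"], features["dy"], features["distance"])  # 3.0 4.0 5.0
```

In practice the offsets and distance would usually be normalized by image size so that edge features are comparable across images of different resolution.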

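[0124] S5.3. A cross-modal graph matching network calculates the similarity between the text semantic graph $G_T$ and the image semantic graph $G_I$, comprising the node-level matching score $S_{node}$, the edge-level matching score $S_{edge}$, and the attribute constraint satisfaction $S_{attr}$:

[0125] $S_{node} = \frac{1}{M} \sum_{j=1}^{M} \max_{1 \le i \le N} \cos(n_j, o_i)$;

[0126] $S_{edge} = \frac{1}{|E_T|} \sum_{(j,l) \in E_T} \max_{(i,k)} \cos(r_{jl}, g_{ik})$;

[0127] $S_{attr} = \frac{1}{M} \sum_{j=1}^{M} \frac{1}{K_j} \sum_{k=1}^{K_j} \mathbb{1}(\cdot)$;

[0128] where $M$ is the number of text nodes, $n_j$ is the node feature of the j-th text node, $N$ is the number of candidate nodes on the image side, $o_i$ is the feature of the i-th candidate entity, $\cos(\cdot,\cdot)$ is the cosine similarity, $|E_T|$ is the number of text edges, $r_{jl}$ is the relation feature between text nodes $j$ and $l$, $g_{ik}$ is the relation feature between image nodes $i$ and $k$, $K_j$ is the number of attributes of the j-th node, and $\mathbb{1}(\cdot)$ is an indicator function;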
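A node-level matching score built on cosine similarity can be sketched as follows. The aggregation used here, taking each text node's best-matching image region and averaging over text nodes, is an assumption made for illustration; the text and image features are also assumed to have been projected to a common dimension already. `node_match_score` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def node_match_score(text_nodes, image_nodes):
    """Average over text nodes of the best cosine match among image regions."""
    if not text_nodes or not image_nodes:
        return 0.0
    best_matches = [
        max(cosine(text_vec, region_vec) for region_vec in image_nodes)
        for text_vec in text_nodes
    ]
    return sum(best_matches) / len(text_nodes)


# Two text entity nodes and two candidate image regions (toy 2-d features).
TEXT_NODES = [[1.0, 0.0], [0.0, 1.0]]
IMAGE_NODES = [[1.0, 0.0], [1.0, 1.0]]

score = node_match_score(TEXT_NODES, IMAGE_NODES)
```

The same best-match-then-average pattern could be applied to edge features, with the attribute constraints acting as a separate pass that counts how many constraints the matched regions satisfy.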

[0129] S5.4. Combine multi-level matching scores to calculate the final image-text similarity:

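[0130] $S = \omega_1 S_{node} + \omega_2 S_{edge} + \omega_3 S_{attr}$;

[0131] where $S_{node}$, $S_{edge}$, and $S_{attr}$ are the node-level matching score, the edge-level matching score, and the attribute constraint satisfaction from S5.3, and $\omega_1$, $\omega_2$, and $\omega_3$ are learnable fusion weights;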

[0132] S5.5 Sort the image library according to similarity scores and output the Top-K search results.
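Steps S5.4 and S5.5 together reduce to a weighted sum followed by a Top-K selection. In the sketch below the fusion weights are fixed constants standing in for the learnable weights, and the per-image score triples in `LIBRARY` are invented for illustration.

```python
import heapq


def fuse_scores(node_score, edge_score, attr_score, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three matching scores (weights fixed here, learned in the method)."""
    w_node, w_edge, w_attr = weights
    return w_node * node_score + w_edge * edge_score + w_attr * attr_score


def top_k(scored_images, k):
    """Return the k (image_id, similarity) pairs with the highest similarity."""
    return heapq.nlargest(k, scored_images, key=lambda item: item[1])


# Hypothetical (node, edge, attribute) scores for three library images.
LIBRARY = {
    "img_a": (0.9, 0.8, 1.0),
    "img_b": (0.4, 0.5, 0.0),
    "img_c": (0.7, 0.9, 0.5),
}

scored = [(image_id, fuse_scores(*scores)) for image_id, scores in LIBRARY.items()]
results = top_k(scored, k=2)
print([image_id for image_id, _ in results])  # ['img_a', 'img_c']
```

`heapq.nlargest` avoids sorting the whole library when K is much smaller than the number of images.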

[0133] This invention presents a Chinese semantic understanding method for remote sensing image-text retrieval. By constructing a Chinese remote sensing domain language model and designing a multi-level semantic parsing framework, it achieves deep semantic understanding of Chinese remote sensing query text, significantly improving the accuracy of remote sensing image-text retrieval in Chinese scenarios. Specifically, this invention acquires Chinese text queries, preprocesses and segments the text, and identifies remote sensing domain-specific vocabulary and composite entities; based on a pre-trained Chinese remote sensing domain language model, it extracts deep semantic features of the text, obtaining multi-granular representations at the word and sentence levels; it parses entity phrases, spatial relationships, and attribute information in the text, constructing a hierarchical semantic structure tree; it converts the hierarchical semantic structure tree into a graph structure representation, generating a text semantic graph containing nodes, edges, and attribute constraints; and it performs cross-modal matching based on the text semantic graph and image features to output retrieval results.

[0134] The above-disclosed embodiments are merely one or more preferred embodiments of this application and should not be construed as limiting the scope of this application. Those skilled in the art can understand that implementing all or part of the above embodiments and making equivalent changes in accordance with the claims of this application still fall within the scope of this application.

Claims

1. A Chinese semantic understanding method for remote sensing image and text retrieval, characterized in that it comprises: S1. Obtain a Chinese text query, preprocess and segment the text, and identify remote sensing domain-specific terms and compound entities; S2. Based on a pre-trained language model in the field of Chinese remote sensing, extract deep semantic features of the text and obtain multi-granularity representations at the word and sentence levels; S3. Analyze the entity phrases, spatial relationships and attribute information in the text, and construct a hierarchical semantic structure tree; S4. Convert the hierarchical semantic structure tree into a graph structure representation to generate a text semantic graph containing nodes, edges, and attribute constraints; S5. Perform cross-modal matching based on the text semantic graph and image features, and output retrieval results.

2. The Chinese semantic understanding method for remote sensing image and text retrieval as described in claim 1, characterized in that the specific steps of S1 include: S1.1. Input the Chinese query text $T = \{c_1, c_2, \ldots, c_L\}$, where $c_i$ is the i-th character and $L$ is the text length. Initial word segmentation is performed using a maximum matching algorithm assisted by a domain dictionary $D$, which includes remote sensing feature types, geographical names, spatial location terms, and attribute description terms; S1.2. For candidate entities in the word segmentation results, conditional random fields or bidirectional long short-term memory networks are used for entity recognition in the remote sensing domain. The recognition labels include: place name, land cover type, spatial location, attribute features, quantifiers, and time words. The entity recognition formula is as follows: $P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{L} \sum_{k=1}^{K} \lambda_k f_k(y_{i-1}, y_i, x, i) \right)$; where $x$ is the given input sequence, $y$ is the label sequence, $f_k$ is the $k$-th feature function, $\lambda_k$ is its weight parameter, $Z(x)$ is the normalization factor, $K$ is the total number of feature functions, $y_i$ is the label at the current position, and $y_{i-1}$ is the label at the previous position; S1.3. For the identified composite entities, an attention-based entity linking model aligns the entities with standard concepts in the remote sensing knowledge graph. The entity linking formula is as follows: $s(m, c) = \mathrm{MLP}_{link}\left([z_m; z_c; h_{ctx}]\right)$; where $m$ is an entity mention in the text, $c$ is a candidate concept in the knowledge graph, $z_m$ and $z_c$ are the embedded representations of the entity and the concept respectively, $h_{ctx}$ is the context representation, $\mathrm{MLP}_{link}$ is the multilayer perceptron used to score the link relationship, and $s(m, c)$ is the link score.

3. The Chinese semantic understanding method for remote sensing image-text retrieval according to claim 1, characterized in that the specific steps of S2 include:
S2.1 Constructing a pre-training corpus for the Chinese remote sensing domain, including remote sensing image titles, geographic information descriptions, remote sensing interpretation reports, and domain encyclopedia texts, with a corpus size of no less than 5 million characters;
S2.2 Starting from a general Chinese pre-trained model, adopting a domain-adaptive pre-training strategy comprising a whole-word masking language modeling task and a domain entity prediction task; the loss function is defined as follows:
$$\mathcal{L} = \mathcal{L}_{wwm} + \mu \, \mathcal{L}_{ent}$$
where $\mathcal{L}_{wwm}$ is the whole-word mask prediction loss, $\mathcal{L}_{ent}$ is the domain entity category prediction loss, and $\mu$ is the task weighting coefficient;
S2.3 Using the pre-trained Chinese remote sensing domain language model to encode the query text, obtaining word-level features $H = \{h_1, h_2, \ldots, h_L\}$ and a sentence-level feature $h_s$; the sentence-level feature is taken from the output at the [CLS] token or computed by attention pooling:
$$\alpha_i = \frac{\exp(w^{\top} h_i)}{\sum_{j=1}^{L} \exp(w^{\top} h_j)}$$
$$h_s = \sum_{i=1}^{L} \alpha_i h_i$$
where $h_i$ is the hidden state of the $i$-th character, $w$ is a learnable attention vector, $w^{\top}$ is the transpose of the attention vector, and $\alpha_i$ is the weight of the $i$-th element.
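
The attention pooling option in S2.3 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the toy hidden states and the attention vector are assumed values, and in practice both would come from the trained language model.

```python
import math


def attention_pool(hidden_states, w):
    """Softmax-weighted sum of hidden states, with scores w . h_i."""
    scores = [sum(wk * hk for wk, hk in zip(w, h)) for h in hidden_states]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


# Toy hidden states for three characters and an assumed attention vector.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, -0.5]
weights, pooled = attention_pool(hidden, w)
print(weights, pooled)
```

The weights sum to one, so the pooled vector is a convex combination of the character-level hidden states.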

4. The Chinese semantic understanding method for remote sensing image-text retrieval according to claim 1, characterized in that the specific steps of S3 include:
S3.1 Analyzing the syntactic structure of the text with a dependency parsing model and identifying the core predicates and their argument structures; the dependency parsing is implemented using a graph neural network or a transition-based system and outputs the set of dependency arcs $D = \{(g, d, \ell)\}$, where $g$ is the governing word, $d$ is the dependent word, and $\ell$ is the dependency relation type;
S3.2 Designing a hybrid method that combines rule templates for spatial relation extraction with a neural network, and extracting spatial relation triples $(e_1, r, e_2)$ from the dependency structure, where $e_1$ and $e_2$ are spatial entities and $r$ is the spatial relation type, the types including topological relations, directional relations, and distance relations; the spatial relation classification score is calculated as follows:
$$s(e_1, e_2) = \mathrm{softmax}\left(\mathrm{MLP}_{sp}\left([h_{e_1}; h_{e_2}; h_{path}]\right)\right)$$
where $h_{e_1}$ and $h_{e_2}$ are the contextual representations of entities $e_1$ and $e_2$, $\mathrm{MLP}_{sp}$ is a multilayer perceptron for spatial relation prediction, and $h_{path}$ is the dependency path representation, obtained by encoding the dependency arc sequence with a bidirectional LSTM:
$$h_{path} = [\overrightarrow{h}; \overleftarrow{h}]$$
where $[\cdot\,;\cdot]$ denotes feature concatenation, $\overrightarrow{h}$ is the feature encoded from the beginning of the sequence to the end, and $\overleftarrow{h}$ is the feature encoded from the end of the sequence to the beginning;
S3.3 Employing a prompt-learning-based attribute extraction method to identify attribute information in the text and construct attribute triples $(e, a, v)$, where $e$ is the entity, $a$ is the attribute type, and $v$ is the attribute value; the template filling formula is as follows:
$$T_{prompt}(e, a) = \mathrm{Template}(e, a, [\mathrm{MASK}])$$
and the probability distribution of attribute values is predicted using a masked language model:
$$P(v \mid T_{prompt}) = \frac{\exp\left(h_{[\mathrm{MASK}]}^{\top} u_v\right)}{\sum_{v' \in V_a} \exp\left(h_{[\mathrm{MASK}]}^{\top} u_{v'}\right)}$$
where $h_{[\mathrm{MASK}]}$ is the hidden state of the [MASK] token in the last layer of the language model, $u_v$ is the embedding vector of candidate attribute value $v$, and $V_a$ is the set of candidate values for attribute $a$;
S3.4 Constructing a hierarchical semantic structure tree $\mathcal{T} = (N_{\mathcal{T}}, E_{\mathcal{T}})$ by integrating the entity, spatial relationship, and attribute information, where the nodes $N_{\mathcal{T}}$ include a scene root node, entity nodes, and attribute nodes, and the edges $E_{\mathcal{T}}$ represent compositional relations, spatial relations, and attribute modification relations.
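
The attribute extraction step of claim 4, which predicts an attribute value with a masked language model, amounts to a softmax restricted to the candidate values of one attribute. The sketch below assumes the scores are dot products between the hidden state at the masked position and each candidate's embedding; the vectors, the candidate names, and the function `attribute_value_distribution` are hypothetical placeholders for model outputs.

```python
import math


def attribute_value_distribution(h_mask, candidate_embeddings):
    """Softmax over candidate attribute values, scored by dot product with h_mask."""
    scores = {value: sum(a * b for a, b in zip(h_mask, emb))
              for value, emb in candidate_embeddings.items()}
    # Normalize only over the candidate set of this attribute, not the full vocabulary.
    top = max(scores.values())
    exps = {value: math.exp(s - top) for value, s in scores.items()}
    total = sum(exps.values())
    return {value: e / total for value, e in exps.items()}


# Assumed hidden state at the masked position and toy embeddings for a "size" attribute.
h_mask = [0.9, 0.1]
size_candidates = {
    "large": [1.0, 0.0],
    "medium": [0.5, 0.5],
    "small": [0.0, 1.0],
}
probs = attribute_value_distribution(h_mask, size_candidates)
best_value = max(probs, key=probs.get)
print(best_value, probs)
```

The highest-probability value, together with its probability as a confidence, could then populate an attribute triple for the entity.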

5. The Chinese semantic understanding method for remote sensing image-text retrieval according to claim 1, characterized in that the specific steps of S4 include:
S4.1 Mapping the entity nodes in the semantic structure tree to graph nodes, where the node feature $n_i$ is composed of the contextual representation of the entity mention and the entity type embedding:
$$n_i = [h_i^{ctx}; t_i]$$
where $h_i^{ctx}$ is the contextual encoding of the entity mention and $t_i$ is the embedding vector of the entity type;
S4.2 Mapping the spatial relationships to graph edges, where the edge feature $b_{ij}$ consists of a relation type embedding and a relation strength encoding, the relation strength being calculated from the certainty words or modifiers in the text:
$$b_{ij} = [r_{ij}; s_{ij}]$$
where $r_{ij}$ is the relation type embedding and $s_{ij}$ is the relation strength;
S4.3 Encoding the attribute information into node attribute constraints for the entities, with the attribute constraint set $C_i = \{(a_k, v_k, p_k)\}$, where $a_k$ is the attribute type, $v_k$ is the attribute value, and $p_k$ is the confidence level; the attribute constraints are used as filtering conditions in the subsequent graph matching stage;
S4.4 Representing the final text semantic graph as $G_T = (V, E, C)$, where $V = \{n_i\}_{i=1}^{N}$ is the set of entity nodes, $E$ is the set of spatial relation edges, $C$ is the set of node attribute constraints, and $N$ is the number of entities.
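
One possible in-memory layout for the text semantic graph of claim 5 is sketched below. It assumes that node and edge features are built by simple concatenation and that relation strength is a single scalar; the class names, field names, and toy vectors are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeConstraint:
    attr_type: str
    value: str
    confidence: float


@dataclass
class TextSemanticGraph:
    nodes: dict = field(default_factory=dict)        # entity id -> node feature
    edges: dict = field(default_factory=dict)        # (head id, tail id) -> edge feature
    constraints: dict = field(default_factory=dict)  # entity id -> list of constraints


def build_graph(entities, relations, attributes):
    """Assemble nodes, edges, and attribute constraints into one graph object."""
    graph = TextSemanticGraph()
    for eid, (context_vec, type_emb) in entities.items():
        # Node feature: contextual encoding concatenated with the type embedding.
        graph.nodes[eid] = list(context_vec) + list(type_emb)
    for (head, tail), (rel_emb, strength) in relations.items():
        # Edge feature: relation type embedding concatenated with a scalar strength.
        graph.edges[(head, tail)] = list(rel_emb) + [strength]
    for eid, attr_type, value, confidence in attributes:
        graph.constraints.setdefault(eid, []).append(
            AttributeConstraint(attr_type, value, confidence))
    return graph


# Toy inputs standing in for the outputs of the parsing steps.
entities = {"parking_lot": ([0.2, 0.7], [1.0, 0.0]),
            "runway": ([0.6, 0.1], [0.0, 1.0])}
relations = {("parking_lot", "runway"): ([0.0, 1.0], 0.8)}
attributes = [("parking_lot", "size", "large", 0.9)]
graph = build_graph(entities, relations, attributes)
print(graph)
```

Keeping the constraints separate from the node features makes it straightforward to apply them as filters during graph matching, as S4.3 describes.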

6. The Chinese semantic understanding method for remote sensing image-text retrieval according to claim 1, characterized in that the specific steps of S5 include:
S5.1 Using an object detection network to extract candidate object regions from the image, generating a set of region visual features $\{f_i^{vis}\}$ and a set of region position encodings $\{f_i^{pos}\}$, and constructing the image region features $z_i$ by fusing the two;
S5.2 Based on the spatial relationships between regions, calculating the geometric features between image regions and constructing an image semantic graph $G_I$, in which the node features are the fused region features and the edge features are encoded from the geometric quantities of relative position, distance, and orientation angle;
S5.3 Using a cross-modal graph matching network to calculate the similarity between the text semantic graph $G_T$ and the image semantic graph $G_I$, including a node-level matching score $S_{node}$, an edge-level matching score $S_{edge}$, and an attribute constraint satisfaction degree $S_{attr}$:
$$S_{node} = \frac{1}{N} \sum_{j=1}^{N} \max_{1 \le i \le N_I} \cos\left(n_j, z_i\right)$$
$$S_{edge} = \frac{1}{|E|} \sum_{(j,l) \in E} \max_{(i,k)} \cos\left(b_{jl}, b^{I}_{ik}\right)$$
$$S_{attr} = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{K_j} \sum_{k=1}^{K_j} \mathbb{1}\left(\text{constraint } k \text{ of node } j \text{ is satisfied by its matched image node}\right)$$
where $N$ is the number of text nodes, $n_j$ is the node feature of the $j$-th text node, $N_I$ is the number of candidate nodes on the image side, $z_i$ is the feature of the $i$-th candidate entity, $\cos(\cdot,\cdot)$ is the cosine similarity, $|E|$ is the number of text edges, $b_{jl}$ is the relation feature between text node $j$ and text node $l$, $b^{I}_{ik}$ is the relation feature between image node $i$ and image node $k$, $K_j$ is the number of attributes of the $j$-th node, and $\mathbb{1}(\cdot)$ is the indicator function;
S5.4 Combining the multi-level matching scores to calculate the final image-text similarity:
$$S = \omega_1 S_{node} + \omega_2 S_{edge} + \omega_3 S_{attr}$$
where $\omega_1$, $\omega_2$, and $\omega_3$ are learnable fusion weights;
S5.5 Sorting the image library by similarity score and outputting the Top-K retrieval results.
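
The scoring and ranking steps of claim 6 can be sketched as follows. Several points are assumptions made for illustration: each text node or edge is matched to its best image counterpart by maximum cosine similarity, the attribute satisfaction degree is supplied as a precomputed fraction, and the fusion weights are fixed constants, whereas the claim describes them as learnable. All function names and data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def best_match_score(text_feats, image_feats):
    """Mean over text items of the best cosine similarity among image items."""
    if not text_feats or not image_feats:
        return 0.0
    return sum(max(cosine(t, v) for v in image_feats)
               for t in text_feats) / len(text_feats)


def fused_similarity(s_node, s_edge, s_attr, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three scores; the weights here are assumed constants."""
    w_node, w_edge, w_attr = weights
    return w_node * s_node + w_edge * s_edge + w_attr * s_attr


def top_k(scores, k):
    """Image ids sorted by descending similarity, truncated to k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy text graph features and a three-image library.
text_nodes = [[1.0, 0.0], [0.0, 1.0]]
text_edges = [[0.0, 1.0, 0.8]]
library = {
    "img_a": {"nodes": [[1.0, 0.0], [0.0, 1.0]], "edges": [[0.0, 1.0, 0.8]], "attr": 1.0},
    "img_b": {"nodes": [[1.0, 1.0]], "edges": [[1.0, 0.0, 0.0]], "attr": 0.5},
    "img_c": {"nodes": [[-1.0, 0.0]], "edges": [], "attr": 0.0},
}
scores = {
    name: fused_similarity(best_match_score(text_nodes, img["nodes"]),
                           best_match_score(text_edges, img["edges"]),
                           img["attr"])
    for name, img in library.items()
}
ranking = top_k(scores, 2)
print(ranking)
```

With learned weights, the same structure applies; only the constants in `fused_similarity` would be replaced by trained parameters.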