Method and device for generating semantic description of remote sensing image, and electronic device
A Transformer-based image feature optimization structure and a learnable query vector are processed jointly to address the dense visual features and semantic ambiguity that affect single-stage remote sensing image description methods. This enables efficient modeling of complex ground features and semantic relationships in remote sensing images under a single-stage architecture, and improves the accuracy and completeness of the semantic description.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- AEROSPACE INFORMATION RES INST CAS
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, remote sensing image description methods struggle to generate comprehensive and accurate semantic descriptions in complex scenarios. Single-stage methods feature dense visual features but are semantically ambiguous, while two-stage methods incur high computational overhead and rely on external detection modules, resulting in description bias and insufficient real-time performance.
A Transformer-based image feature optimization structure and a learnable query vector are processed jointly. An image feature extractor produces the initial visual features, which are remapped and context-enhanced together with the query vector. A multi-layer attention mechanism lets the query vector adaptively learn key semantic information, and a preset decoder generates the target semantic description.
The single-stage architecture improves the accuracy and completeness of semantic description of remote sensing images, solves the problem of dense visual features and semantic ambiguity, and significantly enhances the ability to generate descriptions in complex scenes.
Figure: abstract drawing (CN122242515A)
Description
Technical Field
[0001] This invention relates to the field of remote sensing image processing technology, and more specifically, to a method, apparatus, and electronic device for generating semantic descriptions of remote sensing images.
Background Technology
[0002] Traditional image description methods typically employ manually designed feature extraction techniques to extract visual features from images. Common feature extraction methods include SIFT (Scale-Invariant Feature Transform) and SURF (Speeded-Up Robust Features). These methods analyze local regions in an image to extract local feature descriptors that are independent of changes in scale, rotation, and illumination. These local features are then encoded to obtain a visual feature representation of the entire image. After obtaining the visual feature representation, traditional natural language processing techniques are then used to generate an image description.
[0003] Building upon this foundation, various machine learning algorithms have been attempted to establish image-to-text mapping models, such as support vector machines, maximum entropy models, and conditional random fields. These methods primarily rely on manually extracted image and text features and establish image-to-text mapping models through traditional machine learning algorithms. Their performance is affected by feature extraction, and traditional natural language processing techniques often face limitations in language expression and semantic understanding, making it difficult to handle complex semantic relationships.
[0004] Therefore, deep learning-based image description generation methods have been proposed in related technologies. Image description methods in the remote sensing field are mainly divided into single-stage methods and two-stage methods that acquire semantic features as supplementary input models.
[0005] Among them, the single-stage method is an encoder-decoder image description method. Figure 1 is a schematic diagram of a single-stage remote sensing image description method in the related art. As shown in Figure 1, the input image is transformed into a fixed-length vector (image features) by an encoder, and this vector is then passed to a decoder to generate the corresponding descriptive statements. However, this method obtains semantic information directly from image features, which makes it difficult to generate comprehensive and accurate descriptive statements in complex scenarios.
[0006] The two-stage approach uses image semantic features obtained from detectors and classifiers, such as targets and buildings in the image, as supplementary information to obtain more specific and accurate semantic information from remote sensing images, thereby generating more comprehensive and accurate descriptive statements. Figure 2 is a schematic diagram of a two-stage remote sensing image description method in the related art. As shown in Figure 2, the input image is transformed into image features by an encoder, semantic information is obtained by a detector / classifier, and a descriptive statement is then generated by a decoder. Although this method can improve the accuracy of the description by extracting target semantic information through additional detectors or classifiers, it has a complex structure and high computational cost, and it depends on the performance of external detection modules. It is prone to description deviation caused by detection errors or missed detections. In addition, training the two stages independently makes end-to-end optimization difficult, which affects overall coordination and real-time performance.
[0007] There is currently no effective solution to the above problems.
Summary of the Invention
[0008] This invention provides a method, apparatus, and electronic device for generating semantic descriptions of remote sensing images, to at least solve the technical problem that the related art cannot accurately generate semantic descriptions of remote sensing images.
[0009] According to one aspect of the present invention, a method for generating a semantic description of a remote sensing image is provided, comprising: receiving a target remote sensing image; extracting features from the target remote sensing image using an image feature extractor to obtain initial visual features; determining an initial query vector, and processing the initial visual features and the initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image, wherein the model structure of the preset generation model includes at least: a preset image feature optimization structure and a preset decoder, the preset image feature optimization structure being used to process the initial visual features and the initial query vector to obtain target features, and the preset decoder being used to process the target features to obtain a target semantic description.
[0010] Furthermore, the step of processing the initial visual features and initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image includes: processing the initial visual features and initial query vector using a preset image feature optimization structure to obtain target features, wherein the target features include: target visual features and target query vector; and processing the target features using a preset decoder to generate a target semantic description.
[0011] Furthermore, the preset image feature optimization structure includes at least a projection layer and a mapping structure. The step of processing the initial visual features and the initial query vector using the preset image feature optimization structure to obtain the target features includes: using the projection layer to convert the initial visual features into a preset number of visual vectors to obtain an initial visual vector sequence; combining the initial visual vector sequence and the initial query vector to obtain an initial combined vector; and using the mapping structure to process the initial combined vector to obtain the target features.
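The projection and combination steps of paragraph [0011] can be illustrated with a minimal sketch. It assumes the projection layer is a single linear map whose output is split into a preset number of vectors, and that the combination is a concatenation along the sequence axis with the visual vectors first. Neither detail is stated in the source, and all names and sizes below (`project_to_sequence`, `combine`, `NUM_VISUAL`, `NUM_QUERIES`, `DIM`) are hypothetical.

```python
import random

def project_to_sequence(feature, weight, bias, num_vectors, dim):
    """Linearly project one flat feature, then split it into num_vectors vectors of size dim."""
    assert len(weight) == num_vectors * dim and len(bias) == num_vectors * dim
    flat = [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]
    return [flat[i * dim:(i + 1) * dim] for i in range(num_vectors)]

def combine(visual_sequence, query_vectors):
    """Concatenate along the sequence axis (assumed order: visual vectors, then queries)."""
    return [list(v) for v in visual_sequence] + [list(q) for q in query_vectors]

rng = random.Random(0)
# Illustrative sizes only; the source fixes none of them at this point.
FEATURE_DIM, NUM_VISUAL, NUM_QUERIES, DIM = 512, 4, 3, 8

feature = [rng.gauss(0, 1) for _ in range(FEATURE_DIM)]            # initial visual features
weight = [[rng.gauss(0, 0.02) for _ in range(FEATURE_DIM)]         # projection layer weights
          for _ in range(NUM_VISUAL * DIM)]
bias = [0.0] * (NUM_VISUAL * DIM)
queries = [[rng.gauss(0, 0.02) for _ in range(DIM)]                # learnable query vectors
           for _ in range(NUM_QUERIES)]

visual_sequence = project_to_sequence(feature, weight, bias, NUM_VISUAL, DIM)
combined = combine(visual_sequence, queries)
print(len(combined), len(combined[0]))  # sequence length and vector size
```

With these sizes the combined sequence holds 4 visual vectors followed by 3 query vectors, each of size 8, so the mapping structure sees both kinds of token in one sequence.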
[0012] Furthermore, the mapping structure includes: multiple transformer modules. The steps of processing the initial combined vector using the mapping structure to obtain the target features include: inputting the initial combined vector into the first transformer module in the mapping structure; using the vector output by the first transformer module as the input vector of the next transformer module in the mapping structure, until the output vector of the last transformer module in the mapping structure is obtained; and representing the output vector as the target features.
[0013] Furthermore, the processing steps of each transformer module include: performing a linear transformation on the input vector based on the preset weight matrix of the transformer module to obtain the query projection, key projection, and value projection; determining the dot product between the query projection and the key projection to obtain the attention weights, and normalizing the attention weights; determining the head output vector based on the normalized attention weights and the value projection; concatenating all head output vectors to obtain the concatenated vector, and performing a preset operation on the concatenated vector to obtain the initial output vector; and processing the initial output vector using a multilayer perceptron to obtain the output vector.
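The per-module steps of paragraph [0013], together with the chaining of modules described in paragraph [0012], can be sketched as follows. This is only an illustration under stated assumptions: the normalization is taken to be a softmax over scaled dot products, the "preset operation" is taken to be an output projection plus a residual connection, the multilayer perceptron uses a ReLU activation with a second residual connection, and layer normalization is omitted for brevity. All weights are random and every name is hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x d) by matrix b (d x m), both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def softmax(row):
    """Normalize one row of scores so that it sums to 1."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(x, wq, wk, wv):
    """One head: project to Q/K/V, score by dot product, normalize, weight the values."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))  # assumption: scaled dot product
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights

def transformer_module(x, heads, wo, w1, w2):
    """Concatenate the head outputs, apply the assumed preset operation, then the MLP."""
    head_outputs = [attention_head(x, *h)[0] for h in heads]
    concat = [[v for out in head_outputs for v in out[i]] for i in range(len(x))]
    initial = add(x, matmul(concat, wo))  # assumed: output projection + residual
    hidden = [[max(0.0, v) for v in row] for row in matmul(initial, w1)]  # ReLU
    return add(initial, matmul(hidden, w2))

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
SEQ_LEN, DIM, NUM_HEADS, HEAD_DIM, HIDDEN, NUM_MODULES = 7, 8, 2, 4, 16, 3
x = rand_matrix(rng, SEQ_LEN, DIM)  # stands in for the initial combined vector
modules = [
    {
        "heads": [tuple(rand_matrix(rng, DIM, HEAD_DIM) for _ in range(3))
                  for _ in range(NUM_HEADS)],
        "wo": rand_matrix(rng, NUM_HEADS * HEAD_DIM, DIM),
        "w1": rand_matrix(rng, DIM, HIDDEN),
        "w2": rand_matrix(rng, HIDDEN, DIM),
    }
    for _ in range(NUM_MODULES)
]

out = x
for params in modules:  # the output of each module feeds the next one
    out = transformer_module(out, **params)

_, first_weights = attention_head(x, *modules[0]["heads"][0])
print(len(out), len(out[0]))  # the sequence shape is preserved
```

Because visual vectors and query vectors share one sequence, each query position can attend to every visual position in every module, which is one plausible reading of how the query vector learns semantic information from the image features.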
[0014] Furthermore, before processing the initial visual features and initial query vector using a preset generation model to generate the target semantic description of the target remote sensing image, the process further includes: constructing an initial generation model, wherein the model structure of the initial generation model includes at least: an initial image feature optimization structure and an initial preset decoder; collecting a historical remote sensing image set and labeling each historical remote sensing image in the historical remote sensing image set to obtain labeling information, wherein the labeling information includes at least: a set of real semantic features and real descriptive statements; training the initial generation model using the historical remote sensing image set and the labeling information corresponding to each historical remote sensing image until the target loss value is less than a preset loss threshold, thereby obtaining the preset generation model, wherein the target loss value is determined by a target loss function, which includes: a query loss value and a description statement generation loss value, and the target loss function is composed of a query loss function and a description statement generation loss function.
[0015] Furthermore, in the process of training the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image, the process also includes: constructing a bipartite graph based on the predicted query vector output by the initial image feature optimization structure and the set of real semantic features, wherein the number of query operators in the predicted query vector is greater than the number of real semantic features in the set of real semantic features; matching each query operator with each real semantic feature based on the bipartite graph to obtain the matching cost of each matching result, wherein the matching result includes: the query operator corresponding to each real semantic feature; determining the matching result corresponding to the minimum matching cost as the target matching result; determining the current query loss value using a query loss function based on the query operator corresponding to each real semantic feature indicated by the target matching result; determining the current description statement generation loss value using a description statement generation loss function based on the predicted semantic description output by the initial preset decoder and the real description statement; determining the current loss value based on the current query loss value and the current description statement generation loss value; and determining the current loss value as the target loss value and stopping training when the current loss value is less than a preset loss threshold.
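The matching and loss steps of paragraph [0015] can be illustrated with a small sketch. The source does not define the matching cost or the loss functions at this point, so the sketch assumes that each query operator predicts a probability distribution over a vocabulary of semantic words, that the cost of a pairing is the negative predicted probability of the real word, that the query loss is the mean cross-entropy over matched pairs, and that the two losses are simply added. For a problem this small the minimum-cost matching is found by exhaustive search as a stand-in for a dedicated assignment algorithm; at realistic sizes a solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. All names and numbers are hypothetical.

```python
import math
from itertools import permutations

def match_queries(cost):
    """Return (assignment, total_cost), where assignment[j] is the query matched to target j.

    cost[i][j] is the cost of pairing query i with real semantic feature j.
    Exhaustive search over all one-to-one assignments; only suitable for tiny inputs.
    """
    num_queries, num_targets = len(cost), len(cost[0])
    assert num_queries >= num_targets  # more query operators than real features
    best, best_cost = None, math.inf
    for perm in permutations(range(num_queries), num_targets):
        total = sum(cost[q][j] for j, q in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost

# Hypothetical predictions: query_probs[i][w] is the probability that query i predicts word w.
vocab = ["airport", "runway", "river", "bridge"]
query_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
targets = [1, 0]  # real semantic words for this image: "runway", "airport"

# Assumed matching cost: negative probability of the real word.
cost = [[-probs[t] for t in targets] for probs in query_probs]
assignment, _ = match_queries(cost)

# Assumed query loss: mean cross-entropy over the matched (query, real word) pairs.
query_loss = -sum(math.log(query_probs[q][t]) for q, t in zip(assignment, targets)) / len(targets)

caption_loss = 1.25  # placeholder for the description statement generation loss
current_loss = query_loss + caption_loss  # assumed unweighted sum

LOSS_THRESHOLD = 0.5  # hypothetical preset loss threshold
stop_training = current_loss < LOSS_THRESHOLD
print(assignment, round(query_loss, 4), stop_training)
```

Here the second query is matched to "runway" and the first to "airport", while the third query stays unmatched. How unmatched query operators are treated in the loss is not specified in the source and is left out of the sketch.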
[0016] According to another aspect of the present invention, a semantic description generation apparatus for remote sensing images is also provided, comprising: a receiving unit for receiving a target remote sensing image; an extraction unit for extracting features from the target remote sensing image using an image feature extractor to obtain initial visual features; and a processing unit for determining an initial query vector and processing the initial visual features and the initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image, wherein the model structure of the preset generation model includes at least: a preset image feature optimization structure and a preset decoder, the preset image feature optimization structure being used to process the initial visual features and the initial query vector to obtain target features, and the preset decoder being used to process the target features to obtain a target semantic description.
[0017] Furthermore, the processing unit includes: a first processing module, used to process the initial visual features and the initial query vector using a preset image feature optimization structure to obtain target features, wherein the target features include: target visual features and target query vector; and a second processing module, used to process the target features using a preset decoder to generate a target semantic description.
[0018] Furthermore, the preset image feature optimization structure includes at least a projection layer and a mapping structure. The first processing module includes: a first transformation submodule, used to use the projection layer to convert the initial visual features into a preset number of visual vectors to obtain an initial visual vector sequence; a first combination submodule, used to combine the initial visual vector sequence and the initial query vector to obtain an initial combination vector; and a first processing submodule, used to use the mapping structure to process the initial combination vector to obtain the target features.
[0019] Furthermore, the mapping structure includes: multiple transformer modules, and the first processing submodule includes: a first input submodule, used to input the initial combined vector to the first transformer module in the mapping structure; a second input submodule, used to use the vector output by the first transformer module as the input vector of the next transformer module in the mapping structure, until the output vector of the last transformer module in the mapping structure is obtained; and a first representation submodule, used to represent the output vector as target features.
[0020] Furthermore, the third processing module includes: a first transformation submodule, used to perform a linear transformation on the input vector based on the preset weight matrix of the transformer module to obtain the query projection, key projection, and value projection; a first determination submodule, used to determine the dot product between the query projection and the key projection to obtain the attention weights, and to normalize the attention weights; a second determination submodule, used to determine the head output vector based on the normalized attention weights and the value projection; a first concatenation submodule, used to concatenate all the head output vectors to obtain the concatenated vector, and to perform a preset operation on the concatenated vector to obtain the initial output vector; and a second processing submodule, used to process the initial output vector using a multilayer perceptron to obtain the output vector.
[0021] Furthermore, the semantic description generation device also includes: a first construction module, used to construct an initial generation model before processing the initial visual features and initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image, wherein the model structure of the initial generation model includes at least: an initial image feature optimization structure and an initial preset decoder; a first annotation module, used to collect a historical remote sensing image set and annotate each historical remote sensing image in the historical remote sensing image set to obtain annotation information, wherein the annotation information includes at least: a set of real semantic features and real descriptive statements; a first training module, used to train the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image until the target loss value is less than a preset loss threshold, thereby obtaining a preset generation model, wherein the target loss value is determined by a target loss function, the target loss value includes: a query loss value and a description statement generation loss value, and the target loss function is composed of a query loss function and a description statement generation loss function.
[0022] Furthermore, the semantic description generation device also includes: a second construction module, used to construct a bipartite graph based on the predicted query vector output by the initial image feature optimization structure and the set of real semantic features, during the training of the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image, wherein the number of query operators in the predicted query vector is greater than the number of real semantic features in the set of real semantic features; a first matching module, used to match each query operator with each real semantic feature based on the bipartite graph to obtain the matching cost of each matching result, wherein the matching result includes: the query operator corresponding to each real semantic feature; a first determination module, used to determine the matching result corresponding to the minimum matching cost as the target matching result; a second determination module, used to determine the current query loss value using a query loss function based on the query operator corresponding to each real semantic feature indicated by the target matching result; a third determination module, used to determine the current description statement generation loss value using a description statement generation loss function based on the predicted semantic description output by the initial preset decoder and the real description statement; a fourth determination module, used to determine the current loss value based on the current query loss value and the current description statement generation loss value; and a fifth determination module, used to determine the current loss value as the target loss value and stop training if the current loss value is less than a preset loss threshold.
[0023] According to another aspect of the present invention, a computer program product is also provided, including a non-volatile computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for generating semantic descriptions of remote sensing images as described above.
[0024] According to another aspect of the present invention, an electronic device is also provided, including one or more processors and a memory, the memory being used to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement any of the above-described methods for generating semantic descriptions of remote sensing images.
[0025] In this invention, a target remote sensing image is received, and an image feature extractor is used to extract features from the target remote sensing image to obtain initial visual features. An initial query vector is determined, and a preset generation model is used to process the initial visual features and the initial query vector to generate a target semantic description of the target remote sensing image. This solves the technical problem in related technologies that it is impossible to accurately generate a semantic description of a remote sensing image.
[0026] In this invention, a Transformer-based image feature optimization structure and a learnable query vector are processed collaboratively. The initial visual features output by the image feature extractor and the initial query vector are input into the preset image feature optimization structure. A multi-layer attention mechanism remaps the visual features and enhances their context, and, under semantic supervision, guides the query vector to adaptively learn key semantic information from the image, producing target features rich in semantic associations. A preset decoder then generates the target semantic description step by step from the target features, which improves the alignment between the visual features and the semantic expression space. This achieves efficient modeling of complex ground features and semantic relationships in remote sensing images under a single-stage architecture, solving the technical problem of incomplete descriptions and logical deviations caused by dense visual features and ambiguous semantics in traditional single-stage methods, and significantly improving the accuracy and completeness of the semantic description of remote sensing images.
Brief Description of the Drawings
[0027] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings:
[0028] Figure 1 is a schematic diagram of a single-stage remote sensing image description method in the related art;
[0029] Figure 2 is a schematic diagram of a two-stage remote sensing image description method in the related art;
[0030] Figure 3 is a flowchart of an optional method for generating semantic descriptions of remote sensing images according to an embodiment of the present invention;
[0031] Figure 4 is a schematic diagram of an optional model structure based on an attention mechanism and a semantic feature prediction operator according to an embodiment of the present invention;
[0032] Figure 5 is a schematic diagram of an optional semantic description generation device for remote sensing images according to an embodiment of the present invention;
[0033] Figure 6 is a hardware structure block diagram of an electronic device (or mobile device) for generating a semantic description of remote sensing images according to an embodiment of the present invention.
Detailed Implementation
[0034] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0035] It should be noted that the terms "first," "second," etc., used in this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0036] It should be noted that all related information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, and displayed data) collected and involved in this invention are information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, storage, use, processing, transmission, provision, disclosure, and application of this data comply with the relevant laws, regulations, and standards of the relevant regions, have taken necessary security measures, do not violate public order and good morals, and provide corresponding operation entry points for users to choose to authorize or refuse. For example, this system has an interface with relevant users or organizations. Before obtaining relevant information, a request to obtain the information needs to be sent to the aforementioned user or organization through the interface, and the relevant information is obtained only after receiving consent from the aforementioned user or organization.
[0037] With the development of deep learning, image description algorithms have also made progress. From the initial traditional algorithms to those based on encoder-decoder structures using CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network), each stage of progress has benefited from the development of visual language pre-trained models. Whether it's the image feature extraction module using VGGNet (Visual Geometry Group Network) or the target interest region extraction algorithm using Faster R-CNN (Faster Region-based Convolutional Neural Network), both are based on corresponding visual pre-trained models. Simultaneously, the decoder has evolved from RNNs to structures using Transformers and GPT (Generative Pre-trained Transformers), employing corresponding natural language pre-trained models. The continuous emergence of visual language pre-trained models has made it possible to use cross-modal pre-trained models for image description tasks, providing a new direction for the development of remote sensing image description algorithms.
[0038] Despite the rapid development of image description algorithms, the field still faces many challenges. First, the quality of image descriptions remains significantly inconsistent. Current algorithms often generate descriptions that are relevant to the image content but inaccurate, or descriptions that are accurate in content but lack semantic logic. Therefore, establishing a better correspondence between images and text, ensuring semantic alignment in the mapping space, is a pressing issue. Furthermore, in remote sensing, images often possess richer visual features and semantic information than natural scenes, posing greater challenges for image description algorithms in acquiring detailed information and semantic relationships. Therefore, maintaining semantic consistency when generating long text descriptions of complex remote sensing images, and accurately describing the relationships between various objects within complex scenes, is also crucial.
[0039] Based on this, this invention addresses the discrepancy between visual features extracted by a CNN-based image feature encoder and semantic features generated by a decoder based on natural language models such as Transformer in single-stage methods. A Transformer-based image feature optimization module is proposed. By remapping the initially extracted image features to a new spatial dimension through a mapping layer and adding a multi-layer Transformer structure based on an attention mechanism, the model focuses more on key regions in the image, while the remapped visual and semantic features exhibit stronger consistency.
[0040] Furthermore, considering that single-stage image description methods typically obtain semantic information directly from image feature decoding, which struggles to generate comprehensive and accurate descriptive statements in complex scenarios, this invention proposes a semantic feature prediction operator. This operator is defined as a learnable query vector within the network and interacts with image features through multiple Transformer layers. While being supervised by semantic information, it directly learns key semantic features from image features. This further enhances the consistency of the visual semantic feature space and allows the semantic feature operator to directly predict semantic words in the current remote sensing image, strengthening the model's ability to directly extract relevant semantic information from images. Finally, the semantic features extracted by the semantic feature prediction operator and the visual features processed through mapping and attention mechanisms serve as input to the decoder, improving the accuracy and comprehensiveness of the model's generated descriptions in complex scenarios.
[0041] Furthermore, current remote sensing image description methods are mainly divided into two categories: one is the single-stage method based on encoder-decoder, and the other is the two-stage method that adds additional detectors and classifiers to further extract semantic features from the image based on the single-stage method. However, with the rapid development of remote sensing technology, the scale and complexity of remote sensing image data are constantly increasing. Although the single-stage method has a simple model structure and efficient processing flow, it often falls short in capturing fine-grained information and complex semantic relationships in these complex and ever-changing image scenes. These methods mainly rely on encoders to extract visual features and decoders to generate linguistic descriptions. This makes it difficult to generate detailed and accurate descriptions when remote sensing images have diverse land cover types and complex spatial relationships, and thus fails to accurately reflect all the important semantic information in the image.
[0042] To address the aforementioned issues, this invention proposes an image feature optimization method based on an attention mechanism, together with a semantic feature prediction operator. The former remaps the image features obtained through an image feature extractor and adds a self-attention mechanism, generating visual features with rich contextual information from the originally dense image feature information. The semantic feature prediction operator interacts with image features through learnable query operators. Under semantic supervision, these operators can directly learn key semantic information from image features, thereby significantly improving the model's ability to understand and express image semantics. Because different remote sensing images contain different semantic information, a single query operator may be insufficient to cover all semantic content in an image. This invention therefore uses multiple query operators and employs the Hungarian algorithm to establish a bipartite graph matching between the query operators and the set of real semantic words.
[0043] The present invention will now be described in detail with reference to various embodiments.
[0044] Example 1
[0045] According to an embodiment of the present invention, a method embodiment for generating semantic descriptions of remote sensing images is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order from the one shown here.
[0046] Figure 3 is a flowchart of an optional method for generating semantic descriptions of remote sensing images according to an embodiment of the present invention. As shown in Figure 3, the method includes the following steps:
[0047] Step S301: Receive the target remote sensing image.
[0048] In this embodiment of the invention, raw remote sensing images can be acquired from satellites or airborne platforms. The image format is a multispectral or panchromatic pixel matrix with a size of H×W×C (H is the height, W is the width, and C is the number of bands). It has not been manually labeled or semantically segmented and is only used as the raw input source for subsequent feature extraction.
[0049] Step S302: Use an image feature extractor to extract features from the target remote sensing image to obtain initial visual features.
[0050] In this embodiment of the invention, the quality of image feature extraction and representation is crucial for generating accurate and rich descriptions in the image description generation task. The CLIP (Contrastive Language–Image Pretraining) model is pretrained on a massive multimodal dataset containing millions of image and text pairs from the internet. This large-scale pretraining enables CLIP to learn rich visual and linguistic knowledge and thereby understand a wide range of visual concepts and linguistic expressions. Given CLIP's training on diverse data and its excellent generalization ability, the pretrained CLIP model can be used as a preliminary image feature extractor that obtains the initial visual features by feature mapping of the target remote sensing image. Here, the initial visual feature is a dense vector of 1×512 dimensions, aggregated from the global semantic information of the image. In essence it is a compressed representation of the image in the cross-modal semantic space; it is neither a spatial feature map nor a local keypoint response. It provides an initial visual representation with a semantic prior for subsequent feature optimization, avoiding the insufficient generalization of handcrafted features (such as SIFT and SURF).
[0051] Step S303: Determine the initial query vector, and use a preset generation model to process the initial visual features and the initial query vector to generate a target semantic description of the target remote sensing image. The model structure of the preset generation model includes at least: a preset image feature optimization structure and a preset decoder. The preset image feature optimization structure is used to process the initial visual features and the initial query vector to obtain target features, and the preset decoder is used to process the target features to obtain a target semantic description.
[0052] In this embodiment of the invention, single-stage methods typically rely directly on the visual features of an image to generate natural language descriptions. When faced with the rich semantic content of remote sensing images, such methods may fail to capture and express all the details and contextual information, so the generated descriptions lack richness and accuracy. Therefore, a semantic feature prediction operator is additionally introduced, namely a set of learnable query vectors (Queries), i.e., the initial query vector, whose size is a preset number of queries. These query vectors are initialized from a random Gaussian distribution before model training. They do not rely on external detectors or classification results, but are embedded within the model as semantic guidance signals. Their function is to interact with the visual features through a self-attention mechanism, actively predicting semantic components that are missing or insufficiently expressed in the image and thus modeling semantic information implicitly. These query vectors enhance the model's ability to understand and predict image semantic information.
[0053] Then, by combining the initial query vector with the extracted initial visual features and feeding it into a pre-defined generative model, a target semantic description of the remote sensing image can be generated. During model processing, the query vector can predict ignored semantic information based on the current visual features and guide the mapped visual vector to focus more on image regions containing rich semantic information, reducing the expression of irrelevant information.
[0054] Here, the pre-defined generative model structure includes a pre-defined image feature optimization structure and a pre-defined decoder. The pre-defined image feature optimization structure is a multi-layer Transformer module. The input is the initial visual features, which are expanded into k vectors by a projection layer and concatenated with the initial query vector. These vectors then undergo self-attention interaction through the multi-layer Transformer to output the target features. The pre-defined decoder is another set of multi-layer Transformers. Using the target features as cross-attention key-value pairs, it takes the generated word sequence as input and outputs the target semantic description word by word. Its function is to jointly optimize the visual-semantic alignment space, achieving end-to-end semantic description generation without the need for two-stage detection, thus improving the completeness and consistency of descriptions in complex scenes.
[0055] In summary, a collaborative processing approach using a Transformer-based image feature optimization structure and learnable query vectors can be adopted. The initial visual features output by the image feature extractor and the initial query vector are input into the preset image feature optimization structure. A multi-layer attention mechanism remaps the visual features and enhances their context, and guides the query vector to adaptively learn key semantic information from the image under semantic supervision. This produces target features rich in semantic associations, from which the preset decoder progressively generates the target semantic description. This improves the alignment accuracy between visual features and the semantic representation space, achieving efficient modeling of complex ground features and semantic relationships in remote sensing images within a single-stage architecture. It solves the technical problem of incomplete descriptions and logical deviations caused by dense visual features and ambiguous semantics in traditional single-stage methods, and significantly improves the accuracy and completeness of semantic descriptions of remote sensing images.
[0056] To improve the accuracy of generating target semantic descriptions for remote sensing images, the semantic description generation method for remote sensing images provided in Embodiment 1 of this application uses a preset image feature optimization structure to process the initial visual features and the initial query vector to obtain target features, wherein the target features include: target visual features and target query vector; and a preset decoder is used to process the target features to generate a target semantic description.
[0057] In this embodiment of the invention, the initial visual feature is extracted from the input remote sensing image by a pre-trained CLIP visual encoder and is a dense vector of 1×512 dimensions. The initial query vector is a set of learnable parameter vectors in the network, with a preset size. Each query vector has the same dimension as the initial visual features and is used to guide the model to focus on semantic regions in the image that are not fully represented. The preset image feature optimization structure consists of a projection layer and a stack of Transformer modules. The projection layer linearly maps the initial visual features into k high-dimensional visual vectors, forming a visual vector set. The initial query vectors are then concatenated with the visual vector set to form a joint feature set. The joint feature set is passed through each Transformer module in turn for interactive feature optimization; each module contains a multi-head self-attention mechanism and a feedforward neural network. After layer-by-layer iteration, the structure outputs a set of semantically enhanced visual vectors (i.e., the target visual features) and a set of query vectors optimized under semantic supervision (i.e., the target query vector), which together constitute the target features.
[0058] Then, the preset decoder is used to process the target features to generate the target semantic description. The preset decoder is a stack of Transformer decoder layers that takes the target visual features and the target query vector from the target features mentioned above as input. The decoder generates the description sequence word by word in an autoregressive manner. At each step, a joint representation of the target features and the word sequence generated so far is computed through a cross-attention mechanism, and the probability distribution of the next word over the vocabulary is then output through a linear transformation and Softmax, until a complete semantic description is generated.
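The word-by-word generation loop described above can be sketched as follows. This is only an illustrative sketch: the names `greedy_decode`, `toy_decoder`, the `<bos>`/`<eos>` tokens, and the greedy choice of the most probable word are assumptions made for the example, and the toy decoder returns a fixed script of distributions in place of a real Transformer decoder.

```python
from typing import Callable, Dict, List, Sequence


def greedy_decode(
    next_word_probs: Callable[[Sequence[str]], Dict[str, float]],
    start_token: str = "<bos>",   # hypothetical start marker
    end_token: str = "<eos>",     # hypothetical end marker
    max_len: int = 20,
) -> List[str]:
    """Generate a description word by word, picking the most probable next word."""
    words: List[str] = [start_token]
    for _ in range(max_len):
        probs = next_word_probs(words)      # distribution over the vocabulary
        best = max(probs, key=probs.get)    # greedy choice (an assumption)
        if best == end_token:
            break
        words.append(best)
    return words[1:]                        # drop the start token


# Toy stand-in for the decoder: one fixed distribution per decoding step.
_SCRIPT = [
    {"a": 0.7, "road": 0.2, "<eos>": 0.1},
    {"a": 0.1, "road": 0.8, "<eos>": 0.1},
    {"a": 0.1, "road": 0.1, "<eos>": 0.8},
]


def toy_decoder(prefix: Sequence[str]) -> Dict[str, float]:
    step = min(len(prefix) - 1, len(_SCRIPT) - 1)
    return _SCRIPT[step]


description = greedy_decode(toy_decoder)
```

In a real model, `next_word_probs` would run the decoder with cross-attention over the target features; other decoding strategies such as beam search could be substituted for the greedy choice.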
[0059] In this embodiment, the initial visual features and the initial query vector are jointly optimized by adopting a preset image feature optimization structure, and the target features with both spatial semantic enhancement and semantic guidance capabilities are output. The decoder is then driven by the target features to generate semantic descriptions, which effectively solves the problems of insufficient semantic expression of visual features and difficulty in semantic space alignment in traditional single-stage methods. This allows the model to significantly improve the semantic coverage and accuracy and comprehensiveness of description generation for complex scenes in remote sensing images without introducing external detectors.
[0060] Optionally, the preset image feature optimization structure includes at least a projection layer and a mapping structure. In order to improve the accuracy of determining target features, in the semantic description generation method of remote sensing images provided in Embodiment 1 of this application, the projection layer is used to convert the initial visual features into a preset number of visual vectors to obtain an initial visual vector sequence; the initial visual vector sequence and the initial query vector are combined to obtain an initial combined vector; and the mapping structure is used to process the initial combined vector to obtain the target features.
[0061] In this embodiment of the invention, the preset image feature optimization structure includes a projection layer, which maps the initially extracted dense visual features into a higher-dimensional space. This not only expands the expressive power of the features but also helps reveal the potentially complex relationships between them, so that more comprehensive details and deeper semantic content in the image can be captured. The projection layer is followed by a stack of Transformer-based attention modules (i.e., the mapping structure), which allows the generated visual features to focus more on semantically relevant regions and to filter out information irrelevant to the current task. Through this mechanism, the model can more accurately identify and describe key features in remote sensing images.
[0062] In this embodiment of the invention, the projection layer is first used to convert the initial visual features into a preset number of visual vectors, giving an initial visual vector sequence of k visual vectors. Here, the projection layer is a single-layer linear transformation whose weight matrix is optimized through training. It linearly maps the one-dimensional feature vector into a preset number k of visual vectors of dimension d, forming the initial visual vector sequence, where k is set to 64 or 128 depending on the model capacity and task complexity. Then, the initial visual vector sequence and the initial query vector are combined to obtain the initial combined vector; that is, the learnable query vectors are combined with the k visual vectors. Next, the mapping structure is used to process the initial combined vector to obtain the target features. The mapping structure consists of a stack of Transformer modules, each containing a multi-head self-attention mechanism and a feedforward neural network. Its input is the initial combined vector, and its output is the target features optimized through multiple rounds of semantic interaction, comprising the target visual features and the target query vector. Together, these two constitute the target features, which are used for subsequent semantic description generation.
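As a rough illustration of the projection and combination steps, the following NumPy sketch expands a single 1×512 feature into k vectors and concatenates them with a set of queries. The per-vector dimension `d = 32`, the number of queries `num_queries = 8`, the random weights, and the choice to place the queries before the visual vectors are all assumptions made for the example; they are not values stated in this embodiment.

```python
import numpy as np

rng = np.random.default_rng(0)

feat_dim = 512    # size of the initial visual feature (from the text)
k = 64            # number of visual vectors (the text mentions 64 or 128)
d = 32            # per-vector dimension (assumed, kept small for the demo)
num_queries = 8   # preset number of learnable queries (assumed value)

# Initial visual feature: a single dense 1 x 512 vector.
visual_feature = rng.standard_normal((1, feat_dim))

# Projection layer: one linear map from 512 to k*d, reshaped into k vectors.
W_proj = rng.standard_normal((feat_dim, k * d)) * 0.02
b_proj = np.zeros(k * d)
visual_vectors = (visual_feature @ W_proj + b_proj).reshape(k, d)

# Learnable queries, drawn from a Gaussian as the text describes for initialization.
queries = rng.standard_normal((num_queries, d))

# Initial combined vector: concatenate along the sequence axis (order assumed).
combined = np.concatenate([queries, visual_vectors], axis=0)
```

The combined sequence has `num_queries + k` rows, so every Transformer module downstream can let queries and visual vectors attend to one another.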
[0063] In this embodiment, the structured decomposition of visual features is achieved through a projection layer, a joint input space is constructed by combining the initial query vector, and deep semantic interaction is performed through a mapping structure. This effectively improves the semantic expression granularity and semantic integrity of visual features, enabling the target features to have sufficient modeling ability for complex semantic relationships in remote sensing images without the introduction of external detection modules. This lays a high-dimensional semantic representation foundation for the subsequent generation of accurate and comprehensive semantic descriptions.
[0064] Optionally, the mapping structure includes multiple transformer modules. To further improve the accuracy of determining target features, in the semantic description generation method for remote sensing images provided in Embodiment 1 of this application, an initial combined vector is input to the first transformer module in the mapping structure; the vector output by the first transformer module is used as the input vector of the next transformer module in the mapping structure, until the output vector of the last transformer module in the mapping structure is obtained; and the output vector is represented as the target feature.
[0065] In this embodiment of the invention, the initial combined vector is input to the first transformer module in the mapping structure. The initial combined vector is generated by the projection layer and the concatenation operation: it is a joint vector sequence formed by concatenating the k initial visual vectors with the initial query vectors, and it serves as the initial input to the mapping structure. The mapping structure consists of a series of identical transformer modules stacked sequentially. Each transformer module contains a multi-head self-attention mechanism and a feedforward neural network. The first transformer module receives the initial combined vector, calculates the global dependencies between the vectors in the sequence using the multi-head self-attention mechanism, performs a non-linear transformation using the feedforward neural network, and outputs the intermediate vector sequence optimized in the first round. The vector output by the first transformer module is used as the input vector of the next transformer module in the mapping structure, and so on: each subsequent transformer module takes the output of the previous module as input and repeats the attention interaction and feature mapping operations. Finally, the output vector of the last transformer module in the mapping structure is obtained. This output vector contains the visual and query information enhanced by deep semantic interaction and contextualization, and is taken as the target features, that is, the joint set of target visual features and target query vector, which serves as the input condition for the decoder.
[0066] In this embodiment, by inputting the initial combined vector layer by layer into the stacked transformer module, the visual features and query vectors are progressively interacted and optimized in a multi-level semantic space. This enables the final output target features to have high semantic consistency and context awareness, effectively improving the model's modeling accuracy for the complex semantic structure of remote sensing images and providing a sufficient and stable feature input foundation for generating accurate and detailed semantic descriptions.
[0067] To improve the accuracy of the processing flow of the transformer module, in the semantic description generation method for remote sensing images provided in Embodiment 1 of this application, the input vector is linearly transformed based on the preset weight matrix of the transformer module to obtain the query projection, key projection, and value projection; the dot product between the query projection and the key projection is determined to obtain the attention weight, and the attention weight is normalized; based on the normalized attention weight and the value projection, the head output vector is determined; all head output vectors are concatenated to obtain the concatenated vector, and a preset operation is performed on the concatenated vector to obtain the initial output vector; a multilayer perceptron is used to process the initial output vector to obtain the output vector.
[0068] In this embodiment of the invention, the input vector is first linearly transformed based on the preset weight matrices of the transformer module (obtained through model training) to obtain the query projection, key projection, and value projection. Here, the input vector is the vector sequence output by the previous module, and the preset weight matrices comprise three learnable linear transformation matrices that project the input vector into the query projection Q, the key projection K, and the value projection V, respectively. Then, the query projection Q is multiplied by the transpose of the key projection K to obtain the attention score matrix, and the Softmax function is applied to each row of this matrix to obtain a normalized attention weight matrix, whose elements represent the relative importance of each position vector to the current query. Next, the normalized attention weights are multiplied by the value projection V, and the weighted sum is calculated independently for each attention head, generating S head output vectors. Each head output vector represents the semantic attention pattern of a different subspace. The S head output vectors are then concatenated along the feature dimension to form a concatenated vector, which is linearly mapped by a preset linear transformation matrix (obtained through model training), i.e., the preset operation, to obtain the initial output vector, whose dimension matches that of the input vector. Then, a multilayer perceptron is used to process this initial output vector to obtain the final output vector. The multilayer perceptron consists of two cascaded fully connected layers connected by a ReLU (Rectified Linear Unit) activation function and a Dropout layer. The first layer expands the dimension of the initial output vector, and the second layer reduces the expanded dimension back to the original dimension. The final output vector, nonlinearly enhanced, serves as the final output of the transformer module.
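The processing flow of one transformer module can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`), the scaling of scores by the square root of the head dimension, the fourfold expansion in the perceptron, and the random weights are illustrative choices; layer normalization and residual connections are left out, and Dropout is omitted because it is inactive at inference.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transformer_module(x: np.ndarray, params: dict, num_heads: int):
    """x: (n, d) sequence of visual and query vectors. Returns (output, attention weights)."""
    n, d = x.shape
    head_dim = d // num_heads
    # Linear projections to query, key and value.
    q, k, v = x @ params["Wq"], x @ params["Wk"], x @ params["Wv"]

    # Split the feature axis into heads: (heads, n, head_dim).
    def split(t: np.ndarray) -> np.ndarray:
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)
    # Scaled dot product, normalized row by row with softmax.
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    heads = weights @ v                                   # (heads, n, head_dim)
    # Concatenate the heads along the feature axis, then apply the output map.
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    initial_out = concat @ params["Wo"]
    # Two-layer perceptron: expand, ReLU, reduce back to d.
    hidden = np.maximum(initial_out @ params["W1"], 0.0)
    return hidden @ params["W2"], weights


rng = np.random.default_rng(0)
d, n, S = 16, 6, 4
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, 4 * d), "W2": (4 * d, d)}
params = {name: rng.standard_normal(shape) * 0.1 for name, shape in shapes.items()}
x = rng.standard_normal((n, d))
out, attn = transformer_module(x, params, num_heads=S)
```

Stacking several such modules amounts to feeding `out` back in as the next module's `x`, each module holding its own parameters.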
[0069] For example, the learnable query vectors and the visual vectors are combined and then passed through the subsequent stacked Transformer modules. The specific operations of each Transformer module are as follows:
[0070]
[0071]
[0072]
[0073]
[0074] Here, the operators in the above formulas denote layer normalization and the multi-head attention mechanism used in each Transformer block. Specifically, for the input feature set, the multi-head attention mechanism first generates the projections of the query (Q), key (K), and value (V), which are obtained by applying learned weight matrices to the input features as linear transformations. Next, as shown in equation (4), the dot product between the query and the key is calculated to obtain the attention weights, which are normalized by the Softmax function so that the attention weights of each head sum to 1, where d represents the dimension. Finally, the values (V) are weighted by the normalized attention weights and summed to obtain the output of each head. The outputs of all heads are concatenated and passed through the preset operation to obtain the current output, from which the final output of the current Transformer module is obtained using the following formula:
[0075]
[0076] Here, the multilayer perceptron is a cascaded structure containing two (Fully Connected – Rectified Linear Unit – Dropout) units, and the result of the formula is the output of the Transformer module.
[0077] After passing through multiple stacked Transformer modules, the image feature optimization module yields the learned query vectors and the visual vectors containing rich image features. Both serve as input to the decoder for generating descriptive statements. By mapping and extracting visual features with an attention mechanism, while simultaneously using learnable query vectors to predict the rich semantic information in remote sensing images, a more context-relevant and semantically rich visual representation is obtained. The image feature optimization method based on an attention mechanism and a semantic feature prediction operator thus effectively handles the complexity of visual information.
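For orientation, one common way to write a Transformer module of this kind is given below. This is a hedged reconstruction in standard notation rather than the formulas of this embodiment: the symbols Z, Z', W^Q, W^K, W^V, W^O and the pre-normalization placement of LN with residual additions are assumptions; only Q, K, V, d, the S heads, layer normalization, multi-head attention, and the multilayer perceptron are named in the text.

```latex
\begin{aligned}
Q &= Z W^{Q}, \qquad K = Z W^{K}, \qquad V = Z W^{V} \\
\mathrm{head}_s &= \mathrm{Softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d}}\right) V_s,
  \qquad s = 1, \dots, S \\
\mathrm{MHA}(Z) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_S)\, W^{O} \\
Z' &= \mathrm{MHA}(\mathrm{LN}(Z)) + Z \\
Z_{\mathrm{out}} &= \mathrm{MLP}(\mathrm{LN}(Z')) + Z'
\end{aligned}
```

Here Q_s, K_s, and V_s denote the slices of Q, K, and V belonging to head s, and Z is the sequence of query and visual vectors entering the module.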
[0078] In this embodiment, the input features are deeply optimized at both the semantic association modeling and nonlinear enhancement levels through the synergistic effect of the multi-head attention mechanism and the multi-layer perceptron. This enables the transformer module to effectively capture the semantic dependencies of multiple targets and multiple scales in remote sensing images, significantly improve the semantic integrity and discriminability of target features, and provide a highly discriminative feature foundation for generating semantic descriptions that conform to the true content of the images.
[0079] To improve the prediction accuracy of the preset generation model, in the semantic description generation method for remote sensing images provided in Embodiment 1 of this application, before processing the initial visual features and initial query vector using the preset generation model to generate the target semantic description of the target remote sensing image, an initial generation model is constructed. The model structure of the initial generation model includes at least: an initial image feature optimization structure and an initial preset decoder. A historical remote sensing image set is collected, and each historical remote sensing image in the set is labeled to obtain labeling information. The labeling information includes at least: a set of real semantic features and real descriptive statements. The initial generation model is trained using the historical remote sensing image set and the labeling information corresponding to each historical remote sensing image until the target loss value is less than a preset loss threshold, thus obtaining the preset generation model. The target loss value is determined by a target loss function, which includes: a query loss value and a description statement generation loss value. The target loss function is composed of a query loss function and a description statement generation loss function.
[0080] In this embodiment of the invention, an initial generation model can be constructed first. The model structure of the initial generation model includes an initial image feature optimization structure and an initial preset decoder. The initial image feature optimization structure consists of a projection layer and a stack of transformer modules, and is used to jointly optimize the initial visual features extracted by the CLIP visual encoder with the learnable initial query vector. The initial preset decoder is a stack of Transformer decoder layers that takes the optimized visual and semantic features as input and outputs a natural language description sequence. Then, a set of historical remote sensing images is collected, and each image in the set is labeled to obtain annotation information, which includes a set of true semantic features and true descriptive sentences. The set of true semantic features consists of the semantic words appearing in the labeled image, such as "reservoir," "road," "farmland," and "buildings," whose number and content are determined by the image content. The true descriptive sentences are complete natural language sentences written manually to describe the image content, such as "There is a rectangular reservoir in the center of the image, surrounded by farmland, and a road runs through it to the northeast." Subsequently, the initial generation model is trained using the set of historical remote sensing images and the annotation information corresponding to each historical image until the target loss value is less than a preset loss threshold, resulting in the preset generation model. During training, the model takes each image as input and outputs a set of predicted semantic features and a predicted description. The target loss value is calculated using a target loss function, which is the sum of the query loss and the description generation loss. The query loss is calculated by first matching the predicted query vectors with the true semantic feature set using the Hungarian algorithm and then applying a cross-entropy loss function. The description generation loss uses the standard sequence cross-entropy loss function to measure the difference between the predicted word sequence and the true description. When the target loss value stays below the preset loss threshold (e.g., 0.15) for multiple consecutive rounds of training iterations, training stops, the model parameters are saved, and the preset generation model is obtained.
[0081] In this embodiment, an initial generative model containing an image feature optimization structure and a decoder is constructed, and joint supervised training is performed using labeled remote sensing image data. A composite loss function is used to simultaneously optimize the semantic feature prediction and description generation capabilities, so that the final preset generative model can achieve accurate understanding and natural language description of complex semantics of remote sensing images without the need for an external detector, significantly improving the semantic integrity, accuracy and semantic consistency of the generated results.
[0082] To achieve accurate and efficient training of the model, in the semantic description generation method for remote sensing images provided in Embodiment 1 of this application, during the training of the initial generation model using a set of historical remote sensing images and the annotation information corresponding to each historical remote sensing image, a bipartite graph is constructed based on the predicted query vector output by the optimized structure of the initial image features and the set of real semantic features. The number of query operators in the predicted query vector is greater than the number of real semantic features in the set of real semantic features. Based on the bipartite graph, each query operator is matched with each real semantic feature to obtain the matching cost of each matching result. The matching result includes: the query operator corresponding to each real semantic feature; the matching result corresponding to the minimum matching cost is determined as the target matching result; based on the query operator corresponding to each real semantic feature indicated by the target matching result, a query loss function is used to determine the current query loss value; based on the predicted semantic description output by the initial preset decoder and the real description statement, a description statement generation loss function is used to determine the current description statement generation loss value; based on the current query loss value and the current description statement generation loss value, the current loss value is determined; if the current loss value is less than a preset loss threshold, the current loss value is determined as the target loss value, and training is stopped.
[0083] In this embodiment of the invention, the predicted query vectors (used to predict semantic information in the image) are obtained from the final output of the initial image feature optimization structure and are combined with the set of true semantic features. Assuming that the number of predicted query vectors is greater than the number of true semantic words, a bipartite graph is constructed with the predicted query operators and the true semantic words as its two sets of nodes. In the bipartite graph, each query operator attempts to match with a true word, and each true word searches for the most relevant query operator. The Hungarian algorithm then searches for the optimal matching in this bipartite graph, i.e., the one that minimizes the total matching cost. The lowest-cost matching of the predicted query vectors is computed with the Hungarian algorithm as follows:
[0084] ;
[0085] Here, the terms of the formula denote the true semantic features, the query operators matched to them, and the matching cost.
[0086] By minimizing the matching cost, the matching between the predicted query vectors and the true semantic words is determined. Then, based on the determined matching, the current query loss value is calculated using a query loss function. The query loss function expression is as follows:
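The minimum-cost matching can be illustrated with a small pure-Python sketch. Note that it swaps in an exhaustive search over assignments for the Hungarian algorithm: both return a minimum-cost assignment, but the exhaustive version is only practical for tiny examples. The cost values, and the convention that the cost is the negative probability a query assigns to a word, are assumptions for the demo; the function name `min_cost_matching` is hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def min_cost_matching(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """cost[j][i] is the cost of assigning query i to true word j.

    Returns the query index chosen for each true word and the total cost.
    Requires at least as many queries as true words.
    """
    num_words = len(cost)
    num_queries = len(cost[0])
    best_assign: List[int] = []
    best_total = float("inf")
    # Try every way of giving each true word its own distinct query.
    for assign in permutations(range(num_queries), num_words):
        total = sum(cost[j][i] for j, i in enumerate(assign))
        if total < best_total:
            best_assign, best_total = list(assign), total
    return best_assign, best_total


# Rows: two true words; columns: three queries (more queries than words).
cost = [
    [-0.9, -0.8, -0.1],
    [-0.9, -0.1, -0.1],
]
assignment, total = min_cost_matching(cost)
```

Both words individually prefer query 0, yet the globally cheapest solution gives query 1 to the first word and query 0 to the second, which is the kind of conflict the bipartite matching resolves.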
[0087] ;
[0088] Here, the formula refers to the query matched to each true semantic word after the entire matching process is completed. For all matched pairs, the cross-entropy loss function is used for optimization. The true label of each unmatched query vector is set to a designated value equal to the number of semantic words in the current image description dataset, which indicates that the query vector is not involved in the loss calculation.
[0089] In this embodiment of the invention, the target description of the current input image (i.e., the predicted semantic description output by the initial preset decoder) is represented as a word sequence whose length is the number of words in the description statement. The words of the description statement are first converted by word embedding into dense vector representations. After passing through the stacked Transformer decoder layers, the representation produced at each time step is used to predict the next word of the true description statement. For description generation, the cross-entropy loss function is used to train the model:
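A minimal sketch of a query loss over matched pairs is shown below. It assumes each query outputs a probability distribution over semantic words and averages the cross-entropy over the matched pairs; the averaging, the function name `query_loss`, and the example probabilities are assumptions, and unmatched queries are simply skipped, in line with the statement that they do not take part in the loss.

```python
import math
from typing import Dict, Sequence


def query_loss(
    query_probs: Sequence[Dict[str, float]],
    true_words: Sequence[str],
    assignment: Sequence[int],
) -> float:
    """Mean cross-entropy over matched (true word, query) pairs.

    assignment[j] is the index of the query matched to true_words[j];
    queries that appear in no pair contribute nothing.
    """
    losses = [
        -math.log(query_probs[q][word])
        for word, q in zip(true_words, assignment)
    ]
    return sum(losses) / len(losses)


query_probs = [
    {"reservoir": 0.1, "road": 0.8, "farmland": 0.1},   # query 0
    {"reservoir": 0.7, "road": 0.2, "farmland": 0.1},   # query 1
    {"reservoir": 0.3, "road": 0.3, "farmland": 0.4},   # query 2 (unmatched)
]
loss = query_loss(query_probs, ["reservoir", "road"], assignment=[1, 0])
```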
[0090]
[0091] in, Represents training parameters, It is a moment Descriptive words, This indicates that it is at a given time. Descriptive word sequence Generate words in the case of The predicted probability.
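The description generation loss can be sketched as a token-level negative log-likelihood. The example assumes the decoder emits one vector of unnormalized scores (logits) per time step, computed with the preceding reference words as input; the names `log_softmax` and `caption_loss` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def caption_loss(step_logits, target_ids):
    """Negative log-likelihood of the reference description.

    step_logits[t]: decoder scores over the vocabulary at step t (assumed input).
    target_ids[t]: id of the reference word at step t.
    """
    return -sum(log_softmax(logits)[w] for logits, w in zip(step_logits, target_ids))


# Three steps over a four-word vocabulary with uninformative (uniform) scores.
uniform_loss = caption_loss([[0.0] * 4] * 3, [1, 2, 3])
```

With uniform scores each word has probability 1/4, so the loss equals three times the logarithm of 4, which is a convenient sanity check.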
[0092] The model's total loss function is composed of the query loss $\mathcal{L}_{query}$ from the image feature optimization module and the description statement generation loss $\mathcal{L}_{cap}$: $\mathcal{L} = \mathcal{L}_{query} + \mathcal{L}_{cap}$.
[0093] If the current loss value is less than the preset loss threshold, the current loss value is set as the target loss value, and training stops. For example, if the preset loss threshold is set to 0.15, and the current loss value is lower than this threshold for 5 consecutive training epochs, the model is considered to have converged, training is terminated, and the model parameters are saved as the preset generated model.
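The stopping rule in the example above (threshold 0.15, five consecutive epochs) can be sketched as follows. The total loss is taken here as the unweighted sum of the query loss and the description generation loss, which is an assumption; the function names are hypothetical.

```python
def should_stop(loss_history, threshold=0.15, patience=5):
    """True once the last `patience` epoch losses are all below `threshold`."""
    if len(loss_history) < patience:
        return False
    return all(loss < threshold for loss in loss_history[-patience:])


def run_training(epoch_losses, threshold=0.15, patience=5):
    """Consume per-epoch (query_loss, caption_loss) pairs.

    Returns (epoch_index, target_loss) at the epoch where training stops,
    or None if the stopping condition is never met.
    """
    history = []
    for epoch, (query_loss, caption_loss) in enumerate(epoch_losses):
        history.append(query_loss + caption_loss)  # assumed unweighted sum
        if should_stop(history, threshold, patience):
            return epoch, history[-1]
    return None


# Loss drops below the threshold at epoch 2 and stays there.
losses = [(0.3, 0.3), (0.1, 0.1)] + [(0.05, 0.05)] * 5
result = run_training(losses)
```

Here the loss is first below 0.15 at epoch 2, so the fifth consecutive qualifying epoch is epoch 6, where training stops and the parameters would be saved.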
[0094] In this embodiment, by constructing a bipartite graph matching mechanism and a composite loss function, the global optimal alignment between the query operator and the real semantic features is achieved. Furthermore, the semantic prediction and language generation tasks are jointly optimized, enabling the model to accurately identify the semantics of multiple types of land features in remote sensing images without the need for an external detector, and to generate natural language descriptions that are strictly consistent with the image content. This significantly improves the completeness of semantic coverage, matching accuracy, and semantic consistency of the generated text.
[0095] Figure 4 is a schematic diagram of an optional model structure based on an attention mechanism and semantic feature prediction operators according to an embodiment of the present invention. As shown in Figure 4, the model includes: a CLIP Encoder (visual feature encoder), a Transformer mapping module (including a Projection Layer and Transformer Blocks, i.e., Transformer-based attention modules), and a Transformer Decoder Block (decoder). The input is a remote sensing image. After extraction by CLIP's visual feature encoder, the image visual features (Image Features) are obtained and used as input to the Transformer mapping module. They first pass through the Projection Layer to generate a preset number of visual vectors, which are then combined with the Queries (query vectors). The Transformer Blocks process the combined sequence to produce the target visual tokens and target query tokens. The combined visual tokens and query tokens are then input into the Transformer Decoder Block to generate a semantic description.
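The data flow through the mapping module can be sketched with plain Python lists. This sketch assumes the encoder yields a single feature vector and that the projection layer is one linear map reshaped into a fixed number of tokens; `toy_block` is only a shape-preserving stand-in for a Transformer block, and all names and sizes are hypothetical.

```python
def project(feature, weight, num_tokens, dim):
    """Projection layer: map one feature vector to `num_tokens` vectors of size `dim`.

    weight has (num_tokens * dim) rows, each as long as `feature`.
    """
    flat = [sum(w * x for w, x in zip(row, feature)) for row in weight]
    return [flat[i * dim:(i + 1) * dim] for i in range(num_tokens)]


def mapping_module(feature, weight, queries, blocks):
    """Project, combine with the query vectors, then run the stacked blocks."""
    dim = len(queries[0])
    num_tokens = len(weight) // dim
    visual_tokens = project(feature, weight, num_tokens, dim)
    sequence = visual_tokens + queries      # combined visual and query tokens
    for block in blocks:                    # each block's output feeds the next
        sequence = block(sequence)
    # Split back into target visual tokens and target query tokens.
    return sequence[:num_tokens], sequence[num_tokens:]


def toy_block(sequence):
    """Stand-in for a Transformer block: doubles every value, keeps the shape."""
    return [[2.0 * value for value in token] for token in sequence]


feature = [1.0] * 8
weight = [[0.5] * 8 for _ in range(12)]     # 8 dims -> 3 tokens of 4 dims
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
target_visual, target_queries = mapping_module(
    feature, weight, queries, [toy_block, toy_block]
)
```

Both outputs would then be passed together to the decoder; in a trained model the query vectors are learnable parameters rather than fixed constants.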
[0096] In this invention, a single-stage remote sensing image description method based on an attention mechanism and semantic feature prediction operators is proposed. This method can better handle the complexity and diversity of remote sensing images, generating richer and more accurate descriptive statements. First, dense image features extracted by the image feature extractor are remapped. Then, an added attention mechanism imbues the generated visual features with rich contextual information. Addressing the difficulty of effectively aligning the image visual feature space with the semantic feature space of the descriptive statement in existing single-stage methods, a semantic feature prediction operator is proposed. These learnable query operators interact with the remapped visual features through attention, guiding the image visual features to focus more on key regions in the remote sensing image containing rich semantic information. Through semantic information supervision, the query operators can directly learn key semantic information from image features, thereby significantly improving the model's understanding and expression of remote sensing image content, making the generated descriptions closer to the true semantics of the image.
[0097] The following is a detailed description with reference to another embodiment.
[0098] Embodiment 2
[0099] The semantic description generation device for remote sensing images provided in this embodiment includes multiple implementation units, each of which corresponds to a specific implementation step in Embodiment 1 above.
[0100] Figure 5 is a schematic diagram of an optional semantic description generation device for remote sensing images according to an embodiment of the present invention. As shown in Figure 5, the semantic description generation device may include: a receiving unit 50, an extraction unit 51, and a processing unit 52.
[0101] The receiving unit 50 is used to receive a target remote sensing image;
[0102] Extraction unit 51 is used to extract features from the target remote sensing image using an image feature extractor to obtain initial visual features;
[0103] Processing unit 52 is used to determine the initial query vector and process the initial visual features and the initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image. The model structure of the preset generation model includes at least a preset image feature optimization structure and a preset decoder. The preset image feature optimization structure is used to process the initial visual features and the initial query vector to obtain target features, and the preset decoder is used to process the target features to obtain a target semantic description.
[0104] The aforementioned semantic description generation device can adopt a collaborative processing approach between an image feature optimization structure based on a Transformer and a learnable query vector. By inputting the initial visual features output by the image feature extractor and the initial query vector into a preset image feature optimization structure, a multi-layer attention mechanism is used to remap and enhance the context of the visual features. The query vector is then guided to adaptively learn key semantic information from the image under semantic supervision, generating target features rich in semantic associations. A preset decoder then gradually generates the target semantic description based on these target features, thereby improving the alignment accuracy between visual features and semantic expression space. This achieves efficient modeling of complex features and semantic relationships in remote sensing images under a single-stage architecture, thus solving the technical problem of incomplete descriptions and logical deviations caused by dense visual features and ambiguous semantics in traditional single-stage methods. This significantly improves the accuracy and completeness of semantic descriptions of remote sensing images.
[0105] Optionally, the processing unit includes: a first processing module, used to process the initial visual features and the initial query vector using a preset image feature optimization structure to obtain target features, wherein the target features include: target visual features and target query vector; and a second processing module, used to process the target features using a preset decoder to generate a target semantic description.
[0106] Optionally, the preset image feature optimization structure includes at least a projection layer and a mapping structure. The first processing module includes: a first transformation submodule, used to use the projection layer to convert the initial visual features into a preset number of visual vectors to obtain an initial visual vector sequence; a first combination submodule, used to combine the initial visual vector sequence and the initial query vector to obtain an initial combination vector; and a first processing submodule, used to use the mapping structure to process the initial combination vector to obtain the target features.
[0107] Optionally, the mapping structure includes multiple transformer modules, and the first processing submodule includes: a first input submodule, used to input the initial combined vector to the first transformer module in the mapping structure; a second input submodule, used to take the vector output by each transformer module as the input vector of the next transformer module in the mapping structure, until the output vector of the last transformer module in the mapping structure is obtained; and a first representation submodule, used to represent the output vector as the target features.
[0108] Optionally, the third processing module includes: a first transformation submodule, used to perform a linear transformation on the input vector based on the preset weight matrix of the transformer module to obtain the query projection, key projection, and value projection; a first determination submodule, used to determine the dot product between the query projection and the key projection to obtain the attention weights, and to normalize the attention weights; a second determination submodule, used to determine the head output vector based on the normalized attention weights and the value projection; a first concatenation submodule, used to concatenate all the head output vectors to obtain the concatenated vector, and to perform a preset operation on the concatenated vector to obtain the initial output vector; and a second processing submodule, used to process the initial output vector using a multilayer perceptron to obtain the output vector.
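The per-module processing listed above can be sketched in pure Python. Several details are assumptions not fixed by the text: the normalization is a softmax over dot products scaled by the square root of the head dimension, the "preset operation" on the concatenated vector is a single output projection, and the multilayer perceptron has one ReLU hidden layer. Residual connections and layer normalization are omitted for brevity, and all names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, wq, wk, wv):
    """One head: query/key/value projections, normalized weights, weighted values."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))            # assumed scaling by sqrt(head dim)
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def transformer_block(x, heads, wo, w1, w2):
    outputs = [attention_head(x, *head)[0] for head in heads]
    # Concatenate the head outputs token by token.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    initial = matmul(concat, wo)            # assumed "preset operation"
    hidden = [[max(0.0, value) for value in row] for row in matmul(initial, w1)]
    return matmul(hidden, w2)               # MLP output


def mat(rows, cols, shift):
    """Small deterministic weight matrix for the example."""
    return [[((r * cols + c + shift) % 5 - 2) / 10 for c in range(cols)]
            for r in range(rows)]


x = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
heads = [(mat(4, 2, 0), mat(4, 2, 1), mat(4, 2, 2)),
         (mat(4, 2, 3), mat(4, 2, 4), mat(4, 2, 0))]
out = transformer_block(x, heads, mat(4, 4, 1), mat(4, 8, 2), mat(8, 4, 3))
_, attn_weights = attention_head(x, *heads[0])
```

The output keeps the input shape (three tokens of four dimensions), which is what allows several such modules to be chained.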
[0109] Optionally, the semantic description generation device further includes: a first construction module, used to construct an initial generation model before processing the initial visual features and initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image, wherein the model structure of the initial generation model includes at least: an initial image feature optimization structure and an initial preset decoder; a first annotation module, used to collect a historical remote sensing image set and annotate each historical remote sensing image in the historical remote sensing image set to obtain annotation information, wherein the annotation information includes at least: a set of real semantic features and real descriptive statements; and a first training module, used to train the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image until the target loss value is less than a preset loss threshold to obtain a preset generation model, wherein the target loss value is determined by a target loss function, the target loss value includes: a query loss value and a description statement generation loss value, and the target loss function is composed of a query loss function and a description statement generation loss function.
[0110] Optionally, the semantic description generation device further includes: a second construction module, used to construct, during the training of the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image, a bipartite graph based on the predicted query vector output by the initial image feature optimization structure and the set of real semantic features, wherein the number of query operators in the predicted query vector is greater than the number of real semantic features in the set of real semantic features; a first matching module, used to match each query operator with each real semantic feature based on the bipartite graph to obtain the matching cost of each matching result, wherein the matching result includes: the query operator corresponding to each real semantic feature; a first determination module, used to determine the matching result corresponding to the minimum matching cost as the target matching result; a second determination module, used to determine the current query loss value using the query loss function, based on the query operator corresponding to each real semantic feature indicated by the target matching result; a third determination module, used to determine the current description statement generation loss value using the description statement generation loss function, based on the predicted semantic description output by the initial preset decoder and the real description statement; a fourth determination module, used to determine the current loss value based on the current query loss value and the current description statement generation loss value; and a fifth determination module, used to determine the current loss value as the target loss value and stop training if the current loss value is less than the preset loss threshold.
[0111] The semantic description generation device described above may also include a processor and a memory. The receiving unit 50, the extraction unit 51, the processing unit 52, etc., are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
[0112] The aforementioned processor contains a kernel, which retrieves the corresponding program units from memory. One or more kernels can be configured. By adjusting kernel parameters, the initial query vector is determined, and a preset generation model is used to process the initial visual features and the initial query vector to generate a target semantic description of the remote sensing image.
[0113] The aforementioned memory may include non-permanent memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash RAM; the memory includes at least one memory chip.
[0114] The present invention also provides a computer program product, which, when executed on a data processing device, is suitable for executing an initialization program having the following method steps: receiving a target remote sensing image, using an image feature extractor to extract features from the target remote sensing image to obtain initial visual features, determining an initial query vector, and using a preset generation model to process the initial visual features and the initial query vector to generate a target semantic description of the target remote sensing image.
[0115] According to another aspect of the present invention, a computer program product is also provided, including a non-volatile computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method for generating semantic descriptions of remote sensing images as described above.
[0116] According to another aspect of the present invention, an electronic device is also provided, including one or more processors and a memory, the memory being used to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the above-described method for generating semantic descriptions of remote sensing images.
[0117] Figure 6 is a hardware structure block diagram of an electronic device (or mobile device) for the method for generating semantic descriptions of remote sensing images according to an embodiment of the present invention. As shown in Figure 6, the electronic device may include one or more processors (shown in Figure 6 as 602a, 602b, ..., 602n; the processors may include, but are not limited to, processing devices such as microprocessors (MCUs) or programmable logic devices (FPGAs)) and a memory 604 for storing data. In addition, it may include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a keyboard, a power supply, and/or a camera. Those skilled in the art will understand that the structure shown in Figure 6 is for illustrative purposes only and does not limit the structure of the electronic device described above. For example, the electronic device may also include more or fewer components than shown in Figure 6, or have a configuration different from that shown in Figure 6.
[0118] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0119] The embodiments or examples disclosed herein are not exhaustive, but merely illustrative of some embodiments or examples, and are not intended to limit the scope of protection of this disclosure. Unless otherwise specified, each step in a particular embodiment or example can be implemented as an independent embodiment, and the steps can be arbitrarily combined. For example, a solution after removing some steps in a particular embodiment or example can also be implemented as an independent embodiment, and the order of the steps in a particular embodiment or example can be arbitrarily interchanged. Furthermore, optional methods or examples in a particular embodiment or example can be arbitrarily combined; moreover, embodiments or examples can be arbitrarily combined. For example, some or all steps of different embodiments or examples can be arbitrarily combined, and a particular embodiment or example can be arbitrarily combined with optional methods or examples of other embodiments or examples.
[0120] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0121] In the several embodiments provided by this invention, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed can be implemented through some interfaces, and the indirect coupling or communication connection between units or modules can be electrical or take other forms.
[0122] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0123] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0124] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0125] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A method for generating semantic descriptions of remote sensing images, characterized in that it includes: receiving a target remote sensing image; using an image feature extractor to extract features from the target remote sensing image to obtain initial visual features; and determining an initial query vector, and using a preset generation model to process the initial visual features and the initial query vector to generate a target semantic description of the target remote sensing image, wherein the preset generation model includes at least a preset image feature optimization structure and a preset decoder, the preset image feature optimization structure is used to process the initial visual features and the initial query vector to obtain target features, and the preset decoder is used to process the target features to obtain the target semantic description.
2. The semantic description generation method according to claim 1, characterized in that, The step of processing the initial visual features and the initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image includes: The initial visual features and the initial query vector are processed using the preset image feature optimization structure to obtain the target features, wherein the target features include: target visual features and target query vector; The target features are processed using the preset decoder to generate the target semantic description.
3. The semantic description generation method according to claim 2, characterized in that, The preset image feature optimization structure includes at least a projection layer and a mapping structure. The step of processing the initial visual features and the initial query vector using the preset image feature optimization structure to obtain the target features includes: The projection layer is used to convert the initial visual features into a preset number of visual vectors to obtain an initial visual vector sequence. The initial visual vector sequence and the initial query vector are combined to obtain an initial combined vector; The initial combined vector is processed using the mapping structure to obtain the target features.
4. The semantic description generation method according to claim 3, characterized in that the mapping structure includes multiple transformer modules, and the step of processing the initial combined vector using the mapping structure to obtain the target features includes: inputting the initial combined vector into the first transformer module in the mapping structure; using the vector output by each transformer module as the input vector of the next transformer module in the mapping structure, until the output vector of the last transformer module in the mapping structure is obtained; and representing the output vector as the target features.
5. The semantic description generation method according to claim 4, characterized in that the processing steps of each transformer module include: performing a linear transformation on the input vector based on the preset weight matrix of the transformer module to obtain a query projection, a key projection, and a value projection; determining the dot product between the query projection and the key projection to obtain attention weights, and normalizing the attention weights; determining a head output vector based on the normalized attention weights and the value projection; concatenating all the head output vectors to obtain a concatenated vector, and performing a preset operation on the concatenated vector to obtain an initial output vector; and processing the initial output vector using a multilayer perceptron to obtain the output vector.
6. The semantic description generation method according to claim 1, characterized in that, before processing the initial visual features and the initial query vector using a preset generation model to generate the target semantic description of the target remote sensing image, the method further includes: constructing an initial generation model, wherein the model structure of the initial generation model includes at least: an initial image feature optimization structure and an initial preset decoder; collecting a historical remote sensing image set, and annotating each historical remote sensing image in the set to obtain annotation information, wherein the annotation information includes at least: a set of real semantic features and real description statements; and training the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image until the target loss value is less than a preset loss threshold, thereby obtaining the preset generation model, wherein the target loss value is determined by a target loss function, the target loss value includes a query loss value and a description statement generation loss value, and the target loss function is composed of a query loss function and a description statement generation loss function.
7. The semantic description generation method according to claim 6, characterized in that the process of training the initial generation model using the historical remote sensing image set and the annotation information corresponding to each historical remote sensing image further includes: constructing a bipartite graph based on the predicted query vector output by the initial image feature optimization structure and the set of real semantic features, wherein the number of query operators in the predicted query vector is greater than the number of real semantic features in the set of real semantic features; matching each query operator with each real semantic feature based on the bipartite graph to obtain the matching cost of each matching result, wherein the matching result includes: the query operator corresponding to each real semantic feature; determining the matching result corresponding to the minimum matching cost as the target matching result; determining the current query loss value using the query loss function, based on the query operator corresponding to each real semantic feature indicated by the target matching result; determining the current description statement generation loss value using the description statement generation loss function, based on the predicted semantic description output by the initial preset decoder and the real description statement; determining the current loss value based on the current query loss value and the current description statement generation loss value; and, if the current loss value is less than the preset loss threshold, determining the current loss value as the target loss value and stopping training.
8. A semantic description generation device for remote sensing images, characterized in that it includes: a receiving unit, used to receive a target remote sensing image; an extraction unit, used to extract features from the target remote sensing image using an image feature extractor to obtain initial visual features; and a processing unit, used to determine an initial query vector and process the initial visual features and the initial query vector using a preset generation model to generate a target semantic description of the target remote sensing image, wherein the preset generation model includes at least a preset image feature optimization structure and a preset decoder, the preset image feature optimization structure is used to process the initial visual features and the initial query vector to obtain target features, and the preset decoder is used to process the target features to obtain the target semantic description.
9. A computer program product, characterized in that it includes a non-volatile computer-readable storage medium storing a computer program that, when executed by a processor, implements the semantic description generation method for remote sensing images according to any one of claims 1 to 7.
10. An electronic device, characterized in that it includes one or more processors and a memory, the memory being used to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the semantic description generation method for remote sensing images according to any one of claims 1 to 7.