A remote sensing image change semantic generation method and device

By introducing reconstructive causal filtering and frequency domain focusing mechanisms into the description of changes in remote sensing images, and combining them with object-level prior features, the problems of perspective ambiguity and knowledge ambiguity are solved, and high-precision change description in complex scenes is achieved.

CN122244710A · Pending · Publication Date: 2026-06-19 · HARBIN INST OF TECH AT WEIHAI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HARBIN INST OF TECH AT WEIHAI
Filing Date: 2026-03-18
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing remote sensing image change description technologies struggle to simultaneously resolve perspective ambiguity, scale ambiguity, and knowledge ambiguity within a unified framework, resulting in insufficient robustness and semantic accuracy in complex scenes, changes to small targets, mixed changes to multiple objects, and unchanged samples.

Method used

Within a unified spatiotemporal modeling framework, reconstructive causal filtering, frequency domain focusing enhancement, and object-prior-driven reasoning mechanisms are introduced. A three-frame short video sequence is constructed, joint spatiotemporal features are extracted, causal enhancement and feature refocusing are performed, and natural language descriptions are generated by combining object-level prior features.

Benefits of technology

It significantly improves the ability to identify and describe changes in small targets, mixed changes in multiple targets, and no-change scenarios, generating more accurate, detailed, and robust natural language descriptions of changes.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention proposes a method and device for generating semantic descriptions of changes in remote sensing images. The method includes: constructing a three-frame short video sequence from the pre-temporal remote sensing image, the post-temporal remote sensing image, and a change mask of the same geographic region; extracting joint spatiotemporal features; reconstructing post-temporal features by fusing the pre-temporal features with the change features, and using contrastive learning constraints to constrain the matching relationship between the reconstructed features and the true post-temporal features, thereby obtaining causal enhancement features; refocusing the causal enhancement features in the frequency domain to enhance high-frequency details and suppress the low-frequency background, obtaining fused features, which are then decoupled and aggregated by weighting to output change ground truth features; performing object query reasoning on the change ground truth features to generate object-level prior features; and inputting the change ground truth features and the object-level prior features into a differential text decoder to output change description text. A corresponding device based on this method is also proposed. This invention significantly improves the accuracy and robustness of semantic descriptions of changes in ground features.

Description

Technical Field

[0001] This invention belongs to the field of intelligent analysis of remote sensing images and computer vision technology, and specifically relates to a method and device for generating semantic descriptions of changes in remote sensing images.

Background Technology

[0002] With the rapid development of remote sensing imaging technology and Earth observation capabilities, satellites, drones, and other carriers can continuously acquire high spatial and temporal resolution surface image data, forming massive dual-temporal or multi-temporal remote sensing image sequences. This type of data plays an increasingly important role in resource management, ecological environment monitoring, urban planning and construction, emergency disaster relief, and security defense. To automatically discover and understand changes in ground features from massive amounts of remote sensing images, academia and industry have proposed various deep learning-based methods for remote sensing image change detection and semantic understanding. Traditional remote sensing change analysis typically relies on pixel difference, texture features, or deep convolutional features, using threshold segmentation, convolutional neural networks, or dual-branch networks to detect and classify changed areas. While this can provide the location and category of changed areas to some extent, it is difficult to interpret from a natural language perspective, failing to meet the needs of intelligent monitoring, interpretability, and interactivity. To improve the readability and interpretability of remote sensing change analysis results, tasks and methods for describing changes in remote sensing images have emerged in recent years. This task takes dual-temporal remote sensing images as input and outputs natural language descriptions of ground feature changes. 
Existing change description systems typically employ an "image encoder + feature difference / fusion + text decoder" framework: on the one hand, convolutional neural networks or visual Transformers are used to extract features from images of different time phases; on the other hand, change features are obtained through feature difference, cross attention, or splicing fusion; and finally, descriptive statements are generated through recurrent neural networks, Transformer text decoders, etc.

[0003] While the aforementioned methods have achieved good evaluation metrics on some publicly available datasets, they still have significant shortcomings in complex remote sensing scenarios, mainly in the following aspects. First, remote sensing images are usually acquired from a top-down perspective, so ground objects appear very different from how they appear in natural scene images, and local textures or simple difference features alone are insufficient to accurately distinguish specific ground object categories. Many existing change description methods treat the before and after images only as a static image pair for comparative modeling and lack temporal and global contextual modeling of the change process. As a result, similar-looking ground objects are confused, and perspective and morphological ambiguities cannot be effectively eliminated. Second, existing deep neural networks commonly employ multiple downsampling and pooling operations to expand the receptive field and reduce computational load, so high-resolution details are compressed or even erased layer by layer. Furthermore, some methods respond weakly to high-frequency details, so feature contrast in small-scale change areas is unclear, which causes missed changes or overly coarse descriptions that fail to characterize the change structure finely. Third, remote sensing image change description must not only determine "whether a change has occurred" and "the approximate type of change," but also accurately differentiate among multiple ground object categories and identify specific change behaviors. Most existing methods rely on supervised learning of visual backbone networks on remote sensing data and lack explicit object-level priors and domain knowledge. This insufficient guidance easily leads to overly generic descriptions, incorrectly named ground features, or omission of key changes. Fourth, some methods introduce an independent change detection module that first generates a binary change mask and then multiplies the mask element-wise with the image features to highlight changed areas and suppress unchanged areas. However, this "hard filtering" approach often also erases important background information surrounding the changed area, even though contextual information is crucial for distinguishing similar-looking features and understanding the semantics of change. Fifth, changes in remote sensing images are essentially the evolution of the state of the same region over time, and the outcome of a change is jointly determined by "pre-change state + change behavior." Most existing methods model only the correlation between "before and after" differences, rarely introduce explicit "from before to after" causal constraints, and do not treat the dual-temporal images and change information as a short video sequence for unified spatiotemporal modeling.

[0004] In summary, existing remote sensing image change description technologies often improve change modeling from a single perspective, making it difficult to simultaneously address key challenges such as perspective ambiguity, scale ambiguity, and knowledge ambiguity within a unified framework. As a result, there is room for improvement in robustness and semantic accuracy for complex scenes, small-target changes, mixed changes of multiple objects, and unchanged samples.

Summary of the Invention

[0005] To address the aforementioned technical problems, this invention proposes a method and device for generating semantic representations of changes in remote sensing images. Within a unified spatiotemporal modeling framework, it introduces reconstructive causal filtering, frequency domain focusing enhancement, and object-prior-driven reasoning mechanisms, effectively mitigating perspective ambiguity, scale ambiguity, and knowledge ambiguity. Simultaneously, it significantly improves the ability to discriminate and describe changes in small targets, mixed changes in multiple targets, and unchanged scenes.

[0006] To achieve the above objectives, the present invention adopts the following technical solution. In a first aspect, the present invention proposes a method for generating semantic descriptions of changes in remote sensing images, comprising the following steps: constructing a three-frame short video sequence from the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask of the same geographic area, and extracting the joint spatiotemporal features of the three-frame short video sequence; reconstructing the post-temporal features by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and using contrastive learning constraints to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space, so as to highlight the causal structure of the change process and obtain causal enhancement features; refocusing the causal enhancement features in the frequency domain, in which the causal enhancement features are mapped to the frequency domain, the high-frequency detail response is enhanced, the low-frequency background components are suppressed, and the processed features are transformed back to the spatial domain to obtain fused features; decoupling the change features in the fused features from the global scene features and aggregating the two by weighting to obtain change ground truth features; performing object query reasoning based on the change ground truth features to generate object-level prior features; and inputting the change ground truth features and the object-level prior features together into a differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images.

[0007] In a second aspect, the present invention proposes a remote sensing image change semantic generation device, including at least one processor and a memory, wherein the memory stores a computer program, and the computer program, when executed by the at least one processor, implements the remote sensing image change semantic generation method.

[0008] The effects described in this section are merely those of the embodiments, and not all the effects of the invention. One of the above technical solutions has the following advantages or beneficial effects: This invention proposes a method and device for generating semantic descriptions of changes in remote sensing images. The method includes the following steps: constructing a three-frame short video sequence from the pre-temporal remote sensing image, the post-temporal remote sensing image, and a change mask of the same geographic region; extracting joint spatiotemporal features from the three-frame short video sequence; reconstructing post-temporal features by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and using contrastive learning constraints to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space to highlight the causal structure of the change process, thus obtaining causal enhancement features; refocusing the causal enhancement features in the frequency domain by mapping them to the frequency domain, enhancing high-frequency detail responses and suppressing low-frequency background components, and then transforming the processed features back to the spatial domain to obtain fused features; decoupling the change features in the fused features from the global scene features and aggregating the two by weighting to obtain change ground truth features; performing object query reasoning based on the change ground truth features to generate object-level prior features; and inputting the change ground truth features and the object-level prior features into a differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images. Based on this method, a device for generating semantic descriptions of changes in remote sensing images is also proposed. This invention models the dual-temporal remote sensing images and the change mask as a short video sequence within a unified deep learning framework. It enhances the temporal consistency and causal rationality of the feature representations through reconstructive causal constraints, strengthens the response to small-scale changing targets through a frequency domain focusing mechanism, and combines global context and object-level priors for reasoning. Thus, while resolving the various ambiguities, it generates more accurate, detailed, and robust natural language descriptions of changes.

[0009] This invention treats the pre-temporal image, the change mask, and the post-temporal image as three consecutive frames of a short video for unified spatiotemporal feature modeling, rather than processing the two images separately. The change mask is explicitly embedded in the temporal stream as the intermediate frame, guiding the spatiotemporal features to focus on areas that may have changed while preserving complete contextual information, which effectively addresses viewpoint ambiguity and morphological ambiguity.

[0010] This invention utilizes a reconstructive causal filtering module to fuse the pre-temporal features and the change features to reconstruct the post-temporal features, and then employs contrastive learning constraints to establish a matching relationship between the reconstructed features and the true post-temporal features. This explicit causal constraint enables the model to understand change from a process perspective, rather than relying solely on the correlation between before and after differences, significantly enhancing its ability to understand the semantics of change.

[0011] This invention maps features to the frequency domain and then adaptively enhances high-frequency detail components and suppresses low-frequency background components through learnable channel and spatial masks, effectively improving the response capability to targets with small-scale changes. Simultaneously, through decoupling and weighted aggregation, it achieves the organic fusion of local variation details and global contextual information, avoiding the destruction of contextual information caused by simple hard filtering.

[0012] This invention introduces learnable object query embeddings that interact with the change ground truth features to mine potential ground objects in the images and predict their change probabilities. This object-level prior knowledge guides the text decoding process, enabling the generated descriptions to accurately distinguish different ground object categories and identify specific change behaviors, and significantly improving the accuracy of the semantic descriptions.

[0013] This invention comprehensively considers text generation loss, causal constraint loss, and object classification loss to construct a multi-task loss function for joint optimization training of the overall model. Through multi-task learning, the modules promote each other, jointly improving the overall performance of semantic change generation.

Description of the Drawings

[0014] Figure 1 is a flowchart of the remote sensing image change semantic generation method proposed in Embodiment 1 of the present invention; Figure 2 shows the initial remote sensing image and the remote sensing image taken several years later, as used in Embodiment 1 of the present invention; Figure 3 is a comparison chart of attention focus in Embodiment 1 of the present invention; Figure 4 is a schematic diagram of the description results in Embodiment 1 of the present invention; Figure 5 is a schematic diagram of the algorithm of the remote sensing image change semantic generation method proposed in Embodiment 1 of the present invention; Figure 6 is a schematic diagram of the remote sensing image change semantic generation device proposed in Embodiment 2 of the present invention.

Detailed Implementation

[0015] To clearly illustrate the technical features of this solution, the invention will be described in detail below through specific embodiments and in conjunction with the accompanying drawings. The following disclosure provides many different embodiments or examples for implementing different structures of the invention. To simplify the disclosure of the invention, components and arrangements of specific examples are described below. Furthermore, reference numerals and / or letters may be repeated in different examples. This repetition is for simplification and clarity and does not in itself indicate a relationship between the various embodiments and / or arrangements discussed. It should be noted that the components illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components, processing techniques, and processes are omitted in this invention to avoid unnecessarily limiting the invention.

[0016] Embodiment 1

Embodiment 1 of this invention proposes a method for generating semantic descriptions of changes in remote sensing images. This method addresses the shortcomings of existing remote sensing image change description techniques, which often improve change modeling from a single perspective and struggle to simultaneously resolve key challenges such as perspective ambiguity, scale ambiguity, and knowledge ambiguity within a unified framework. Consequently, their robustness and semantic accuracy for complex scenes, small-target changes, mixed changes of multiple objects, and unchanged samples need further improvement.

[0017] Figure 1 is a flowchart of the remote sensing image change semantic generation method proposed in Embodiment 1 of the present invention. In step S1, the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask of the same geographic area are acquired; a three-frame short video sequence is constructed from the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask; and the joint spatiotemporal features of the three-frame short video sequence are extracted.

[0018] The technical solution of this invention can be widely applied in the following remote sensing Earth observation scenarios, in which semantic understanding of changes in dual-temporal images of the same area is required: land use and land cover change monitoring: detecting and semantically describing changes in cultivated land, deforestation, wetland degradation, and desertification within the same geographic region; urban expansion and planning assessment: generating semantic change descriptions for the expansion of construction land, new buildings, road extensions, and industrial park development within the same city; infrastructure construction and operation and maintenance management: tracking and describing the progress of infrastructure construction, such as transportation networks, power facilities, and water conservancy projects, within the same region; ecological and environmental assessment and protection: semantic understanding of changes in vegetation cover, water area, and ecological restoration effects within the same nature reserve; disaster loss assessment and emergency response: rapidly describing building damage, vegetation destruction, and terrain alterations in the same affected area before and after natural disasters such as earthquakes, floods, and fires; public safety and law enforcement supervision: monitoring changes and generating semantic reports for illegal land occupation, unauthorized construction, and unusual activities in border areas within the same region.

[0019] By organically combining and jointly modeling pre-temporal remote sensing images, post-temporal remote sensing images, and change masks, this invention can accurately understand the spatiotemporal change process of land features within the same geographic area and output fine-grained, highly semantically accurate change descriptions in natural language.

[0020] This application constructs the three-frame short video sequence as follows: the pre-temporal remote sensing image serves as the first frame, the change mask as the second frame, and the post-temporal remote sensing image as the third frame, forming the input triplet. The three frames are stacked along the time dimension to form video-format data that is input to the spatiotemporal feature encoding network.
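
The stacking step can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: the function name `build_three_frame_sequence`, the `(T, C, H, W)` layout, and the choice to replicate the single-channel mask across the image channels are all assumptions made here so that the three frames share one shape.

```python
import numpy as np

def build_three_frame_sequence(pre_img, change_mask, post_img):
    """Stack (pre image, change mask, post image) along a new time axis.

    pre_img, post_img: float arrays of shape (C, H, W).
    change_mask: array of shape (H, W), binary or probabilities in [0, 1].
    Returns an array of shape (3, C, H, W), time axis first.
    """
    if pre_img.shape != post_img.shape:
        raise ValueError("pre- and post-temporal images must share a shape")
    if change_mask.shape != pre_img.shape[1:]:
        raise ValueError("mask must match the image height and width")
    # Assumption: replicate the one-channel mask over C channels so that
    # the middle frame has the same shape as the two image frames.
    mask_frame = np.broadcast_to(change_mask.astype(pre_img.dtype), pre_img.shape)
    return np.stack([pre_img, mask_frame, post_img], axis=0)

rng = np.random.default_rng(0)
pre = rng.random((3, 8, 8), dtype=np.float32)
post = rng.random((3, 8, 8), dtype=np.float32)
mask = (rng.random((8, 8)) > 0.7).astype(np.float32)
video = build_three_frame_sequence(pre, mask, post)
print(video.shape)  # (3, 3, 8, 8)
```

A 3D-convolutional or video Transformer encoder would typically add a batch axis in front of this array; that axis is omitted here for brevity.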

[0021] The change mask is generated by a change detection network that processes the pre-temporal and post-temporal remote sensing images. The change detection network identifies the pixel regions that have changed between the pre-temporal and post-temporal remote sensing images and outputs the change mask in the form of a binary image or a probability map.

[0022] The change detection network in this invention employs a Siamese network structure, a convolutional neural network, a Transformer, or a hybrid of these. The scope of protection of this invention is not limited to the network types specifically listed in Embodiment 1; those skilled in the art can make reasonable selections based on actual circumstances.

[0023] This invention treats the pre-temporal remote sensing image, the change mask, and the post-temporal remote sensing image as three consecutive frames of a short video sequence for unified spatiotemporal feature modeling, rather than processing the two images separately. The change mask is explicitly embedded as the intermediate frame in the temporal stream to guide the spatiotemporal features to focus on areas that may have changed.

[0024] To extract the joint spatiotemporal features of the three-frame short video sequence, the sequence is input into a trained spatiotemporal feature encoding network, which extracts the joint spatiotemporal features through 3D convolution or a video Transformer structure.

[0025] In step S2, the post-temporal features are reconstructed by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and contrastive learning constraints are used to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space, so as to highlight the causal structure of the change process and obtain causal enhancement features. The detailed process is as follows: reconstructive causal filtering is applied to the joint spatiotemporal features, fusing the pre-temporal features with the change features extracted from the change mask to reconstruct the post-temporal features. The reconstructed post-temporal features serve as query samples, the true post-temporal features as positive sample keys, and the post-temporal features of other samples as negative sample keys, and a contrastive causal constraint loss is applied. In the feature space, this loss constrains the reconstructed post-temporal features to match the true post-temporal features and increases their distinguishability from the post-temporal features of other samples. The causal enhancement features are then output.

[0026] In this invention, a reconstructive causal filtering module is designed to perform reconstructive causal filtering on the joint spatiotemporal features. The module is divided into a filtering reconstruction modulation unit and an inter-frame causal constraint unit. The filtering reconstruction modulation unit suppresses irrelevant or noisy components in the temporal features and highlights key change-related information by linearly modulating and reconstructing the pre-temporal and post-temporal features. The inter-frame causal constraint unit fuses the pre-temporal features with the change features to generate pseudo post-temporal features. Through a contrastive loss, it constrains the real and pseudo post-temporal features to be consistent in the feature space while reducing their similarity to the pseudo post-temporal features of other samples, thereby explicitly characterizing the causal structure.
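
The contrastive causal constraint can be illustrated with an InfoNCE-style loss, a common form of contrastive objective. The patent text does not state the exact loss formula, so the cosine similarity, the temperature value, and the use of the other samples in the batch as negative keys are assumptions of this sketch, as is the function name `info_nce_loss`.

```python
import numpy as np

def info_nce_loss(reconstructed, true_post, temperature=0.07):
    """InfoNCE-style loss over a batch of pooled feature vectors.

    reconstructed: (B, D) reconstructed (pseudo) post-temporal features, used as queries.
    true_post:     (B, D) true post-temporal features, used as keys.
    Row i of true_post is the positive key for row i of reconstructed;
    all other rows act as negative keys.
    """
    q = reconstructed / np.linalg.norm(reconstructed, axis=1, keepdims=True)
    k = true_post / np.linalg.norm(true_post, axis=1, keepdims=True)
    logits = q @ k.T / temperature                        # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The positive pair for each query sits on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
true_post = rng.normal(size=(4, 16))
reconstructed = true_post + 0.01 * rng.normal(size=(4, 16))

loss_matched = info_nce_loss(reconstructed, true_post)
# Reversing the batch pairs every query with the wrong key.
loss_mismatched = info_nce_loss(reconstructed[::-1], true_post)
print(loss_matched < loss_mismatched)  # True
```

The loss is small when each reconstructed feature is closest to its own true post-temporal feature and large when it is closer to the feature of another sample, which is the matching behavior the constraint is meant to encourage.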

[0027] In step S3, the causal enhancement features are refocused in the frequency domain. After the causal enhancement features are mapped to the frequency domain, the high-frequency detail response is enhanced and the low-frequency background components are suppressed. The processed features are then transformed back to the spatial domain to obtain the fused features. The change features in the fused features are decoupled from the global scene features and aggregated with them by weighting to obtain the change ground truth features. The detailed process is as follows: the causal enhancement features are mapped to the frequency domain through the discrete Fourier transform, and learnable channel masks and spatial masks are applied to adaptively enhance the high-frequency detail components and suppress the low-frequency background components; the fused features are then obtained by returning to the spatial domain through the inverse Fourier transform. The fused features are decoupled and weighted so that the ground truth components representing local changes are extracted while the global contextual components reflecting the overall layout are retained, and the change ground truth features are output.
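
The frequency-domain part of step S3 can be sketched with NumPy's FFT routines. In the invention the channel and spatial masks are learnable; here they are replaced by fixed stand-ins (a radial high-pass gain and an all-ones channel mask) purely for illustration. The function names and the gain values 0.2 and 1.5 are assumptions of this sketch.

```python
import numpy as np

def radial_highpass_mask(h, w, low_gain=0.2, high_gain=1.5):
    """Fixed stand-in for a learnable mask: gain grows with frequency radius."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    return low_gain + (high_gain - low_gain) * radius / radius.max()

def frequency_refocus(features, channel_mask, spatial_mask):
    """features: (C, H, W); channel_mask: (C,); spatial_mask: (H, W) over frequency bins."""
    spectrum = np.fft.fft2(features, axes=(-2, -1))           # to the frequency domain
    gain = channel_mask[:, None, None] * spatial_mask[None, :, :]
    refocused = np.fft.ifft2(spectrum * gain, axes=(-2, -1))  # back to the spatial domain
    # The mask is symmetric in frequency, so the imaginary part is rounding noise.
    return refocused.real

h = w = 8
yy, xx = np.indices((h, w))
background = np.ones((1, h, w))                  # pure low-frequency (DC) content
checker = ((-1.0) ** (yy + xx))[None, :, :]      # pure highest-frequency content

mask = radial_highpass_mask(h, w)
channel_mask = np.ones(1)
out_background = frequency_refocus(background, channel_mask, mask)
out_checker = frequency_refocus(checker, channel_mask, mask)

print(np.allclose(out_background, 0.2 * background))  # True: background attenuated
print(np.allclose(out_checker, 1.5 * checker))        # True: fine detail amplified
```

The two test patterns isolate the extremes of the spectrum: a constant map lives entirely in the zero-frequency bin and is scaled by the low gain, while a checkerboard lives entirely in the highest-frequency bin and is scaled by the high gain.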

[0028] In the process of frequency domain refocusing of causal enhancement features, this invention constructs a context-refocusing module, combining a frequency refocusing attention unit with a context-decoupling aggregation unit. The frequency refocusing attention unit first decomposes and enhances features in the channel and spatial dimensions, then maps the features to the frequency domain. Through learnable channel and spatial masks, it adaptively suppresses low-frequency background components and enhances high-frequency detail responses, before returning to the spatial domain via inverse transformation. The context-decoupling aggregation unit decouples and weights the changing features from the global scene features, extracting ground truth features of local changes while retaining global contextual information reflecting the overall layout and relationships between features, ultimately resulting in a unified feature representation that balances local changes and overall context.
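
The decoupling and weighted aggregation can be illustrated with a deliberately simple sketch. The patent text does not specify how the context-decoupling aggregation unit computes its components or weights, so everything here is an assumption: the global scene component is taken as the per-channel spatial mean, the local change component as the residual, and the weight `alpha` as a fixed scalar rather than a learned quantity.

```python
import numpy as np

def decouple_and_aggregate(fused, alpha=0.7):
    """fused: (C, H, W) fused features. alpha weights the local change component.

    Assumption: global scene context = per-channel spatial mean,
    local change detail = residual after removing that mean.
    """
    global_scene = fused.mean(axis=(1, 2), keepdims=True)   # (C, 1, 1)
    local_change = fused - global_scene                     # (C, H, W)
    # Weighted aggregation; broadcasting spreads the global term over H x W.
    return alpha * local_change + (1.0 - alpha) * global_scene

rng = np.random.default_rng(0)
fused = rng.normal(size=(4, 8, 8))
change_gt = decouple_and_aggregate(fused, alpha=0.7)
print(change_gt.shape)  # (4, 8, 8)

# A feature map with no local variation keeps only its (down-weighted) context.
flat = np.full((4, 8, 8), 2.0)
flat_out = decouple_and_aggregate(flat, alpha=0.7)
print(np.allclose(flat_out, 0.3 * 2.0))  # True
```

With `alpha` below 1 the global context is never discarded entirely, which mirrors the stated goal of avoiding the context loss caused by hard mask filtering.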

[0029] In step S4, object query reasoning is performed based on the change ground truth features to generate object-level prior features; the change ground truth features and the object-level prior features are input together into the differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images.

[0030] The process of performing object query reasoning based on the change ground truth features to generate object-level prior features is as follows: the change ground truth features interact with a set of learnable object query embeddings, and an attention mechanism is used to calculate the similarity between each object query embedding and the change ground truth features in order to mine potential ground objects in the image. Based on the similarity, the category of each object and the probability that it has changed are predicted, and object-level prior features representing the specific land cover categories and their change states are constructed and output. An object classification loss is applied for supervision.
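
The object query reasoning can be sketched as a single cross-attention step followed by two small prediction heads. This is an illustration under stated assumptions: the scaled dot-product similarity, the linear heads `w_cls` and `w_chg`, and the construction of the prior by concatenating the attended features with the predicted probabilities are hypothetical choices, not details given in the patent text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def object_query_reasoning(change_feats, queries, w_cls, w_chg):
    """change_feats: (N, D) change ground truth features as N spatial tokens.
    queries: (Q, D) learnable object query embeddings.
    w_cls: (D, K) hypothetical class head; w_chg: (D,) hypothetical change head.
    """
    d = queries.shape[1]
    # Similarity between each query and every spatial token, normalized per query.
    attn = softmax(queries @ change_feats.T / np.sqrt(d), axis=-1)   # (Q, N)
    object_feats = attn @ change_feats                               # (Q, D)
    class_probs = softmax(object_feats @ w_cls, axis=-1)             # (Q, K)
    change_prob = 1.0 / (1.0 + np.exp(-(object_feats @ w_chg)))      # (Q,)
    # Assumption: the prior concatenates features, class and change predictions.
    prior = np.concatenate(
        [object_feats, class_probs, change_prob[:, None]], axis=1)
    return prior, class_probs, change_prob

rng = np.random.default_rng(0)
change_feats = rng.normal(size=(16, 8))   # 16 spatial tokens, 8 channels
queries = rng.normal(size=(4, 8))         # 4 object queries
w_cls = rng.normal(size=(8, 5))           # 5 land cover classes
w_chg = rng.normal(size=8)

prior, class_probs, change_prob = object_query_reasoning(
    change_feats, queries, w_cls, w_chg)
print(prior.shape)  # (4, 14)
```

During training, `class_probs` would be the quantity supervised by the object classification loss; the random weights above stand in for parameters that would be learned.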

[0031] The process of inputting the change ground truth features and the object-level prior features into the differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images is as follows: the change ground truth features and the object-level prior features are input together into the differential text decoder, which is based on the Transformer architecture. During text decoding, the decoder first models the context of the already generated word sequence through a self-attention mechanism. It then uses a cross-attention mechanism to attend to the change ground truth features and the object-level prior features respectively, fusing visual change information and object semantic information to predict the next word, word by word. Finally, a sentence describing the change is generated, and a text generation loss is applied for supervision.

[0032] This invention designs a prior-knowledge-guided differential decoder for the text generation stage. It takes the change ground truth features output by the context-enhanced feature module and the object-level prior features output by the object query reasoning module as conditional inputs. Inside the decoder, multiple layers of self-attention and cross-attention alternately aggregate the language context, the change region features, and the object prior features to generate the change description sentence word by word.
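
The alternation of self-attention and the two cross-attentions can be sketched as one simplified decoder layer. This is not the decoder of the invention: learned projections, multiple heads, layer normalization, and the feed-forward sublayer are all omitted, and the function names are hypothetical. It assumes the word embeddings, the change features, and the prior features share the same width `D`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ v

def decoder_layer(tokens, change_feats, prior_feats):
    """tokens: (T, D) embeddings of the words generated so far.
    change_feats: (N, D) change ground truth features.
    prior_feats: (Q, D) object-level prior features.
    """
    t = tokens.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))   # no peeking at later words
    x = tokens + attention(tokens, tokens, tokens, mask=causal)   # language context
    x = x + attention(x, change_feats, change_feats)              # visual change information
    x = x + attention(x, prior_feats, prior_feats)                # object semantic information
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
change_feats = rng.normal(size=(16, 8))
prior_feats = rng.normal(size=(4, 8))

out = decoder_layer(tokens, change_feats, prior_feats)
print(out.shape)  # (5, 8)
```

In a full decoder the last row of `out` would pass through a vocabulary projection to score the next word, and several such layers would be stacked.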

[0033] Figure 5 is a schematic diagram of the algorithm of the remote sensing image change semantic generation method proposed in Embodiment 1 of the present invention. Combining steps S1 to S4 above, the implementation of the present invention can be expressed as follows:

$$F_{st} = \mathcal{E}(V)$$
$$F_{c} = \mathcal{R}(F_{st})$$
$$F_{g} = \mathcal{C}(F_{c})$$
$$P = \mathcal{Q}(F_{g})$$
$$Y = \mathcal{D}(F_{g}, P)$$

where $V$ denotes the three-frame short video sequence; $F_{st}$ denotes the joint spatiotemporal features; $F_{c}$ denotes the causal enhancement features; $F_{g}$ denotes the change ground truth features; $P$ denotes the object-level prior features; $\mathcal{E}$ denotes the trained spatiotemporal feature encoding network; $\mathcal{R}$ denotes the reconstructive causal filtering module; $\mathcal{C}$ denotes the context-refocusing module of step S3; $\mathcal{Q}$ denotes the object query reasoning module; $\mathcal{D}$ denotes the differential text decoder; and $Y$ denotes the output text sequence.

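[0034] During training, this invention comprehensively considers the text generation loss, the causal constraint loss, and the object classification loss to construct a multi-task loss function, specifically:

$$\mathcal{L} = \mathcal{L}_{text} + \lambda_{1}\,\mathcal{L}_{causal} + \lambda_{2}\,\mathcal{L}_{obj}$$

where $\mathcal{L}$ denotes the multi-task loss function; $\mathcal{L}_{text}$, $\mathcal{L}_{causal}$, and $\mathcal{L}_{obj}$ denote the text generation loss, the causal constraint loss, and the object classification loss, respectively; $\lambda_{1}$ denotes the weighting coefficient of the causal constraint loss; and $\lambda_{2}$ denotes the weighting coefficient of the object classification loss.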
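
As a small worked example of the multi-task objective, the text generation loss is added to the causal constraint loss and the object classification loss, each scaled by its weighting coefficient. The patent text gives no values for the coefficients, so the numbers and the parameter names `lambda_causal` and `lambda_object` below are purely illustrative.

```python
def multi_task_loss(text_loss, causal_loss, object_loss,
                    lambda_causal, lambda_object):
    """Text generation loss plus weighted causal and object classification losses."""
    return text_loss + lambda_causal * causal_loss + lambda_object * object_loss

# Illustrative values only: 2.0 + 0.5 * 1.0 + 0.25 * 4.0 = 3.5
total = multi_task_loss(text_loss=2.0, causal_loss=1.0, object_loss=4.0,
                        lambda_causal=0.5, lambda_object=0.25)
print(total)  # 3.5
```

Setting a coefficient to zero removes the corresponding auxiliary task from the objective, which is one way to check how much each constraint contributes.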

[0035] To fully illustrate the process proposed in Embodiment 1 of the present invention, a specific example is provided below. Figure 2(a) is the initial remote sensing image, and Figure 2(b) is a remote sensing image taken several years later. In this example, the inputs are the initial remote sensing image and the remote sensing image taken several years later. Based on the change mask, bounding boxes are located and the major areas of geographic information change are identified. This guides the model to focus on areas with significant changes and to perform logical reasoning based on these areas. The red borders in Figure 2 represent the detection bounding boxes.
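The text does not specify how bounding boxes are derived from the change mask. One common approach, shown here purely as an assumption, is to take the bounding box of each 4-connected region of changed pixels in a binary mask. The function name `mask_to_boxes` and the box format `(row_min, col_min, row_max, col_max)` are hypothetical.

```python
from collections import deque

def mask_to_boxes(mask):
    """Return one (row_min, col_min, row_max, col_max) box per 4-connected
    region of non-zero pixels, in row-major order of first discovery."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill over the changed region.
            queue = deque([(r, c)])
            seen[r][c] = True
            r0, c0, r1, c1 = r, c, r, c
            while queue:
                y, x = queue.popleft()
                r0, c0 = min(r0, y), min(c0, x)
                r1, c1 = max(r1, y), max(c1, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((r0, c0, r1, c1))
    return boxes
```

For a probability-map mask, the values would first need to be thresholded to 0 or 1; the threshold is a design choice not stated in the text.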

[0036] For each detection bounding box above, the causal relationships of the key objects in the box (houses, roads, vegetation, etc.) are filtered to generate post-temporal features and to characterize the causal structure.
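The idea behind this causal filtering step, reconstructing post-temporal features from pre-temporal and change features and then constraining them with contrastive learning, can be sketched as follows. The sketch assumes an InfoNCE-style loss with cosine similarity and a temperature of 0.1, and it assumes element-wise addition as the fusion operation; none of these specifics are stated in the text, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; both vectors must have non-zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def reconstruct_post(pre_feat, change_feat):
    """Assumed fusion: element-wise sum of pre-temporal and change features."""
    return [p + c for p, c in zip(pre_feat, change_feat)]

def causal_contrastive_loss(recon, true_post, other_posts, temperature=0.1):
    """InfoNCE-style loss: the reconstructed feature should match the true
    post-temporal feature of the same sample (positive key) and differ from
    the post-temporal features of other samples (negative keys)."""
    pos = cosine(recon, true_post) / temperature
    negs = [cosine(recon, o) / temperature for o in other_posts]
    logits = [pos] + negs
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - pos  # equals -log softmax probability of the positive key
```

The loss is small when the reconstruction aligns with its own true post-temporal feature and large when it aligns better with another sample's feature.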

[0037] Based on this causal structure and these features, background components (irrelevant ground features, color differences due to seasonal variation, etc.) are suppressed, while key changes are brought into focus and associated with one another. Figure 3 shows the attention focus comparison: Figure 3(a) shows the basic feature map after preliminary feature extraction but before attention enhancement, and Figure 3(b) shows the heatmap after attention-focused enhancement. Red areas represent areas of significant change, precisely pinpointing the actual locations of changes in ground features; orange/yellow areas represent areas of moderate interest, typically corresponding to the edges or transition zones of change areas; green areas represent areas of low interest, containing contextual information related to the change; and blue areas represent background areas of no interest, where interference caused by seasonal changes, lighting differences, etc. is effectively suppressed. As Figure 3 shows, the feature enhancement of the present invention enables the model to focus its attention accurately on the areas of real geographic information change while ignoring irrelevant interference, providing high-quality visual feature input for subsequent change description generation.
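The suppression of low-frequency background and enhancement of high-frequency detail can be illustrated with a small frequency-domain filter. This sketch is an assumption-laden simplification: it uses a fixed two-level gain mask (`low_gain`, `high_gain`, `cutoff` are hypothetical parameters) on a single-channel feature map, whereas the invention describes learnable channel and spatial masks. A naive discrete Fourier transform is used so the example needs only the standard library.

```python
import cmath

def dft(seq, inverse=False):
    """Naive 1D discrete Fourier transform (O(n^2)), or its inverse."""
    n = len(seq)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(seq[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def dft2(grid, inverse=False):
    """2D transform: apply the 1D transform to rows, then to columns."""
    rows = [dft(row, inverse) for row in grid]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

def frequency_refocus(feature_map, low_gain=0.5, high_gain=1.5, cutoff=1):
    """Scale low-frequency components by low_gain and the rest by high_gain,
    then return to the spatial domain."""
    h, w = len(feature_map), len(feature_map[0])
    spectrum = dft2(feature_map)
    for u in range(h):
        for v in range(w):
            # Distance from the zero-frequency bin, accounting for wrap-around,
            # so the mask is symmetric and a real input gives a real output.
            fu, fv = min(u, h - u), min(v, w - v)
            gain = low_gain if max(fu, fv) < cutoff else high_gain
            spectrum[u][v] *= gain
    restored = dft2(spectrum, inverse=True)
    return [[value.real for value in row] for row in restored]
```

With the default `cutoff=1`, only the zero-frequency (mean) component counts as background, so a flat map is attenuated while fine alternating detail is amplified.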

[0038] Based on the aforementioned key regions, the changes between consecutive frames, and the attention weight distribution, the model learns the changes between the two remote sensing images in detail and expresses them in the natural language output. Figure 4 is a schematic diagram of the description results obtained in Embodiment 1 of the present invention; the output accurately describes the disappearance and addition of geographic information, providing technical support for ground remote sensing monitoring.

[0039] Experimental results on the public datasets LEVIR-CC and WHU-CDC demonstrate that the method of this invention significantly outperforms existing methods on evaluation metrics such as BLEU, METEOR, ROUGE-L, and CIDEr, achieving state-of-the-art performance. Specific comparisons are as follows:

Table 1: Performance comparison on the LEVIR-CC dataset

[0040] Table 2: Performance Comparison on the WHU-CDC Dataset

[0041] The remote sensing image change semantic generation method proposed in Embodiment 1 of this invention can treat dual-temporal remote sensing images and change masks as short video sequences for spatiotemporal modeling within a unified deep learning framework. It improves the temporal consistency and causal rationality of feature representations through reconstructive causal constraints, enhances the response of small-scale change targets through frequency domain focusing mechanisms, and combines global context and object-level priors for reasoning. Thus, it generates more accurate, detailed, and robust natural language descriptions of changes while solving various ambiguity problems.
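Treating the two images and the change mask as a short video sequence can be sketched as follows, with the frame order pre-temporal image, change mask, post-temporal image. The function name `build_three_frame_sequence` is hypothetical, and replicating the single-channel mask across the image channels so that all frames share one shape is an assumption of this sketch, not something the text specifies.

```python
def build_three_frame_sequence(pre_image, change_mask, post_image):
    """Stack [pre-temporal image, change mask, post-temporal image] along a
    leading time axis.

    pre_image, post_image: H x W x C nested lists.
    change_mask: H x W nested list (binary values or probabilities).
    Returns a T x H x W x C nested list with T = 3.
    """
    height, width = len(pre_image), len(pre_image[0])
    channels = len(pre_image[0][0])
    if len(post_image) != height or len(post_image[0]) != width:
        raise ValueError("pre- and post-temporal images must share a shape")
    if len(change_mask) != height or len(change_mask[0]) != width:
        raise ValueError("change mask must match the image height and width")
    # Assumption: copy the mask value into every channel so the mask frame
    # has the same shape as the image frames.
    mask_frame = [[[float(m)] * channels for m in row] for row in change_mask]
    return [pre_image, mask_frame, post_image]
```

The resulting sequence has the layout expected by video-style encoders such as 3D convolutional networks or video Transformers, which consume a time axis in addition to the spatial axes.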

[0042] Embodiment 2: This invention also proposes a device for generating semantic descriptions of changes in remote sensing images. Figure 6 is a schematic diagram of the remote sensing image change semantic generation device proposed in Embodiment 2 of the present invention.

[0043] At the hardware level, the electronic device 600 includes a processor 610 and, optionally, an internal bus 620, a network interface 630, and memory. The memory may include main memory 640, such as high-speed random-access memory (RAM), and may also include non-volatile memory 650, such as at least one disk drive. The electronic device may also include other hardware required for other operations. The processor 610, network interface 630, and memory can be interconnected via the internal bus 620. The internal bus 620 can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus can be categorized as an address bus, data bus, control bus, etc. For ease of illustration, only a single bidirectional arrow is used in the diagram, but this does not imply that there is only one bus or one type of bus. The memory is used to store programs. Specifically, a program can include program code, which includes computer operation instructions. The memory provides instructions and data to the processor 610. The processor 610 reads the corresponding computer program from the non-volatile memory 650 into the main memory 640 and then runs it, forming the remote sensing image change semantic generation device at the logical level. The processor 610 executes the program stored in memory and specifically performs the following: In step S1, the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask of the same geographic area are acquired; a three-frame short video sequence is constructed from the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask; and the joint spatiotemporal features of the three-frame short video sequence are extracted.

[0044] In step S2, the post-temporal features are reconstructed by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and contrastive learning constraints are used to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space, so as to highlight the causal structure of the change process and obtain causal enhancement features. In step S3, the causal enhancement features are refocused in the frequency domain: after the causal enhancement features are mapped to the frequency domain, the high-frequency detail response is enhanced and the low-frequency background components are suppressed, and the processed features are transformed back to the spatial domain to obtain the fused features. The change features and the global scene features in the fused features are decoupled, then weighted and aggregated, to obtain the change truth features. In step S4, object query reasoning is performed based on the change truth features to generate object-level prior features; the change truth features and the object-level prior features are input together into the differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images.
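The object query reasoning in step S4 can be sketched as follows: each object query attends over the change truth features, and the pooled result is used to predict an object category and a change probability. This is a minimal sketch under stated assumptions; the function name, the linear classification heads `class_weights` and `change_weights`, and the dictionary output format are hypothetical, and in a real model the queries and weights would be learned.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def object_query_reasoning(queries, feats, class_weights, change_weights):
    """For each object query, compute attention similarity to the change
    truth features, pool them, and predict category and change probability."""
    scale = math.sqrt(len(feats[0]))
    priors = []
    for q in queries:
        # Similarity between this query and every feature vector.
        sims = [sum(a * b for a, b in zip(q, f)) / scale for f in feats]
        attn = softmax(sims)
        pooled = [sum(w * f[d] for w, f in zip(attn, feats))
                  for d in range(len(q))]
        # Linear heads (assumed): one row of weights per land cover category,
        # plus a single weight vector for the change probability.
        class_probs = softmax([sum(p * w for p, w in zip(pooled, row))
                               for row in class_weights])
        change_logit = sum(p * w for p, w in zip(pooled, change_weights))
        change_prob = 1.0 / (1.0 + math.exp(-change_logit))
        priors.append({"feature": pooled,
                       "class_probs": class_probs,
                       "change_prob": change_prob})
    return priors
```

During training, the predicted class probabilities would be compared with object labels through the object classification loss; the pooled features and predictions together play the role of the object-level prior features passed to the decoder.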

[0045] This device can be integrated into the onboard computer of a drone to achieve real-time target detection; it can also be used as a ground station to process images transmitted back by the drone offline.

[0046] The method disclosed in the embodiment shown in Figure 1 can be applied to the processor 610, or implemented by the processor 610. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by integrated logic circuits in the processor or by instructions in the form of software. The processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this application can be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules can reside in a storage medium that is mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in the memory; the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method. In addition to a software implementation, the electronic device of this application does not exclude other implementations, such as logic devices or a combination of hardware and software. In other words, the execution subject of the above processing flow is not limited to logic units, but can also be hardware or logic devices.

[0047] Embodiment 3: The present invention also proposes a readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the following method steps: In step S1, the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask of the same geographic area are acquired; a three-frame short video sequence is constructed from the pre-temporal remote sensing image, the post-temporal remote sensing image, and the change mask; and the joint spatiotemporal features of the three-frame short video sequence are extracted.

[0048] In step S2, the post-temporal features are reconstructed by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and contrastive learning constraints are used to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space, so as to highlight the causal structure of the change process and obtain causal enhancement features. In step S3, the causal enhancement features are refocused in the frequency domain: after the causal enhancement features are mapped to the frequency domain, the high-frequency detail response is enhanced and the low-frequency background components are suppressed, and the processed features are transformed back to the spatial domain to obtain the fused features. The change features and the global scene features in the fused features are decoupled, then weighted and aggregated, to obtain the change truth features. In step S4, object query reasoning is performed based on the change truth features to generate object-level prior features; the change truth features and the object-level prior features are input together into the differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images.

[0049] Embodiment 3 of this application also provides a storage medium, namely a computer storage medium, specifically a computer-readable storage medium, such as a memory that stores a computer program, which can be executed by a processor to complete the steps described in the aforementioned method. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.

[0050] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks. Alternatively, if the integrated units of this application are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.

[0051] For the relevant parts of the remote sensing image change semantic generation device provided in Embodiment 2 and of the storage medium provided in Embodiment 3 of this application, please refer to the detailed description of the corresponding parts of the remote sensing image change semantic generation method provided in Embodiment 1 of this application, which will not be repeated here.

[0052] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element. Additionally, portions of the technical solutions provided in the embodiments of this application that are consistent with the implementation principles of corresponding technical solutions in the prior art have not been described in detail, to avoid excessive elaboration.

[0053] While specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art can make other modifications or variations based on the above description. It is neither necessary nor possible to exhaustively describe all embodiments here. Various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.

Claims

1. A remote sensing image change semantic generation method, characterized in that it includes the following steps: constructing a three-frame short video sequence using a pre-temporal remote sensing image, a post-temporal remote sensing image, and a change mask of the same geographic area, and extracting the joint spatiotemporal features of the three-frame short video sequence; reconstructing the post-temporal features by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and using contrastive learning constraints to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space, so as to highlight the causal structure of the change process and obtain causal enhancement features; refocusing the causal enhancement features in the frequency domain, wherein after the causal enhancement features are mapped to the frequency domain, the high-frequency detail response is enhanced and the low-frequency background components are suppressed, and the processed features are transformed back to the spatial domain to obtain the fused features; decoupling the change features and the global scene features in the fused features, then weighting and aggregating them to obtain the change truth features; performing object query reasoning based on the change truth features to generate object-level prior features; and inputting the change truth features and the object-level prior features together into the differential text decoder to output natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images.

2. The remote sensing image change semantic generation method according to claim 1, characterized in that the three-frame short video sequence is constructed as follows: the pre-temporal remote sensing image is taken as the first frame, the change mask as the second frame, and the post-temporal remote sensing image as the third frame to construct an input triplet; the three frames are stacked along the time dimension to form a video data format for input to the spatiotemporal feature encoding network.

3. The remote sensing image change semantic generation method according to claim 1, characterized in that the joint spatiotemporal features of the three-frame short video sequence are extracted as follows: the three-frame short video sequence is input into a trained spatiotemporal feature encoding network, and the joint spatiotemporal features are extracted through 3D convolution or a video Transformer structure.

4. The remote sensing image change semantic generation method according to claim 1, characterized in that the post-temporal features are reconstructed by fusing the pre-temporal features in the joint spatiotemporal features with the change features extracted from the change mask, and contrastive learning constraints are used to constrain the matching relationship between the reconstructed post-temporal features and the true post-temporal features in the feature space, so as to highlight the causal structure of the change process and obtain causal enhancement features, specifically: a reconstructive causal filtering method is applied to the joint spatiotemporal features, fusing the pre-temporal features with the change features extracted from the change mask to reconstruct the post-temporal features; the reconstructed post-temporal features are used as query samples, the true post-temporal features as positive sample keys, and the post-temporal features of other samples as negative sample keys, and a contrastive learning causal constraint loss is applied; in the feature space, the matching relationship between the reconstructed post-temporal features and the true post-temporal features is constrained and the distinguishability from the post-temporal features of other samples is enhanced, and the causal enhancement features are output.

5. The remote sensing image change semantic generation method according to claim 1, characterized in that the change features and the global scene features in the fused features are decoupled and weighted to obtain the change truth features, specifically: the causal enhancement features are mapped to the frequency domain through a discrete Fourier transform, and learnable channel masks and spatial masks are applied to adaptively enhance high-frequency detail components and suppress low-frequency background components; the fused features are then obtained by returning to the spatial domain through an inverse Fourier transform; the fused features are decoupled and weighted, the change truth components representing local changes are extracted while the global contextual components reflecting the overall layout are retained, and the change truth features are output.

6. The remote sensing image change semantic generation method according to claim 1, characterized in that object query reasoning is performed based on the change truth features to generate object-level prior features, specifically: the change truth features interact with a set of learnable object query embeddings, and an attention mechanism is used to calculate the similarity between each object query embedding and the change truth features in order to mine potential ground objects in the image; based on the similarity, the category of each object and the probability that it has changed are predicted, and object-level prior features are constructed to characterize specific land cover categories and their change states; the object-level prior features are output, and an object classification loss is applied for supervision.

7. The remote sensing image change semantic generation method according to claim 1, characterized in that the change truth features and the object-level prior features are input together into the differential text decoder, which outputs natural language text describing the changes in ground features between the pre-temporal and post-temporal remote sensing images, specifically: the change truth features and the object-level prior features are jointly input into the differential text decoder; based on the Transformer architecture, the decoder first models the context of the already generated word sequence through a self-attention mechanism; it then uses cross-attention mechanisms to attend to the change truth features and the object-level prior features respectively, fusing visual change information with object semantic information to predict the next word, word by word; finally, a sentence describing the change is generated, and a text generation loss is applied for supervision.

8. The remote sensing image change semantic generation method according to any one of claims 1, 4, 6 and 7, characterized in that the method further includes: constructing a multi-task loss function by combining the text generation loss, the causal constraint loss, and the object classification loss; specifically, the multi-task loss function is the sum of the text generation loss, the causal constraint loss multiplied by its weighting coefficient, and the object classification loss multiplied by its weighting coefficient.

9. The remote sensing image change semantic generation method according to claim 2, characterized in that the change mask is generated by processing the pre-temporal remote sensing image and the post-temporal remote sensing image with a change detection network; the change detection network identifies the pixel regions that have changed between the pre-temporal and post-temporal remote sensing images and outputs the change mask in the form of a binary image or a probability map.

10. A remote sensing image change semantic generation device, comprising at least one processor and a memory, wherein the memory stores a computer program, characterized in that, When the computer program is executed by the at least one processor, it implements a remote sensing image change semantic generation method as described in any one of claims 1 to 9.