A counterfactual causality enhanced remote sensing multi-modal image description generation method
By using counterfactual contrast training and causal consistency constraints, non-causal imaging perturbations are explicitly modeled, and the remote sensing multimodal large model is optimized. This solves the semantic instability problem of remote sensing image description models under changing imaging conditions, and improves the reliability and cross-domain adaptability of the description.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF MINING & TECH
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
AI Technical Summary
Existing remote sensing image description models are sensitive to changes in imaging conditions, generalize poorly across domains, produce unstable paragraph-style descriptions, and struggle to maintain semantic consistency in complex imaging environments.
By introducing counterfactual contrast training and causal consistency constraints, we can explicitly model and intervene in non-causal imaging perturbations such as atmospheric scattering and radiation changes, construct sample pairs of counterfactual images and original images, and optimize the remote sensing multimodal large model to maintain semantic consistency.
It improves the reliability and transferability of remote sensing multimodal image description under complex imaging conditions, and is suitable for land cover surveys, urban monitoring, disaster emergency interpretation, and automated remote sensing report generation.
Figure CN122024239B_ABST
Description
Technical Field
[0001] This invention relates to the fields of remote sensing image understanding and natural language generation, specifically to a method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement.
Background Technology
[0002] With the rapid development of remote sensing observation platforms, sensor systems, and multimodal large-scale model technologies, remote sensing image understanding and natural language generation are demonstrating increasingly important application value in scenarios such as land cover monitoring, disaster emergency assessment, refined urban governance, and resource surveys. Optical remote sensing images can intuitively reflect the texture, shape, and spatial structure information of ground features, facilitating human interpretation. Simultaneously, multi-source auxiliary information (such as task instructions and prior knowledge text) can provide semantic constraints for description generation, making the model more aligned with application needs in terms of "objects of interest, descriptive granularity, and expression style." Therefore, multimodal image description technology for single optical remote sensing images has become an important direction for promoting intelligent remote sensing interpretation and automated report generation.
[0003] Traditional remote sensing image description methods often rely on manually designed feature extraction and templated language generation strategies, which struggle to fully depict the multi-scale structure and spatial relationships of ground features in complex scenes. Furthermore, the generated text is prone to problems such as limited expression, incomplete coverage, or semantic bias. In recent years, the development of deep learning and visual language models has provided new solutions for remote sensing image description: generative models based on encoder-decoder structures can automatically learn image representations and generate natural language descriptions. Combining attention mechanisms and cross-modal interaction modules can, to some extent, improve the ability to focus on key areas and the relevance of the descriptive content. However, remote sensing images exhibit significant appearance differences across different regions, seasons, and imaging conditions. Non-causal imaging perturbations such as atmospheric scattering, radiation and illumination variations, resolution degradation and blurring, compression and resampling artifacts, sensor noise, and striping can alter the image appearance but should not determine the scene's semantics. This makes it easy for models to learn "pseudo-relevant features" or "shortcut cues" related to the target description during training, leading to unstable descriptions, detail drift, and even semantic expressions inconsistent with the real scene when applied across domains.
[0004] Furthermore, paragraph-based descriptions typically need to satisfy requirements such as inter-sentence structural organization, complete information coverage, and semantic consistency at the same time, making them more susceptible to imaging perturbations and data distribution shifts than single-sentence descriptions. Existing methods often employ conventional data augmentation or simple consistency regularization to improve robustness, but these methods often fail to guarantee strict consistency between augmented and original samples at the target semantic level. This is especially problematic when the semantics of the target region are easily damaged by the augmentation, because the augmented samples then introduce noisy supervision, weakening the model's stability and transferability. Therefore, there is an urgent need for a training method that can explicitly model and intervene in non-causal imaging perturbations while preserving the target semantics, enabling the model to learn a causally robust representation in which "the semantics of the description remain stable under changing imaging conditions." This would improve the reliability and generalization ability of remote sensing multimodal image descriptions in complex imaging environments and across different domains.
Summary of the Invention
[0005] The purpose of this invention is to provide a counterfactual causal enhancement method for generating remote sensing multimodal image descriptions. This method introduces counterfactual contrastive training and causal consistency constraints to explicitly model and intervene in non-causal imaging perturbations such as atmospheric scattering, radiation and illumination variations, resolution degradation and blurring, compression and resampling artifacts, sensor noise, and striping, while keeping the target semantics of the image unchanged. This addresses the problems that existing remote sensing image description models are sensitive to changes in imaging conditions, generalize insufficiently across domains, and produce unstable paragraph-style descriptions, thereby improving the reliability, consistency, and transferability of multimodal description generation. The method can achieve robust semantic understanding and paragraph-style natural language generation for single optical remote sensing images in complex imaging environments, and is applicable to land cover surveys, urban and ecological monitoring, disaster emergency interpretation, and automated remote sensing report generation.
To achieve the above objectives, this invention provides a counterfactual causal enhancement method for generating remote sensing multimodal image descriptions, comprising the following steps:
[0006] S1. Obtain training samples, which include a single optical remote sensing image and the corresponding target description text, and obtain the instruction text corresponding to the optical remote sensing image. The instruction text is used to limit the content range or expression form of the output description.
[0007] S2. Set a set of non-causal imaging perturbation types, and select the non-causal imaging perturbation type and its perturbation degree for the training samples;
[0008] S3. Based on the segmentation model, the optical remote sensing image is segmented into target regions to obtain a target region mask. Morphological processing is then performed on the target region mask to generate a target region protection mask, thereby determining the background region. Under the constraint of the target region protection mask, the non-causal imaging perturbation is applied to the background region to construct a counterfactual image. This ensures that the target region content of the counterfactual image remains the same as the target region content of the optical remote sensing image, while the background region experiences different imaging conditions due to the applied perturbation.
[0009] S4. Under the same instruction text conditions, input the optical remote sensing image and the counterfactual image into the remote sensing multimodal large model to generate corresponding description results;
[0010] S5. Based on the target description text, supervise the description generation process and introduce counterfactual consistency constraints to ensure that the description results of the optical remote sensing image and the counterfactual image are semantically consistent, and jointly optimize the remote sensing multimodal large model.
[0011] S6. Using the optimized remote sensing multimodal large model, generate output description text for the single optical remote sensing image to be described under the instruction text conditions.
[0012] Furthermore, the steps in S1 for obtaining training samples and instruction text are as follows:
[0013] S1.1 Construct a training sample set and organize the training data into "image - target description - instruction text" triples to ensure the consistency and traceability of the supervision signal and conditional input during subsequent training. Then verify the integrity of each training sample, ensuring that it contains at least a single optical remote sensing image, the corresponding target description text, and the corresponding instruction text, thereby providing a unified data interface for counterfactual sample construction and consistency constraint calculation.
[0014] $$\mathcal{D} = \{(I_i, T_i, C_i)\}_{i=1}^{N}$$
[0015] where $\mathcal{D}$ denotes the training sample set, $(I_i, T_i, C_i)$ denotes the $i$-th training sample triplet, $I_i$ denotes the single optical remote sensing image input, $T_i$ denotes the target description text corresponding to $I_i$, $C_i$ denotes the instruction text corresponding to $I_i$, $N$ denotes the number of training samples, and $i$ is the sample index.
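The triple organization and integrity check of S1.1 can be illustrated with a minimal sketch. The `Sample` class, its field names, and the list-of-lists image type are hypothetical choices made for this illustration; the source does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Sample:
    """One training triple: image, target description text, instruction text."""
    image: Sequence[Sequence[float]]  # hypothetical: 2-D grid of pixel intensities
    target_text: str
    instruction: str


def is_complete(sample: Sample) -> bool:
    """Integrity check: all three fields must be present and non-empty."""
    has_image = len(sample.image) > 0 and all(len(row) > 0 for row in sample.image)
    has_target = bool(sample.target_text.strip())
    has_instruction = bool(sample.instruction.strip())
    return has_image and has_target and has_instruction


def build_training_set(candidates: List[Sample]) -> List[Sample]:
    """Keep only the samples that pass the integrity check."""
    return [s for s in candidates if is_complete(s)]


good = Sample([[0.1, 0.2], [0.3, 0.4]], "A river crosses farmland.", "Describe land cover.")
bad = Sample([[0.1, 0.2]], "", "Describe land cover.")  # missing target description
dataset = build_training_set([good, bad])
```

Rejecting incomplete triples before training keeps the later counterfactual construction from pairing an image with a missing description or instruction.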
[0016] S1.2. The target description text is represented in paragraph structure so that the target description can explicitly depict the hierarchical structure of "paragraphs are composed of multiple sentences and sentences are composed of word sequences". This allows for the supervision of the overall paragraph semantics during subsequent training, as well as the consistency constraints on the organization between sentences and the expression within sentences when needed, thereby improving the stability and controllability of paragraph generation.
[0017] $$T = \{s_1, s_2, \dots, s_K\}$$
[0018] $$s_k = (w_{k,1}, w_{k,2}, \dots, w_{k,L_k})$$
[0019] where $K$ denotes the number of sentences in the paragraph, $s_k$ denotes the $k$-th sentence, $w_{k,j}$ denotes the $j$-th token in the $k$-th sentence, and $L_k$ denotes the number of tokens in the $k$-th sentence. This representation is used to uniformly model paragraph-level supervision signals during the training phase and to provide standardized text objects for subsequent semantic consistency calculations.
[0020] S1.3 Obtain and standardize the instruction text, represent the instruction text as a word sequence and perform format unification and field standardization processing. The standardization processing is used to eliminate instruction noise and unify the instruction expression form, so that the same task constraint has consistent conditional semantics in different samples, thereby ensuring that the model can generate comparable paragraph description results under the same conditions when the original image and counterfactual image are input.
[0021] $$C = (c_1, c_2, \dots, c_{L_C})$$
[0022] $$\tilde{C} = \mathcal{N}(C)$$
[0023] where $c_m$ denotes the $m$-th token in the instruction text, $L_C$ denotes the length of the instruction text in tokens, $\mathcal{N}(\cdot)$ denotes the instruction normalization operator, $C$ denotes the original instruction text, and $\tilde{C}$ denotes the normalized instruction text. The normalization operator can be used to apply instruction templates, remove redundant information, and unify the expression of key constraint fields.
[0024] S1.4. Structured constraint information is obtained from the parsed normalized instruction text and used as a conditional control quantity in the paragraph description generation process to achieve unified control over sentence structure, paragraph organization, focus, and expression style. At the same time, the image, normalized instruction, and structured constraint are combined into a standardized input unit so that the counterfactual sample construction, dual-path description generation, and consistency constraint calculation in the subsequent training stage are all based on the same symbol system.
[0025] $$\mathbf{r} = \mathcal{A}(\tilde{C}) = [\,r^{\mathrm{len}},\; r^{\mathrm{struct}},\; r^{\mathrm{obj}},\; r^{\mathrm{style}}\,]$$
[0026] where $\mathcal{A}(\cdot)$ denotes the instruction parsing operator and $\mathbf{r}$ denotes the structured constraint vector, in which $r^{\mathrm{len}}$ denotes sentence count or length constraints, $r^{\mathrm{struct}}$ denotes paragraph structure constraint information, $r^{\mathrm{obj}}$ denotes the constraint information on the objects of interest, and $r^{\mathrm{style}}$ denotes the expression style constraint information. The structured constraint vector is used to consistently control the output form and content focus of paragraph descriptions during the training and inference phases. Then the single optical remote sensing image $I$, the normalized instruction text $\tilde{C}$, and the structured constraint vector $\mathbf{r}$ are combined into a standardized input unit:
[0027] $$x = (I, \tilde{C}, \mathbf{r})$$
[0028] where $x$ denotes the standardized input unit, which consists of the single optical remote sensing image $I$, the normalized instruction text $\tilde{C}$, and the structured constraint vector $\mathbf{r}$.
[0029] Because the training dataset uses triples $(I_i, T_i, C_i)$, in which $T_i$ is the target description text corresponding to $I_i$, each triple can be mapped through instruction normalization and parsing to an equivalent supervised training unit $(x_i, T_i)$. This allows subsequent counterfactual sample construction, dual-path paragraph description generation, and consistency constraint calculation to be completed under a unified input / output interface.
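As a rough illustration of S1.3 and S1.4, the sketch below normalizes an instruction string and parses it into four constraint fields. The `key=value; key=value` instruction format, the field names, and the default values are assumptions made only for this example; the source does not specify how instructions are written or parsed.

```python
import re
from typing import Dict, Tuple

# Hypothetical constraint fields and defaults (length, structure, focus, style).
DEFAULTS: Dict[str, str] = {
    "sentences": "3",
    "structure": "overview-then-detail",
    "focus": "all",
    "style": "neutral",
}


def normalize_instruction(raw: str) -> str:
    """Unify case, whitespace, and separators so equivalent instructions match."""
    text = re.sub(r"\s+", " ", raw).strip().lower()
    text = re.sub(r"\s*=\s*", "=", text)
    text = re.sub(r"\s*;\s*", "; ", text)
    return text.strip("; ")


def parse_constraints(normalized: str) -> Dict[str, str]:
    """Parse known key=value fields; unknown keys are ignored, missing ones keep defaults."""
    constraints = dict(DEFAULTS)
    for field in normalized.split(";"):
        if "=" in field:
            key, value = (part.strip() for part in field.split("=", 1))
            if key in constraints:
                constraints[key] = value
    return constraints


def make_input_unit(image, raw_instruction: str) -> Tuple[object, str, Dict[str, str]]:
    """Bundle image, normalized instruction, and constraints into one input unit."""
    normalized = normalize_instruction(raw_instruction)
    return image, normalized, parse_constraints(normalized)


a = normalize_instruction("  Sentences = 4 ;  FOCUS=buildings;; ")
b = normalize_instruction("sentences=4;focus=buildings")
unit = make_input_unit([[0.0]], "Sentences = 4 ; FOCUS=buildings")
```

Two differently formatted instructions with the same meaning normalize to the same string here, which is the property the original and counterfactual branches rely on when they share an instruction.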
[0030] Furthermore, the step in S2 of setting a set of non-causal imaging perturbation types and selecting perturbation types and their degrees for training samples is as follows:
[0031] S2.1 Construct a set of non-causal imaging perturbation types to characterize imaging perturbation factors that are unrelated to the target semantics but cause changes in the appearance of the image, and bind each type of perturbation to an executable perturbation operator family to ensure that the corresponding perturbation can be called with a unified interface in the subsequent counterfactual image construction; secondly, standardize the perturbation type set so that it can be selected and recorded in a consistent identification manner in different training batches and different samples, thereby supporting subsequent robust training and traceable analysis.
[0032] $$\mathcal{E} = \{e_1, e_2, \dots, e_Q\}$$
[0033] $$\mathcal{F} = \{\mathcal{F}_{e_q}\}_{q=1}^{Q}$$
[0034] where $\mathcal{E}$ denotes the set of non-causal imaging perturbation types, $e_q$ denotes the $q$-th perturbation type, $Q$ denotes the number of perturbation types, and $q$ is the perturbation type index; $\mathcal{F}$ denotes the set of perturbation operators, and $\mathcal{F}_{e_q}$ denotes the perturbation operator family or perturbation generation rule corresponding to perturbation type $e_q$, which is used to apply the selected perturbation to a specified area of the image.
[0035] S2.2 For each training sample, select at least one perturbation type from the set of non-causal imaging perturbation types according to the sample content and instruction conditions, and determine the perturbation degree corresponding to the perturbation type; wherein the perturbation degree is used to control the magnitude of the perturbation's influence on the background appearance, thereby forming a counterfactual sample distribution that covers different imaging condition variations; secondly, random sampling, strategy sampling, or adaptive sampling based on historical sensitivity can be used to improve the diversity and targeting of perturbation selection.
[0036] $$e_i \sim \pi_e(e \mid I_i, C_i)$$
[0037] $$\alpha_i \sim \pi_\alpha(\alpha \mid e_i, I_i, C_i)$$
[0038] where $e_i$ denotes the perturbation type selected for the $i$-th sample and $\alpha_i$ denotes the perturbation degree control quantity corresponding to that perturbation type; $\pi_e(e \mid I_i, C_i)$ denotes the perturbation type selection distribution given the image $I_i$ and the instruction text $C_i$, and $\pi_\alpha(\alpha \mid e_i, I_i, C_i)$ denotes the perturbation degree selection distribution given the perturbation type and the sample conditions. These distributions can be implemented by preset rules, random mechanisms, or learnable strategies to ensure that perturbation selection is controllable and scalable.
[0039] S2.3. Based on the selected perturbation type and degree, the corresponding perturbation operator family is called to generate perturbation operator instances for subsequent counterfactual construction. The perturbation type, perturbation operator instances and their control variables are used as control conditions for counterfactual image generation so that the perturbation is consistently applied to the background region under the target region protection constraint. Secondly, to ensure the stability and reproducibility of the training process, the perturbation type identifier and control variable corresponding to each sample can be recorded and replayed or verified for consistency when needed.
[0040] $$P_i(I) = \mathcal{F}_{e_i}(I;\, \alpha_i), \qquad P_i \in \mathcal{P}$$
[0041] where $P_i$ denotes the perturbation operator instance corresponding to the $i$-th sample; $P_i(I)$ denotes the perturbation result obtained by applying the perturbation operator instance to the input image $I$; $\mathcal{F}_{e_i}(I;\, \alpha_i)$ denotes the perturbation result obtained when the perturbation operator family of type $e_i$ acts on the input image $I$ under the control of the perturbation degree $\alpha_i$, satisfying $P_i \in \mathcal{P}$; and $\mathcal{P}$ denotes the set of executable perturbation operator instances. The perturbation operator instance is used to apply non-causal imaging perturbations to the background region in subsequent steps to construct counterfactual images.
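Steps S2.1 to S2.3 can be sketched as a registry of perturbation operator families, a sampler for type and degree, and a factory that binds them into an executable instance. The three toy operators below (haze, illumination, sensor noise), their formulas, and the uniform sampling ranges are illustrative assumptions, not the operators used in the source.

```python
import random
from typing import Callable, Dict, List, Tuple

Image = List[List[float]]  # hypothetical: grayscale intensities in [0, 1]


def _clip(value: float) -> float:
    return min(1.0, max(0.0, value))


def haze(img: Image, degree: float) -> Image:
    """Blend every pixel toward a bright 'airlight' value (crude scattering model)."""
    airlight = 0.9
    return [[_clip((1.0 - degree) * p + degree * airlight) for p in row] for row in img]


def illumination(img: Image, degree: float) -> Image:
    """Darken the image with a gain that shrinks as the degree grows."""
    gain = 1.0 - 0.5 * degree
    return [[_clip(gain * p) for p in row] for row in img]


def sensor_noise(img: Image, degree: float, seed: int = 0) -> Image:
    """Add zero-mean Gaussian noise; the fixed seed keeps replays reproducible."""
    rng = random.Random(seed)
    return [[_clip(p + rng.gauss(0.0, 0.1 * degree)) for p in row] for row in img]


# Perturbation type identifier -> operator family (S2.1).
OPERATOR_FAMILIES: Dict[str, Callable[..., Image]] = {
    "haze": haze,
    "illumination": illumination,
    "noise": sensor_noise,
}


def sample_perturbation(rng: random.Random) -> Tuple[str, float]:
    """Randomly pick a perturbation type and a degree in [0.1, 0.9] (S2.2)."""
    kind = rng.choice(sorted(OPERATOR_FAMILIES))
    degree = rng.uniform(0.1, 0.9)
    return kind, degree


def make_operator(kind: str, degree: float) -> Callable[[Image], Image]:
    """Bind a family to a fixed degree, giving an executable instance (S2.3)."""
    family = OPERATOR_FAMILIES[kind]
    return lambda img: family(img, degree)


kind, degree = sample_perturbation(random.Random(42))
record = {"type": kind, "degree": degree}  # logged so the perturbation can be replayed
hazy = make_operator("haze", 0.5)([[0.1]])
```

Storing the `(type, degree)` record per sample is what makes the replay and consistency verification mentioned in S2.3 possible.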
[0042] Furthermore, the step of constructing the counterfactual image under the constraint of the target region protective mask in S3 is as follows:
[0043] S3.1. Based on the target area protection mask, the optical remote sensing image is divided into regions to clarify the spatial range of the target area and the background area. The target area is regarded as the semantically preserved protected area, and the background area is regarded as the area where non-causal imaging perturbation is applied. The purpose of the region division is to ensure that the counterfactual construction only changes the imaging appearance factors that are unrelated to the target semantics, so that the counterfactual sample and the original sample are consistent at the target semantic level but have controllable differences at the imaging condition level.
[0044] $$M_i^{\mathrm{bg}} = \mathbf{1} - M_i^{\mathrm{p}}, \qquad I_i = M_i^{\mathrm{p}} \odot I_i + M_i^{\mathrm{bg}} \odot I_i$$
[0045] where $I_i$ denotes the single optical remote sensing image of the $i$-th sample, $M_i^{\mathrm{p}}$ denotes the target region protection mask, $M_i^{\mathrm{bg}}$ denotes the background region mask, $\mathbf{1}$ denotes an all-ones matrix of the same size as the image, and $\odot$ denotes element-wise multiplication. The above expression divides the image into a target region and a background region according to the mask. The background region mask and the target region protection mask are complementary, providing the constraint basis for applying perturbation only to the background region in subsequent steps.
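A minimal sketch of deriving the protection mask and its complementary background mask follows. It assumes that the "morphological processing" of the target mask is a binary dilation with a square window, which adds a safety margin around the target; the source does not name the specific morphological operation, and the function names are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # 1 = inside region, 0 = outside


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Binary dilation with a (2*radius+1) square window (assumed operation)."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def complement(mask: Mask) -> Mask:
    """Background mask = all-ones matrix minus the protection mask."""
    return [[1 - m for m in row] for row in mask]


# A single target pixel in the centre of a 5x5 image.
target_mask = [[1 if (y, x) == (2, 2) else 0 for x in range(5)] for y in range(5)]
protect_mask = dilate(target_mask, radius=1)  # grows to a 3x3 block
background_mask = complement(protect_mask)
```

Dilating before taking the complement keeps perturbations away from the target boundary, where segmentation errors are most likely.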
[0046] S3.2. Call the perturbation operator instance corresponding to the selected non-causal imaging perturbation type, and apply the non-causal imaging perturbation to the background region under the constraint of the perturbation degree control amount to obtain the image content after background perturbation; wherein the operation of applying the non-causal imaging perturbation only acts on the pixel position covered by the background region mask to avoid damaging the key structure and semantic clues of the target region, and to enable the perturbation effect to be consistently reused to construct comparable counterfactual sample pairs.
[0047] $$\hat{I}_i = P_i(I_i) = \mathcal{F}_{e_i}(I_i;\, \alpha_i)$$
[0048] where $\hat{I}_i$ denotes the background perturbation result obtained for the $i$-th sample after applying the non-causal imaging perturbation, $P_i$ denotes the executable perturbation operator instance corresponding to the $i$-th sample, and $\mathcal{F}_{e_i}(I_i;\, \alpha_i)$ indicates that the instance is formed by the perturbation operator family of type $e_i$ transforming the input image $I_i$ under the control of the perturbation degree $\alpha_i$. The background perturbation result is fused with the target region content in subsequent steps to construct a counterfactual image.
[0049] S3.3. While keeping the target region unchanged, the original image content of the target region is fused with the perturbed image content of the background region to obtain a counterfactual image. The fusion follows the combination rule of "target region preservation and background region perturbation" at the pixel or feature level, ensuring that the counterfactual image remains consistent with the original in target semantics while explicitly introducing imaging condition changes to support counterfactual contrast training and causally enhanced learning.
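[0050] $$I_i^{\mathrm{cf}} = M_i^{\mathrm{p}} \odot I_i + (\mathbf{1} - M_i^{\mathrm{p}}) \odot \hat{I}_i$$
[0051] where $I_i^{\mathrm{cf}}$ denotes the counterfactual image corresponding to the $i$-th sample, $\hat{I}_i$ denotes the image content after background perturbation, $M_i^{\mathrm{p}}$ denotes the target region protection mask, $\mathbf{1}$ denotes an all-ones matrix, and $\odot$ denotes element-wise multiplication. The counterfactual image $I_i^{\mathrm{cf}}$ obtained through this fusion is identical to the original image $I_i$ in the target region, while its appearance in the background region varies with the type and degree of perturbation, thus forming counterfactual sample pairs that can be used for consistency training.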
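The "target region preservation and background region perturbation" rule can be sketched at the pixel level as a masked blend. The nested-list image type and the function name `fuse` are assumptions of this illustration; feature-level fusion, which the text also allows, is not shown.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[int]]


def fuse(original: Image, perturbed: Image, protect_mask: Mask) -> Image:
    """Keep original pixels where the mask is 1; take perturbed pixels where it is 0."""
    return [
        [m * o + (1 - m) * p for o, p, m in zip(orow, prow, mrow)]
        for orow, prow, mrow in zip(original, perturbed, protect_mask)
    ]


original = [[0.2, 0.4], [0.6, 0.8]]
perturbed = [[0.9, 0.9], [0.9, 0.9]]  # stand-in for a perturbed copy of the image
protect_mask = [[1, 0], [0, 0]]       # only the top-left pixel is protected
counterfactual = fuse(original, perturbed, protect_mask)
```

In this toy case the protected pixel keeps its original value and every background pixel takes the perturbed value, so the pair differs only in the background.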
[0052] S3.4. The counterfactual image and the original image are paired to form a sample pair and a correspondence is established under the same instruction condition. In the subsequent description generation and consistency constraint calculation, the condition input of the original branch and the counterfactual branch is consistent, so that the model training can be carried out around the goal of "the description semantics remain unchanged under non-causal perturbations". At the same time, the perturbation type identifier and control quantity corresponding to each sample pair can be recorded for training process playback and traceability analysis.
[0053] $$\mathcal{D}^{\mathrm{cf}} = \{(I_i,\; I_i^{\mathrm{cf}},\; C_i)\}_{i=1}^{N}$$
[0054] where $\mathcal{D}^{\mathrm{cf}}$ denotes the set of counterfactual sample pairs and $(I_i, I_i^{\mathrm{cf}}, C_i)$ denotes the pair formed by the original image and the counterfactual image under the same instruction text $C_i$. The set of sample pairs is used for dual-path description generation and counterfactual consistency constraint calculation in subsequent steps.
[0055] Furthermore, the step in S4 of generating paragraph-style descriptions for the original image and the counterfactual image under the same instruction text conditions is as follows:
[0056] S4.1 Generate structural planning information for paragraph descriptions under the constraints of the instruction text. The structural planning information is used to explicitly represent the organization method and content focus between paragraph sentences, so that the subsequent generation process can maintain consistency in terms of "sentence structure, sentence connection, focus and expression style", thereby ensuring that the outputs of the original branch and the counterfactual branch are comparable and providing a stable semantic alignment basis for counterfactual consistency constraints.
[0057] $$\rho_i = \Pi_{\theta}(I_i, C_i)$$
[0058] where $\rho_i$ denotes the paragraph structure planning information for the $i$-th sample, $\Pi_{\theta}$ denotes the planning generation operator determined by the parameters $\theta$ of the remote sensing multimodal large model, $I_i$ denotes the single optical remote sensing image, and $C_i$ denotes the instruction text. The planning information can be used to constrain the sentence organization, content coverage, or expression form of paragraphs, improving the structural stability and conditional consistency of paragraph generation.
[0059] S4.2. Multimodal representation extraction and fusion of images and instruction text are performed to map image content and instruction constraints to a unified conditional semantic space, enabling the generator to utilize both image semantic cues and instruction constraints during the decoding stage. The fused conditional representation serves as the conditional input for paragraph-style description generation, ensuring the controllability and consistency of the generated content under the same instruction conditions.
[0060] $$v_i = E_v(I_i), \qquad u_i = E_c(C_i), \qquad h_i = \Psi(v_i, u_i, \rho_i)$$
[0061] where $E_v$ denotes the visual encoding operator, $E_c$ denotes the instruction text encoding operator, $v_i$ denotes the image representation, $u_i$ denotes the instruction representation, $\Psi$ denotes the cross-modal fusion operator, $h_i$ denotes the fused conditional semantic representation, and $\rho_i$ denotes the paragraph structure planning information. The fused representation is used to constrain both the "generated content" and the "paragraph organization method" during the decoding stage, thereby improving the stability of paragraph description output.
[0062] S4.3 Under the constraints of the fusion condition representation and paragraph structure planning information, paragraph-style descriptive text is output using a sequence generation method. The paragraph-style descriptive text consists of multiple sentences and can be further represented as a hierarchical structure of "sentence sequence - word sequence" so as to uniformly supervise paragraph-level semantics and inter-sentence structure during training.
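[0063] $$\hat{T}_i = \mathrm{Dec}_{\theta}(h_i, \rho_i)$$
[0064] $$\hat{T}_i = \{\hat{s}_{i,1}, \hat{s}_{i,2}, \dots, \hat{s}_{i,\hat{K}_i}\}$$
[0065] $$\hat{s}_{i,k} = (\hat{w}_{i,k,1}, \hat{w}_{i,k,2}, \dots, \hat{w}_{i,k,\hat{L}_{i,k}})$$
[0066] where $\mathrm{Dec}_{\theta}$ denotes the text decoding generation operator, $\hat{T}_i$ denotes the paragraph-style description generated for the $i$-th sample, $\hat{K}_i$ denotes the number of sentences in the generated paragraph, $\hat{s}_{i,k}$ denotes the $k$-th sentence, $\hat{w}_{i,k,j}$ denotes the $j$-th token within that sentence, and $\hat{L}_{i,k}$ denotes the number of tokens in the $k$-th sentence. The hierarchy characterizes the inter-sentence organization and intra-sentence expression of the paragraph output, so that subsequent consistency constraints can be stably aligned at the paragraph semantic level.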
[0067] S4.4. Under the same instruction text conditions, perform the above paragraph generation process on the original image and the counterfactual image respectively to obtain the original branch description result and the counterfactual branch description result, and establish a one-to-one correspondence between the two under the same condition constraint, so as to be used for subsequent calculation and joint optimization of counterfactual consistency constraints.
[0068] $$\hat{T}_i = G_{\theta}(I_i, C_i)$$
[0069] $$\hat{T}_i^{\mathrm{cf}} = G_{\theta}(I_i^{\mathrm{cf}}, C_i)$$
[0070] where $G_{\theta}$ denotes the conditional paragraph generation mapping of the remote sensing multimodal large model, $\hat{T}_i$ denotes the paragraph description result for the original image $I_i$ under the instruction text $C_i$, and $\hat{T}_i^{\mathrm{cf}}$ denotes the paragraph description result for the counterfactual image $I_i^{\mathrm{cf}}$ under the same instruction text. Because both description results are generated under the same instruction, the difference between them comes mainly from the change in imaging conditions rather than from a difference in instructions, which provides reliable input for subsequent counterfactual consistency training.
[0071] Furthermore, the steps in S5, which involve supervising the paragraph description generation process based on the target description text and introducing counterfactual consistency constraints to jointly optimize the remote sensing multimodal large model, are as follows:
[0072] S5.1 Construct description generation supervision constraints based on the generation results of the target description text and the original image branch, so that the model can generate description text that is consistent with the target semantics and has a reasonable paragraph structure under the instruction text condition; wherein the supervision constraints take the word sequence of the paragraph text as the supervision object, and constrain the conditional probability of each step of generation under the autoregressive generation framework to ensure that the generated content is aligned with the target description at the semantic level.
[0073] $$\mathcal{L}_{\mathrm{gen}}(\theta) = -\,\mathbb{E}_{(I,\,T,\,C) \sim \mathcal{D}}\left[\log p_{\theta}(T \mid I, C)\right]$$
[0074] where $\mathcal{L}_{\mathrm{gen}}$ denotes the description generation supervised loss, $\theta$ denotes the trainable parameters of the remote sensing multimodal large model, $\mathcal{D}$ denotes the training sample set, $p_{\theta}(T \mid I, C)$ denotes the conditional probability that the model generates the target paragraph description $T$ given the input image $I$ and the instruction text $C$, and $\mathbb{E}$ denotes the mathematical expectation over the training sample set. The supervised loss constrains the consistency between the model output and the target description text and provides a stable generation baseline for the subsequent introduction of counterfactual consistency.
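Under an autoregressive decoder, the log-probability of a target paragraph is the sum of the per-step log-probabilities of its tokens, so the supervised loss reduces to an averaged negative log-likelihood. The sketch below assumes the model has already produced, for each step, the probability it assigns to the correct target token; the function names and input format are hypothetical.

```python
import math
from typing import List


def sequence_nll(step_probs: List[float]) -> float:
    """Negative log-likelihood of one target sequence.

    step_probs[t] is the probability the model assigns to the correct
    target token at step t, given the image, instruction, and earlier tokens.
    """
    return -sum(math.log(p) for p in step_probs)


def generation_loss(batch: List[List[float]]) -> float:
    """Average the per-sequence NLL over a batch (empirical expectation)."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)


confident = [0.9, 0.8, 0.95]  # model mostly agrees with the target tokens
uncertain = [0.3, 0.2, 0.4]   # model assigns low probability to the target
loss = generation_loss([confident, uncertain])
```

A sequence predicted with higher per-token probability contributes a smaller loss, which is what pushes the original-image branch toward the target description.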
[0075] S5.2 Construct counterfactual consistency constraints based on two description results generated from the original image and counterfactual image under the same instruction text conditions, so that the model maintains semantic invariance to non-causal imaging perturbations in a causal sense; wherein the counterfactual consistency constraints are achieved by mapping the two descriptions to a unified semantic space and constraining the consistency of their semantic embedding representation, thereby reducing the model's sensitivity to changes in imaging appearance and avoiding using non-causal perturbations as the basis for description decisions.
[0076] $$z_i = E_{\phi}(\hat{T}_i), \qquad z_i^{\mathrm{cf}} = E_{\phi}(\hat{T}_i^{\mathrm{cf}})$$
[0077] where $E_{\phi}$ denotes the semantic embedding encoder and $\phi$ denotes its parameters, $\hat{T}_i$ denotes the paragraph description result of the original image branch, $\hat{T}_i^{\mathrm{cf}}$ denotes the paragraph description result of the counterfactual image branch, and $z_i$ and $z_i^{\mathrm{cf}}$ denote the semantic embedding vectors corresponding to the two descriptions respectively. The semantic embedding encoder can be implemented by the text encoding module of the model to provide comparable semantic representations.
[0078] $$\mathcal{L}_{\mathrm{cf}}(\theta, \phi) = \mathbb{E}_{(I_i,\, I_i^{\mathrm{cf}},\, C_i) \sim \mathcal{D}^{\mathrm{cf}}}\left[\,1 - \cos\!\left(z_i,\, z_i^{\mathrm{cf}}\right)\right]$$
[0079] where $\mathcal{L}_{\mathrm{cf}}$ denotes the semantic embedding consistency constraint loss, $\mathcal{D}^{\mathrm{cf}}$ denotes the set of counterfactual sample pairs, $\cos(\cdot,\cdot)$ denotes cosine similarity, and $\mathbb{E}$ denotes the mathematical expectation over the set of counterfactual sample pairs. By penalizing inconsistencies in semantic embeddings, this loss enables the model to keep paragraph descriptions semantically stable even when background imaging conditions change, thereby achieving the training objective of counterfactual causal enhancement.
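A small sketch of a cosine-based consistency penalty follows. It assumes the loss takes the common form "one minus cosine similarity", averaged over pairs; the text names cosine similarity and a penalty on inconsistent embeddings, but the exact functional form used here is an assumption, as are the function names and the toy embedding vectors.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def consistency_loss(pairs: List[Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Mean of (1 - cosine) over (original, counterfactual) embedding pairs."""
    return sum(1.0 - cosine(z, z_cf) for z, z_cf in pairs) / len(pairs)


stable = ([1.0, 0.0], [1.0, 0.0])   # same direction: descriptions agree
drifted = ([1.0, 0.0], [0.0, 1.0])  # orthogonal: descriptions diverge
loss = consistency_loss([stable, drifted])
```

The loss is zero when the two descriptions embed in the same direction and grows as the counterfactual description drifts, regardless of embedding magnitude.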
[0080] S5.3. Combine the description generation supervision constraint and the counterfactual consistency constraint into a joint objective, and update the model parameters based on it so that both "generation accuracy" and "causal robustness" are taken into account. The joint objective controls the relative contribution of the two constraints through a trade-off weight, so that the model learns a reliable image-text mapping and keeps its semantic output invariant when non-causal imaging perturbations are present.
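[0081] $$\mathcal{L}(\theta, \phi) = \mathcal{L}_{\mathrm{gen}}(\theta) + \lambda\, \mathcal{L}_{\mathrm{cf}}(\theta, \phi)$$
[0082] where $\mathcal{L}$ denotes the joint training objective, $\mathcal{L}_{\mathrm{gen}}$ denotes the description generation supervised loss, $\mathcal{L}_{\mathrm{cf}}$ denotes the counterfactual consistency loss, and $\lambda$ denotes the consistency weight coefficient. By minimizing the joint objective, the model parameters $\theta$ are updated, and the semantic embedding encoder parameters $\phi$ are optionally updated as well. This achieves the joint optimization of the remote sensing multimodal large model.
[0083] $$\theta^{*} = \arg\min_{\theta}\; \mathcal{L}(\theta, \phi)$$
[0084] where $\theta^{*}$ denotes the parameter solution after joint optimization. The parameter solution is used in the subsequent inference stage to generate paragraph-style descriptive text under the instruction text conditions and to maintain semantic consistency when imaging conditions change.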
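The effect of the trade-off weight can be seen with a toy comparison of two candidate parameter settings. The additive form "generation loss plus weight times consistency loss" is an assumption consistent with the description of a trade-off term; the loss values below are invented solely to illustrate the arithmetic and are not results.

```python
def joint_objective(gen_loss: float, cf_loss: float, weight: float) -> float:
    """Joint objective: generation loss plus weighted consistency loss."""
    if weight < 0:
        raise ValueError("consistency weight must be non-negative")
    return gen_loss + weight * cf_loss


# Two hypothetical candidates: A fits the targets slightly better,
# B is far more stable under counterfactual perturbation.
candidate_a = {"gen": 1.0, "cf": 0.8}
candidate_b = {"gen": 1.1, "cf": 0.2}

for_accuracy_only = (
    joint_objective(candidate_a["gen"], candidate_a["cf"], 0.0),
    joint_objective(candidate_b["gen"], candidate_b["cf"], 0.0),
)
with_consistency = (
    joint_objective(candidate_a["gen"], candidate_a["cf"], 0.5),
    joint_objective(candidate_b["gen"], candidate_b["cf"], 0.5),
)
```

With a weight of zero, candidate A scores lower, while with a weight of 0.5 candidate B does, which shows how the weight shifts the optimum toward perturbation-invariant descriptions.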
[0085] Furthermore, the step in S6 of generating output paragraph-style descriptive text for a single optical remote sensing image under the instruction text condition using the jointly optimized remote sensing multimodal large model is as follows:
[0086] S6.1 Obtain the single optical remote sensing image to be described and its corresponding instruction text, and perform normalization and parsing processing on the instruction text in the same way as in the training phase to obtain structured constraint information for controlling the paragraph output form; wherein the structured constraint information is used to limit the sentence structure, sentence organization, focus and expression style of the paragraph output in the inference phase, so that the model can stably output paragraph-style description results that meet the requirements in different application scenarios.
[0087]
[0088] In this formula, the quantities are the instruction text input during the inference phase, the instruction normalization operator, the normalized instruction text, the instruction parsing operator, and the structured constraint vector obtained by parsing the instruction text. The structured constraint vector controls the output form and content focus of the paragraph description and is consistent with the conditional semantics used in the training phase.
[0089] S6.2. The image to be described and the standardized instruction text and their structured constraint information are combined into a standardized input unit, and then input into the jointly optimized remote sensing multimodal large model to generate paragraph-style description text under the same condition control. The standardized input unit is used to ensure that the data interface in the inference stage is consistent with that in the training stage, so that the model can directly reuse the condition generation capability obtained from training and avoid output instability due to differences in input format.
[0090]
[0091] In this formula, the quantities are the single optical remote sensing image to be described and the standardized input unit for the inference stage, which consists of the image, the normalized instruction text, and the structured constraint vector. The standardized input unit serves as the conditional input that triggers the paragraph-style description generation process.
[0092] S6.3. Based on the standardized input unit, call the jointly optimized remote sensing multimodal large model to generate paragraph-style descriptive text and output it as the inference result; wherein the generation process outputs a multi-sentence paragraph under the constraints of the structural planning information and the multimodal fused conditional representation, so that the output text conforms to the instruction constraints in terms of inter-sentence structure and semantic content, and maintains the stability and consistency of semantic expression when imaging conditions change.
[0093]
[0094] In this formula, the quantities are the jointly optimized remote sensing multimodal large model with its optimized parameters and the paragraph-style descriptive text that it generates. This text is output as a multi-sentence structure under the constraint of the instruction text and serves as the final description result of the image to be described.
[0095] Beneficial effects: This invention acquires a single optical remote sensing image and its corresponding paragraph-style descriptive text and instruction text; constructs a set of non-causal imaging perturbation types independent of descriptive semantics, and samples perturbation types and degrees for training samples; uses a segmentation model to obtain a target region segmentation mask and performs morphological processing to generate a target region protection mask to distinguish the target region from the background region and maintain target semantic stability; under the target region protection constraint, non-causal imaging perturbations are applied only to the background region to construct counterfactual images, forming sample pairs that are consistent with the original image in target descriptive text but differ in imaging conditions; under the same instruction text conditions, the original image and counterfactual image are input into a remote sensing multimodal large model to generate paragraph descriptions. The generated content is constrained to be consistent with the target text by description generation supervision loss, and semantic embedding consistency loss is introduced to constrain the consistency of the two descriptions in semantic space, thereby suppressing the model's dependence on non-causal imaging perturbations; during the inference stage, the image to be described and the instruction text are input to output paragraph-style description results. The method of this invention can maintain the consistency of the description results in the semantic embedding space under complex imaging conditions such as atmospheric scattering, changes in radiation illumination, resolution degradation and blurring, compression resampling artifacts, and noise stripes. 
It significantly improves the reliability and transferability of paragraph-style descriptions and is applicable to application scenarios such as land cover surveys, urban and ecological monitoring, disaster emergency interpretation, and automated remote sensing report generation.
Attached Figure Description
[0096] Figure 1 is a schematic diagram of the overall process of the counterfactual causal-enhanced remote sensing multimodal image description generation method;
[0097] Figure 2 is a schematic diagram of the training data preprocessing and instruction text parsing process;
[0098] Figure 3 is a schematic diagram of the non-causal imaging perturbation sampling and instantiation process;
[0099] Figure 4 is a schematic diagram of the counterfactual image construction process under the target region protection constraint;
[0100] Figure 5 is a schematic diagram of the paragraph-style description generation structure of the remote sensing multimodal large model;
[0101] Figure 6 is a schematic diagram of the joint optimization training process with the generation supervision loss and the semantic embedding consistency loss;
[0102] Figure 7 is a flowchart of the inference stage.
Detailed Implementation
[0103] The invention will now be further described with reference to the accompanying drawings.
[0104] Example
[0105] As shown in Figure 1, a method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement includes the following steps:
[0106] S1. Obtain training samples, which include a single optical remote sensing image and the corresponding target description text, and obtain the instruction text corresponding to the optical remote sensing image. The instruction text is used to limit the content range or expression form of the output description.
[0107] S2. Set a set of non-causal imaging perturbation types, and select the non-causal imaging perturbation type and its perturbation degree for the training samples;
[0108] S3. Based on the segmentation model, the optical remote sensing image is segmented into target regions to obtain a target region mask. Morphological processing is then performed on the target region mask to generate a target region protection mask, thereby determining the background region. Under the constraint of the target region protection mask, the non-causal imaging perturbation is applied to the background region to construct a counterfactual image. This ensures that the target region content of the counterfactual image remains the same as the target region content of the optical remote sensing image, while the background region experiences different imaging conditions due to the applied perturbation.
[0109] S4. Under the same instruction text conditions, input the optical remote sensing image and the counterfactual image into the remote sensing multimodal large model to generate corresponding description results;
[0110] S5. Based on the target description text, supervise the description generation process and introduce counterfactual consistency constraints to ensure that the description results of the optical remote sensing image and the counterfactual image are semantically consistent, and jointly optimize the remote sensing multimodal large model.
[0111] S6. Using the optimized remote sensing multimodal large model, generate the output description text for the single optical remote sensing image to be described under the instruction text conditions.
[0112] Furthermore, as shown in Figure 2, the steps in S1 for obtaining training samples and instruction text are as follows:
[0113] S1.1 Construct a training sample set and organize the training data into "image - target description - instruction text" triplets to ensure the consistency and traceability of the supervision signal and conditional input during subsequent training. Then verify the integrity of each training sample, so that every sample contains at least a single optical remote sensing image, its corresponding target description text, and its corresponding instruction text, thereby providing a unified data interface for counterfactual sample construction and consistency constraint calculation.
[0114]
[0115] In this formula, the quantities are the training sample set, the training sample triplet at a given sample index, the single optical remote sensing image input, the target description text corresponding to that image, the instruction text corresponding to that image, the number of training samples, and the sample index.
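The triplet organization and integrity check of S1.1 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `TrainingSample`, `is_complete`, and `build_training_set` are hypothetical, the image is stood in for by a small nested list, and "complete" is taken to mean that all three fields are non-empty.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TrainingSample:
    """One "image - target description - instruction text" triplet (hypothetical type)."""
    image: Sequence[Sequence[float]]  # single optical image, here a small 2-D grid
    target_text: str                  # paragraph-style target description
    instruction: str                  # instruction text for this image


def is_complete(sample: TrainingSample) -> bool:
    """Integrity check: all three fields must be present and non-empty."""
    has_image = len(sample.image) > 0 and len(sample.image[0]) > 0
    has_text = bool(sample.target_text.strip())
    has_instruction = bool(sample.instruction.strip())
    return has_image and has_text and has_instruction


def build_training_set(samples: List[TrainingSample]) -> List[TrainingSample]:
    """Keep only the samples that pass the integrity check."""
    return [s for s in samples if is_complete(s)]


raw = [
    TrainingSample([[0.1, 0.2], [0.3, 0.4]], "A river crosses farmland.", "Describe in two sentences."),
    TrainingSample([[0.5, 0.6], [0.7, 0.8]], "   ", "Describe the buildings."),  # blank description
]
dataset = build_training_set(raw)
```

Filtering incomplete triplets up front means every later step can assume that an image, a target description, and an instruction are all available for each sample.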
[0116] S1.2. The target description text is represented in paragraph structure so that the target description can explicitly depict the hierarchical structure of "paragraphs are composed of multiple sentences and sentences are composed of word sequences". This allows for the supervision of the overall paragraph semantics during subsequent training, as well as the consistency constraints on the organization between sentences and the expression within sentences when needed, thereby improving the stability and controllability of paragraph generation.
[0117]
[0118]
[0119] In these formulas, the quantities are the number of sentences in a paragraph, the text of each sentence, each token within a sentence, and the token length of each sentence. This representation is used to uniformly model paragraph-level supervision signals during the training phase and to provide standardized text objects for subsequent semantic consistency calculations.
[0120] S1.3 Obtain and standardize the instruction text, represent the instruction text as a word sequence and perform format unification and field standardization processing. The standardization processing is used to eliminate instruction noise and unify the instruction expression form, so that the same task constraint has consistent conditional semantics in different samples, thereby ensuring that the model can generate comparable paragraph description results under the same conditions when the original image and counterfactual image are input.
[0121]
[0122]
[0123] In these formulas, the quantities are each token of the instruction text, the token length of the instruction text, the instruction normalization operator, the original instruction text, and the normalized instruction text. The normalization operator can be used to template instructions, reduce redundant information, and unify the expression of key constraint fields.
[0124] In one specific embodiment of the present invention, the instruction normalization operator can be implemented at low cost through regular expression matching, rule-based text cleaning (such as removing stop words and standardizing capitalization and full-width / half-width characters), or conventional natural language processing toolkits, yielding the normalized instruction text.
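A rule-based normalization operator of the kind described in this embodiment could look like the sketch below. It is illustrative only: the function name `normalize_instruction` and the stop-word list are assumptions, and Unicode NFKC normalization is used here as one way to fold full-width characters into their half-width forms.

```python
import re
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"please", "kindly", "the", "a", "an"}


def normalize_instruction(raw: str) -> str:
    """Rule-based cleaning of one instruction string."""
    text = unicodedata.normalize("NFKC", raw)   # full-width -> half-width
    text = text.lower()                          # unify capitalization
    text = re.sub(r"[^\w\s]", " ", text)         # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


# "\uff30" is a full-width "P" and "\uff01" is a full-width "!".
normalized = normalize_instruction("\uff30lease  describe THE buildings, in 3 sentences\uff01")
```

Because the original and counterfactual branches receive the same normalized string, differences in surface wording of the instruction cannot leak into the comparison between the two descriptions.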
[0125] S1.4. Structured constraint information is obtained from the parsed normalized instruction text and used as a conditional control quantity in the paragraph description generation process to achieve unified control over sentence structure, paragraph organization, focus, and expression style. At the same time, the image, normalized instruction, and structured constraint are combined into a standardized input unit so that the counterfactual sample construction, dual-path description generation, and consistency constraint calculation in the subsequent training stage are all based on the same symbol system.
[0126]
[0127] In this formula, the quantities are the instruction parsing operator and the structured constraint vector, whose components carry the sentence count or length constraints, the paragraph structure constraint information, the object-of-interest constraint information, and the expressive style constraint information;
[0128] In one specific embodiment of the present invention, the instruction parsing operator can be implemented by deploying an agent-based language model system. Specifically, the normalized instruction text is fed into a central agent, which is guided by a preset task prompt template to extract structured information and understand intent, and thereby stably outputs discrete or continuous length constraints, structural constraints, objects of interest, and expression style. This ensures that complex instruction conditions are effectively reduced in dimensionality and injected into the generative model.
[0129] The structured constraint vector is used to control the output form and content focus of paragraph descriptions consistently across the training and inference phases. The single optical remote sensing image, the normalized instruction text, and the structured constraint vector are then combined into a standardized input unit.
[0130]
[0131] In this formula, the quantity is the standardized input unit, which consists of the single optical remote sensing image, the normalized instruction text, and the structured constraint vector;
[0132] Because the training dataset uses triples in which the target description text corresponds to the image, each triple can be mapped through "instruction normalization and parsing" to an equivalent supervised training unit. This allows subsequent counterfactual sample construction, dual-path paragraph description generation, and consistency constraint calculation to be completed under a unified input / output interface.
[0133] Furthermore, as shown in Figure 3, the steps in S2 for setting a set of non-causal imaging perturbation types and selecting perturbation types and their degrees for training samples are as follows:
[0134] S2.1 Construct a set of non-causal imaging perturbation types to characterize imaging perturbation factors that are unrelated to the target semantics but cause changes in the appearance of the image, and bind each type of perturbation to an executable perturbation operator family to ensure that the corresponding perturbation can be called with a unified interface in the subsequent counterfactual image construction; secondly, standardize the perturbation type set so that it can be selected and recorded in a consistent identification manner in different training batches and different samples, thereby supporting subsequent robust training and traceable analysis.
[0135]
[0136]
[0137] In these formulas, the quantities are the set of non-causal imaging perturbation types, each perturbation type in the set, the number of perturbation types, the perturbation type index, and the set of perturbation operators. Each perturbation type has a corresponding perturbation operator family or perturbation generation rule, which is used to apply the selected perturbation to a specified area of the image.
[0138] S2.2 For each training sample, select at least one perturbation type from the set of non-causal imaging perturbation types according to the sample content and instruction conditions, and determine the perturbation degree corresponding to the perturbation type; wherein the perturbation degree is used to control the magnitude of the perturbation's influence on the background appearance, thereby forming a counterfactual sample distribution that covers different imaging condition variations; secondly, random sampling, strategy sampling, or adaptive sampling based on historical sensitivity can be used to improve the diversity and targeting of perturbation selection.
[0139]
[0140]
[0141] In these formulas, the quantities are the perturbation type selected for each sample, the perturbation degree control quantity corresponding to that type, the distribution for selecting the perturbation type given the image and instruction text, and the distribution for selecting the perturbation degree given the perturbation type and the sample. These distributions can be implemented by preset rules, random mechanisms, or learnable strategies, so that perturbation selection is controllable and scalable.
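The random-sampling option of S2.2 can be sketched as below. This is a simplified illustration, not the patent's selection distributions: the type names, the degree ranges, and the function names are hypothetical, and the draw ignores the image and instruction content, which the strategy-based and adaptive variants would take into account. A seeded generator is used so that the recorded choices can be replayed.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical perturbation types and degree ranges, for illustration only.
PERTURBATION_RANGES: Dict[str, Tuple[float, float]] = {
    "haze": (0.1, 0.6),
    "gamma": (0.5, 2.0),
    "noise": (0.01, 0.1),
    "blur": (1.0, 3.0),
}


def sample_perturbation(rng: random.Random) -> Tuple[str, float]:
    """Draw a perturbation type uniformly, then a degree from that type's range."""
    kind = rng.choice(sorted(PERTURBATION_RANGES))
    low, high = PERTURBATION_RANGES[kind]
    return kind, rng.uniform(low, high)


def sample_for_dataset(num_samples: int, seed: int) -> List[Tuple[str, float]]:
    """Return one (type, degree) record per sample; the same seed gives the same records."""
    rng = random.Random(seed)
    return [sample_perturbation(rng) for _ in range(num_samples)]


log = sample_for_dataset(num_samples=4, seed=0)
replay = sample_for_dataset(num_samples=4, seed=0)
```

Storing the per-sample records, or just the seed, is one simple way to support the replay and consistency verification that the method calls for.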
[0142] S2.3. Based on the selected perturbation type and degree, the corresponding perturbation operator family is called to generate perturbation operator instances for subsequent counterfactual construction. The perturbation type, perturbation operator instances and their control variables are used as control conditions for counterfactual image generation so that the perturbation is consistently applied to the background region under the target region protection constraint. Secondly, to ensure the stability and reproducibility of the training process, the perturbation type identifier and control variable corresponding to each sample can be recorded and replayed or verified for consistency when needed.
[0143]
[0144] In this formula, the quantities are the perturbation operator instance corresponding to each sample, the perturbation result obtained by applying that instance to the input image, the perturbation result obtained by applying the operator family of the selected perturbation type to the input image under the selected perturbation degree, as related in the formula above, and the set of executable perturbation operator instances. The perturbation operator instance is used in subsequent steps to apply non-causal imaging perturbations to the background region and construct counterfactual images.
[0145] In a specific embodiment of the present invention, the family of perturbation operators and their corresponding non-causal imaging perturbations can be implemented using conventional digital image processing algorithms or physical degradation models. For example, Gaussian blur or downsampling followed by upsampling can be used to simulate resolution degradation; gamma correction and random brightness shifts can be used to simulate radiation and illumination variations; Gaussian white noise or salt-and-pepper noise can be injected to simulate sensor noise; or atmospheric scattering physics models (such as the inverse process of dark channel priors) can be used to superimpose haze effects. The perturbation degree controls the kernel size, variance, or strength coefficient of the above algorithms.
[0146] Furthermore, as shown in Figure 4, the steps in S3 for constructing a counterfactual image under the target region protection mask constraint are as follows:
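A few of these perturbation operators can be sketched in plain Python on a small grayscale image. The code is a toy illustration under stated assumptions: images are nested lists with values in [0, 1], all function names are hypothetical, a box blur stands in for the Gaussian blur named in the text, and the haze operator uses the standard atmospheric scattering form I = J·t + A·(1 − t) with a spatially constant transmission t. A practical implementation would use an array or image library.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image with values in [0, 1]


def clip01(v: float) -> float:
    return min(1.0, max(0.0, v))


def gamma_shift(img: Image, gamma: float) -> Image:
    """Radiation / illumination change: pixel -> pixel ** gamma."""
    return [[clip01(p ** gamma) for p in row] for row in img]


def gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Sensor noise: add zero-mean Gaussian noise with standard deviation sigma."""
    return [[clip01(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def box_blur(img: Image, radius: int) -> Image:
    """Resolution degradation: mean over a (2r+1) x (2r+1) window, cropped at the border."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def haze(img: Image, transmission: float, airlight: float = 1.0) -> Image:
    """Atmospheric scattering: I = J * t + A * (1 - t), constant t over the image."""
    return [[clip01(p * transmission + airlight * (1.0 - transmission)) for p in row]
            for row in img]


img = [[0.0, 0.5], [0.5, 1.0]]
darker = gamma_shift(img, 2.0)
hazy = haze(img, 0.5)
blurred = box_blur(img, 1)
noisy = gaussian_noise(img, 0.05, random.Random(0))
```

In each operator a single scalar (`gamma`, `sigma`, `radius`, `transmission`) plays the role of the perturbation degree.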
[0147] S3.1. Based on the target area protection mask, the optical remote sensing image is divided into regions to clarify the spatial range of the target area and the background area. The target area is regarded as the semantically preserved protected area, and the background area is regarded as the area where non-causal imaging perturbation is applied. The purpose of the region division is to ensure that the counterfactual construction only changes the imaging appearance factors that are unrelated to the target semantics, so that the counterfactual sample and the original sample are consistent at the target semantic level but have controllable differences at the imaging condition level.
[0148]
[0149] In this formula, the quantities are the single optical remote sensing image of each sample, the target region protection mask, the background region mask, an all-ones matrix with the same size as the image, and element-wise multiplication. The expression divides an image into a target region and a background region according to the mask. The background region mask and the target region protection mask are complementary, which provides the constraint basis for applying perturbation only to the background region in later steps.
[0150] In one specific embodiment of the present invention, the segmentation model may be a prompt-driven segmentation foundation model (such as SAM, the Segment Anything Model) or a pre-trained general instance segmentation network (such as Mask R-CNN). In practical applications, the object of interest in the instruction text or target description text can be extracted as a text prompt, or center coordinates can be extracted with an image saliency detection algorithm and used as a point prompt for the segmentation model, thereby locating and outputting a target region mask that is semantically relevant to the current description task.
[0151] S3.2. Call the perturbation operator instance corresponding to the selected non-causal imaging perturbation type, and apply the non-causal imaging perturbation to the background region under the constraint of the perturbation degree control amount to obtain the image content after background perturbation; wherein the operation of applying the non-causal imaging perturbation only acts on the pixel position covered by the background region mask to avoid damaging the key structure and semantic clues of the target region, and to enable the perturbation effect to be consistently reused to construct comparable counterfactual sample pairs.
[0152]
[0153] In this formula, the quantities are the background perturbation result obtained for each sample after applying the non-causal imaging perturbation and the executable perturbation operator instance corresponding to each sample. The instance is the perturbation operator family of the selected perturbation type acting on the input image under the control of the perturbation degree. The background perturbation result is merged with the target region content in subsequent steps to construct counterfactual images.
[0154] S3.3. While keeping the target region unchanged, the original image content of the target region is fused with the perturbed image content of the background region to obtain a counterfactual image. The fusion process follows the combination rule of "target region preservation and background region perturbation" at the pixel / feature level, thereby ensuring that the counterfactual image remains consistent with the target semantics. At the same time, the imaging condition changes are explicitly introduced to support counterfactual contrast training and causal enhancement learning.
[0155]
[0156] In this formula, the quantities are the counterfactual image corresponding to each sample and the image content after background perturbation. The counterfactual image obtained through this fusion is consistent with the original image in the target region, while its appearance in the background region varies with the type and degree of perturbation, thus forming counterfactual sample pairs that can be used for consistency training.
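Steps S3.1 to S3.3 can be illustrated with the sketch below. It is a toy version under stated assumptions: images and masks are nested lists, the names `dilate`, `build_counterfactual`, and `darken` are hypothetical, dilation with a square window stands in for the unspecified morphological processing, and a uniform darkening stands in for a real perturbation operator. For a binary protection mask M, picking each pixel from the original or the perturbed image is equivalent to the fusion M ⊙ I + (1 − M) ⊙ T(I), where M, I, and T are assumed symbols.

```python
from typing import Callable, List

Image = List[List[float]]
Mask = List[List[int]]  # 1 = target region, 0 = background


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Morphological dilation with a square window, adding a protective margin."""
    h, w = len(mask), len(mask[0])
    return [[int(any(mask[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))))
             for x in range(w)] for y in range(h)]


def build_counterfactual(img: Image, target_mask: Mask,
                         perturb: Callable[[Image], Image]) -> Image:
    """Keep protected pixels from the original image, take the rest from perturb(img)."""
    protect = dilate(target_mask)
    perturbed = perturb(img)
    return [[img[y][x] if protect[y][x] else perturbed[y][x]
             for x in range(len(img[0]))] for y in range(len(img))]


def darken(img: Image) -> Image:
    """Stand-in perturbation: halve every pixel value."""
    return [[0.5 * p for p in row] for row in img]


image = [[0.8] * 5 for _ in range(5)]
target = [[0] * 5 for _ in range(5)]
target[0][0] = 1  # a one-pixel "target" in the top-left corner
counterfactual = build_counterfactual(image, target, darken)
```

After dilation the protected area is the 2 x 2 block at the top-left corner, so those pixels keep their original value while every other pixel is darkened.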
[0157] S3.4. The counterfactual image and the original image are paired to form a sample pair and a correspondence is established under the same instruction condition. In the subsequent description generation and consistency constraint calculation, the condition input of the original branch and the counterfactual branch is consistent, so that the model training can be carried out around the goal of "the description semantics remain unchanged under non-causal perturbations". At the same time, the perturbation type identifier and control quantity corresponding to each sample pair can be recorded for training process playback and traceability analysis.
[0158]
[0159] In this formula, the quantities are the set of counterfactual sample pairs and each pair of original image and counterfactual image under the same instruction text. The set of sample pairs is used for dual-path description generation and counterfactual consistency constraint calculation in subsequent steps.
[0160] Furthermore, as shown in Figure 5, the steps in S4 for generating paragraph-style descriptions of the original image and the counterfactual image under the same instruction text conditions are as follows:
[0161] S4.1 Generate structural planning information for paragraph descriptions under the constraints of the instruction text. The structural planning information is used to explicitly represent the organization method and content focus between paragraph sentences, so that the subsequent generation process can maintain consistency in terms of "sentence structure, sentence connection, focus and expression style", thereby ensuring that the outputs of the original branch and the counterfactual branch are comparable and providing a stable semantic alignment basis for counterfactual consistency constraints.
[0162]
[0163] In this formula, the quantities are the paragraph structure planning information for each sample, the planning generation operator determined by the parameters of the remote sensing multimodal large model, the single optical remote sensing image, and the instruction text. The planning information can be used to limit the sentence organization, content coverage, or expression form of paragraphs to improve the structural stability and conditional consistency of paragraph generation.
[0164] In one specific embodiment of the present invention, the planning generation operator can be built from a multilayer perceptron or a lightweight feedforward network with a self-attention mechanism. The operator can serve as an independent prediction head or pre-adaptation module of the remote sensing multimodal large model. It receives the multimodal features fused from the image and the instruction as input and maps them to structured latent variables that guide inter-sentence logic and content focus, that is, the paragraph structure planning information. The parameters of this operator are updated synchronously with the backbone network in the subsequent joint optimization steps.
[0165] S4.2. Multimodal representation extraction and fusion of images and instruction text are performed to map image content and instruction constraints to a unified conditional semantic space, enabling the generator to utilize both image semantic cues and instruction constraints during the decoding stage. The fused conditional representation serves as the conditional input for paragraph-style description generation, ensuring the controllability and consistency of the generated content under the same instruction conditions.
[0166]
[0167] In this formula, the quantities are the visual encoding operator, the instruction text encoding operator, the image representation, the instruction representation, the cross-modal fusion operator, the fused conditional semantic representation, and the paragraph structure planning information. The fused representation constrains both the "generated content" and the "paragraph organization" during the decoding stage, thereby improving the stability of paragraph description output.
[0168] In one specific embodiment of the present invention, the visual encoding operator can be implemented using a Vision Transformer (ViT) or a deep convolutional neural network such as a Residual Network (ResNet); the instruction text encoding operator can be implemented through the word embeddings and feedforward layers of pre-trained large language models (such as LLaMA or Qwen); and the cross-modal fusion operator can be composed of a linear projection layer, a multilayer perceptron adapter (MLP adapter), or a cross-attention mechanism.
[0169] S4.3 Under the constraints of the fusion condition representation and paragraph structure planning information, paragraph-style descriptive text is output using a sequence generation method. The paragraph-style descriptive text consists of multiple sentences and can be further represented as a hierarchical structure of "sentence sequence - word sequence" so as to uniformly supervise paragraph-level semantics and inter-sentence structure during training.
[0170]
[0171]
[0172]
[0173] In these formulas, the quantities are the text decoding generation operator, the paragraph-style description generated for each sample, the number of sentences in the generated paragraph, each generated sentence, the tokens within a sentence, and the token length of each sentence. The hierarchy characterizes the inter-sentence organization and intra-sentence expression of the paragraph output, so that subsequent consistency constraints can be stably aligned at the paragraph semantic level.
[0174] In one specific embodiment of the present invention, the text decoding generation operator is an autoregressive Transformer decoder based on a self-attention mechanism, which generates the multi-sentence paragraph token by token under causal masking.
[0175] S4.4. Under the same instruction text conditions, perform the above paragraph generation process on the original image and the counterfactual image respectively to obtain the original branch description result and the counterfactual branch description result, and establish a one-to-one correspondence between the two under the same condition constraint, so as to be used for subsequent calculation and joint optimization of counterfactual consistency constraints.
[0176]
[0177]
[0178] In these formulas, the quantities are the conditional paragraph generation mapping of the remote sensing multimodal large model, the paragraph description result for the original image under the instruction text, and the paragraph description result for the counterfactual image under the same instruction text. Because both description results are generated under the same input conditions, the difference between them comes mainly from the change in imaging conditions rather than from differences in the instruction, which provides reliable input for subsequent counterfactual consistency training.
[0179] Furthermore, as shown in Figure 6, the steps in S5 for supervising the paragraph description generation process based on the target description text and introducing counterfactual consistency constraints to jointly optimize the remote sensing multimodal large model are as follows:
[0180] S5.1 Construct description generation supervision constraints based on the generation results of the target description text and the original image branch, so that the model can generate description text that is consistent with the target semantics and has a reasonable paragraph structure under the instruction text condition; wherein the supervision constraints take the word sequence of the paragraph text as the supervision object, and constrain the conditional probability of each step of generation under the autoregressive generation framework to ensure that the generated content is aligned with the target description at the semantic level.
[0181]
[0182] Here, the quantities are defined as follows: the description generation supervised loss; the trainable parameters of the remote sensing multimodal large model; the training sample set; the conditional probability that the model generates the target paragraph description given the input image and the instruction text; and the mathematical expectation over the training sample set. The supervised loss constrains the model output to be consistent with the target description text and provides a stable generation baseline for the subsequent introduction of counterfactual consistency.
[0183] S5.2. Construct the counterfactual consistency constraint from the two description results generated for the original image and the counterfactual image under the same instruction text condition, so that the model stays semantically invariant, in a causal sense, to non-causal imaging perturbations. The constraint is implemented by mapping the two descriptions into a unified semantic space and constraining their semantic embedding representations to agree, which reduces the model's sensitivity to changes in imaging appearance and keeps it from using non-causal perturbations as a basis for description decisions.
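As a worked illustration of the supervised loss in S5.1, the sketch below assumes the common autoregressive form: the negative log of the per-step conditional probabilities, summed over the target tokens and averaged over the samples. That exact form is an assumption, since the text only states that each generation step's conditional probability is constrained and that an expectation is taken over the training set. The function names and the toy probabilities are hypothetical.

```python
import math


def sequence_nll(step_probs):
    """Negative log-likelihood of one target paragraph.

    step_probs[t] is the probability the model assigns to the t-th target
    token given the image, the instruction, and the previous target tokens.
    """
    return -sum(math.log(p) for p in step_probs)


def generation_loss(batch_step_probs):
    """Empirical expectation of the per-sample loss over a batch of samples."""
    return sum(sequence_nll(p) for p in batch_step_probs) / len(batch_step_probs)


# Two toy samples with made-up per-token probabilities.
batch = [[0.5, 0.5], [1.0, 0.25]]
loss = generation_loss(batch)
```

In practice the per-step probabilities would come from the decoder's softmax output evaluated at the target tokens (teacher forcing), and the loss would be computed with an automatic differentiation framework rather than plain Python.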
[0184]
[0185] Here, the quantities are defined as follows: the semantic embedding encoder and its parameters; the paragraph description result of the original image branch; the paragraph description result of the counterfactual image branch; and the semantic embedding vectors corresponding to the two descriptions. The semantic embedding encoder can be implemented by the text encoding module of the model so as to provide comparable semantic representations.
[0186] In a specific embodiment of the present invention, the semantic embedding encoder can be a pre-trained text representation model, such as BERT, RoBERTa, or the text encoder branch of a CLIP model. Such a model extracts global syntactic and semantic features rich in contextual information and provides a high-dimensional, comparable continuous vector space for the counterfactual consistency constraint.
[0187]
[0188] Here, the quantities are defined as follows: the semantic embedding consistency constraint loss; the set of counterfactual sample pairs; the cosine similarity function; and the mathematical expectation over the set of counterfactual sample pairs. By penalizing inconsistency between the semantic embeddings, this loss lets the model keep its paragraph descriptions semantically stable when background imaging conditions change, which achieves the training objective of counterfactual causal enhancement.
[0189] S5.3. Combine the description generation supervision constraint and the counterfactual consistency constraint into a joint objective, and update the model parameters based on that objective so that both "generation accuracy" and "causal robustness" are taken into account. The joint objective controls the relative contribution of the two constraints through a trade-off term, so that the model learns a reliable image-text mapping while keeping its semantic output invariant in the presence of non-causal imaging perturbations.
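The consistency loss of S5.2 can be illustrated with a small sketch. It assumes the loss takes the form "one minus cosine similarity", averaged over the counterfactual sample pairs; the text confirms only that cosine similarity and an expectation over the pair set are involved, so this specific form is an assumption. The embedding vectors below are made-up toy values rather than outputs of a real text encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(embedding_pairs):
    """Mean of (1 - cosine similarity) over (original, counterfactual) pairs.

    The loss is 0 when every pair of embeddings points in the same direction
    and grows as the two descriptions drift apart semantically.
    """
    total = sum(1.0 - cosine_similarity(e, e_cf) for e, e_cf in embedding_pairs)
    return total / len(embedding_pairs)


pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical semantics: contributes 0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal semantics: contributes 1
]
loss = consistency_loss(pairs)
```

Because cosine similarity ignores vector length, this form penalizes only a change in semantic direction, which matches the stated goal of comparing the two descriptions in a unified semantic space.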
[0190]
[0191] Here, the quantities are defined as follows: the joint training objective and the consistency weight coefficient. Minimizing the joint objective jointly updates the model parameters and, optionally, the semantic embedding encoder parameters, thereby achieving joint optimization of the remote sensing multimodal large model.
[0192]
[0193] Here, the result denotes the parameter solution obtained after joint optimization. It is used in the subsequent inference stage to generate paragraph-style description text under the instruction text condition and to maintain semantic consistency when imaging conditions change.
[0194] Furthermore, as shown in Figure 7, step S6, in which the jointly optimized remote sensing multimodal large model generates and outputs paragraph-style description text for a single optical remote sensing image under the instruction text condition, comprises the following sub-steps:
[0195] S6.1. Obtain the single optical remote sensing image to be described and its corresponding instruction text, and apply to the instruction text the same normalization and parsing as in the training stage to obtain structured constraint information that controls the form of the paragraph output. In the inference stage, the structured constraint information limits the sentence structure, inter-sentence organization, focus, and expression style of the paragraph output, so that the model stably outputs paragraph-style description results that meet requirements across different application scenarios.
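The role of the consistency weight in S5.3 can be seen in a small sketch. It assumes the joint objective is the supervised loss plus the weight times the consistency loss; the text describes a trade-off term and a consistency weight coefficient but the additive form is an assumption. The candidate parameter names and loss values are invented for illustration, and a real system would minimize the objective by gradient descent rather than by choosing among two candidates.

```python
def joint_objective(gen_loss: float, cons_loss: float, consistency_weight: float) -> float:
    """Weighted sum of the supervised loss and the consistency loss (assumed form)."""
    return gen_loss + consistency_weight * cons_loss


def best_params(candidates: dict, consistency_weight: float) -> str:
    """Return the candidate whose joint objective is smallest (a toy argmin)."""
    return min(
        candidates,
        key=lambda name: joint_objective(*candidates[name], consistency_weight),
    )


# Hypothetical (generation loss, consistency loss) for two parameter settings.
candidates = {
    "theta_a": (1.00, 0.40),  # slightly more accurate, less robust
    "theta_b": (1.10, 0.05),  # slightly less accurate, much more robust
}

choice_without_consistency = best_params(candidates, consistency_weight=0.0)
choice_with_consistency = best_params(candidates, consistency_weight=1.0)
```

With a zero weight the more accurate but less robust setting wins; with a weight of one the robust setting wins, which is the trade-off between "generation accuracy" and "causal robustness" that the joint objective is meant to control.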
[0196]
[0197] Here, the quantities are defined as follows: the instruction text input in the inference stage; the instruction normalization operator; the normalized instruction text; the instruction parsing operator; and the structured constraint vector obtained by parsing the instruction text. The structured constraint vector controls the output form and content focus of the paragraph description and is consistent with the conditional semantics used in the training stage.
[0198] S6.2. Combine the image to be described, the normalized instruction text, and its structured constraint information into a standardized input unit, and input the unit into the jointly optimized remote sensing multimodal large model to generate paragraph-style description text under the same condition control. The standardized input unit keeps the data interface of the inference stage consistent with that of the training stage, so that the model can directly reuse the conditional generation capability obtained in training and avoid output instability caused by differences in input format.
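The normalization and parsing of S6.1 can be sketched as two small functions. Everything concrete here is an assumption made for illustration: the normalization rules (whitespace collapsing and lowercasing), the phrase patterns, and the three fields of `StructuredConstraints` are hypothetical and are not specified in the text, which only requires that the same operators be used in training and inference.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StructuredConstraints:
    """Hypothetical structured constraint record parsed from an instruction."""
    max_sentences: Optional[int]
    focus: Optional[str]
    style: Optional[str]


def normalize_instruction(raw: str) -> str:
    # Collapse runs of whitespace, trim the ends, and lowercase the text.
    return re.sub(r"\s+", " ", raw).strip().lower()


def parse_instruction(normalized: str) -> StructuredConstraints:
    # Each pattern is an illustrative convention, not part of the source.
    sentences = re.search(r"(\d+) sentences?", normalized)
    focus = re.search(r"focus on ([a-z ]+?)(?:[,.;]|$)", normalized)
    style = re.search(r"in an? ([a-z]+) style", normalized)
    return StructuredConstraints(
        max_sentences=int(sentences.group(1)) if sentences else None,
        focus=focus.group(1) if focus else None,
        style=style.group(1) if style else None,
    )


raw = "  Describe in 3 sentences,\tfocus on Buildings, in a formal style. "
normalized = normalize_instruction(raw)
constraints = parse_instruction(normalized)
```

Keeping normalization idempotent (applying it twice gives the same text) is a simple way to guarantee that training-time and inference-time instructions reach the parser in the same form.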
[0199]
[0200] Here, the quantities are defined as follows: the single optical remote sensing image to be described; and the standardized input unit for the inference stage, which consists of the image, the normalized instruction text, and the structured constraint vector. The standardized input unit serves as the conditional input that triggers the paragraph-style description generation process.
[0201] S6.3. Based on the standardized input unit, call the jointly optimized remote sensing multimodal large model to generate paragraph-style description text and output it as the inference result. The generation process produces a multi-sentence paragraph under the constraints of the structural planning information and the multimodal fused condition representation, so that the output text conforms to the instruction constraints in both inter-sentence structure and semantic content and keeps its semantic expression stable and consistent when imaging conditions change.
[0202]
[0203] Here, the model is the jointly optimized remote sensing multimodal large model evaluated at the jointly optimized parameters. The paragraph-style description text it generates is output as a multi-sentence structure under the constraint of the instruction text and serves as the final description result for the image to be described.
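Steps S6.2 and S6.3 can be sketched together as building one input unit and handing it to the model. The sketch is illustrative only: `InputUnit`, `build_input_unit`, `describe`, and the toy model are hypothetical names, the nested-list image is a stand-in for a real image array, and trimming the output to a sentence limit is a simple stand-in for constraint-controlled generation, which in the described method happens inside the model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class InputUnit:
    """Standardized input: image, normalized instruction, structured constraints."""
    image: Sequence[Sequence[float]]
    instruction: str
    constraints: dict


def build_input_unit(image, normalized_instruction: str, constraints: dict) -> InputUnit:
    # Reject incomplete inputs so inference uses the same interface as training.
    if not image or not normalized_instruction:
        raise ValueError("image and instruction are both required")
    return InputUnit(image, normalized_instruction, dict(constraints))


def describe(model: Callable[[InputUnit], List[str]], unit: InputUnit) -> str:
    sentences = model(unit)
    limit = unit.constraints.get("max_sentences")
    if limit is not None:
        sentences = sentences[:limit]  # stand-in for constraint-aware decoding
    return " ".join(sentences)


def toy_model(unit: InputUnit) -> List[str]:
    # Hypothetical stand-in for the jointly optimized multimodal model.
    return [
        "A river crosses the scene.",
        "Buildings line the north bank.",
        "Farmland lies to the south.",
    ]


unit = build_input_unit([[0.1, 0.2], [0.3, 0.4]], "describe the scene", {"max_sentences": 2})
text = describe(toy_model, unit)
```

Routing every request through a single `InputUnit` type is one way to keep the inference interface identical to the training interface, which is the purpose the text assigns to the standardized input unit.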
[0204] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention. The scope of protection of the present invention should be determined by the scope of protection of the appended claims.
Claims
1. A method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement, characterized in that it comprises the following steps:
S1. Obtain training samples, each comprising a single optical remote sensing image and the corresponding target description text, and obtain the instruction text corresponding to the optical remote sensing image, the instruction text being used to limit the content range or expression form of the output description;
S2. Set a set of non-causal imaging perturbation types, and select a non-causal imaging perturbation type and its perturbation degree for the training samples;
S3. Segment the target region of the optical remote sensing image using a segmentation model to obtain a target region mask; apply morphological processing to the target region mask to generate a target region protection mask, thereby determining the background region; under the constraint of the target region protection mask, apply the non-causal imaging perturbation to the background region to construct a counterfactual image, such that the target region content of the counterfactual image is the same as that of the optical remote sensing image, while the background region has different imaging conditions owing to the applied perturbation;
S4. Under the same instruction text condition, input the optical remote sensing image and the counterfactual image into the remote sensing multimodal large model to generate the corresponding description results;
S5. Supervise the description generation process based on the target description text, introduce a counterfactual consistency constraint so that the description results of the optical remote sensing image and the counterfactual image are semantically consistent, and jointly optimize the remote sensing multimodal large model;
S6. Obtain the single optical remote sensing image to be described and its corresponding instruction text, and apply to the instruction text the same normalization and parsing as in the training stage to obtain structured constraint information for controlling the paragraph output format; combine the single optical remote sensing image to be described, the normalized instruction text, and the structured constraint information into a standardized input unit for the inference stage, and input it into the jointly optimized remote sensing multimodal large model; based on the standardized input unit for the inference stage, call the jointly optimized remote sensing multimodal large model to generate paragraph-style description text, and output the paragraph-style description text as the inference result.
2. The method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement according to claim 1, characterized in that the steps for obtaining the training samples and the instruction text in S1 are as follows:
S1.1. Construct a training sample set and organize the training data in the paired format "image - target description - instruction text" to ensure that the supervision signal and the conditional input remain consistent and traceable during subsequent training; then perform a sample integrity check on the training samples to ensure that each sample contains at least a single optical remote sensing image, the target description text corresponding to that image, and the instruction text corresponding to that image, thereby providing a unified data interface for counterfactual sample construction and consistency constraint calculation; wherein the quantities are defined as follows: the training sample set; the training sample triplet at a given sample index; the single optical remote sensing image input; the target description text corresponding to the image; the instruction text corresponding to the image; the number of training samples; and the sample index;
S1.2. Represent the target description text as a paragraph structure, so that the target description explicitly expresses the hierarchy in which a paragraph is composed of multiple sentences and each sentence is composed of a token sequence; this allows subsequent training to supervise the overall paragraph semantics and, when needed, to apply consistency constraints on inter-sentence organization and intra-sentence expression, thereby improving the stability and controllability of paragraph generation; wherein the quantities are defined as follows: the number of sentences in the paragraph; the text of each sentence; each token within a sentence; and the token length of each sentence; this representation is used to model paragraph-level supervision signals uniformly in the training stage and to provide standardized text objects for subsequent semantic consistency calculation;
S1.3. Obtain and normalize the instruction text: represent the instruction text as a token sequence and apply format unification and field standardization; the normalization removes instruction noise and unifies the form of instruction expression, so that the same task constraint carries consistent conditional semantics across samples, which ensures that the model generates comparable paragraph description results under the same condition when given the original image and the counterfactual image; wherein the quantities are defined as follows: each token of the instruction text; the token length of the instruction text; the instruction normalization operator; the original instruction text; and the normalized instruction text; the normalization operator can implement instruction templating, removal of redundant information, and unified expression of key constraint fields;
S1.4. Parse the normalized instruction text to obtain structured constraint information, and use it as a conditional control quantity in the paragraph description generation process to achieve unified control over sentence structure, paragraph organization, focus, and expression style; at the same time, combine the image, the normalized instruction, and the structured constraint into a standardized input unit, so that counterfactual sample construction, dual-path description generation, and consistency constraint calculation in the subsequent training stage all use the same symbol system; wherein the quantities are defined as follows: the instruction parsing operator; and the structured constraint vector, whose components are the sentence count or length constraint, the paragraph structure constraint information, the object-of-interest constraint information, and the expression style constraint information; the structured constraint vector is used to control the output form and content focus of paragraph descriptions consistently in the training and inference stages; the single optical remote sensing image, the normalized instruction text, and the structured constraint vector then form the standardized input unit; because the training data set uses triplets that also contain the target description text corresponding to the image, each triplet can be mapped by instruction normalization and parsing to an equivalent supervised training unit, so that subsequent counterfactual sample construction, dual-path paragraph description generation, and consistency constraint calculation are all completed under a unified input/output interface.
3. The method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement according to claim 1, characterized in that the steps in S2 for setting the set of non-causal imaging perturbation types and selecting the perturbation type and its degree for the training samples are as follows:
S2.1. Construct the set of non-causal imaging perturbation types to characterize imaging perturbation factors that are unrelated to the target semantics but change the appearance of the image, and bind each perturbation type to an executable perturbation operator family, so that the corresponding perturbation can be invoked through a unified interface during subsequent counterfactual image construction; then standardize the perturbation type set so that perturbations are selected and recorded with consistent identifiers across training batches and samples, thereby supporting subsequent robust training and traceable analysis; wherein the quantities are defined as follows: the set of non-causal imaging perturbation types; each perturbation type; the number of perturbation types; the perturbation type index; the set of perturbation operators; and, for each perturbation type, the corresponding perturbation operator family or perturbation generation rule, which is used to apply the selected perturbation to a specified region of the image;
S2.2. For each training sample, select at least one perturbation type from the set of non-causal imaging perturbation types according to the sample content and the instruction condition, and determine the perturbation degree for that type; the perturbation degree controls how strongly the perturbation affects the background appearance, so that the counterfactual samples form a distribution covering different variations of imaging conditions; random sampling, strategy-based sampling, or adaptive sampling based on historical sensitivity can be used to improve the diversity and targeting of perturbation selection; wherein the quantities are defined as follows: the perturbation type selected for each sample; the perturbation degree control quantity for that type; the selection distribution over perturbation types given the image and the instruction text; and the selection distribution over perturbation degrees given the perturbation type and the sample; these distributions can be implemented by preset rules, random mechanisms, or learnable strategies, so that perturbation selection is controllable and extensible;
S2.3. Based on the selected perturbation type and degree, invoke the corresponding perturbation operator family to generate a perturbation operator instance for subsequent counterfactual construction; the perturbation type, the perturbation operator instance, and its control quantity serve as control conditions for counterfactual image generation, so that the perturbation is applied consistently to the background region under the target region protection constraint; to keep the training process stable and reproducible, the perturbation type identifier and control quantity of each sample can be recorded and replayed or checked for consistency when needed; wherein the quantities are defined as follows: the perturbation operator instance for each sample; the perturbation result obtained by applying the perturbation operator instance to the input image, which equals the result of the perturbation operator family of the selected type acting on the input image under the control of the perturbation degree; and the set of executable perturbation operator instances; the perturbation operator instance is used in subsequent steps to apply the non-causal imaging perturbation to the background region to construct counterfactual images.
4. The method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement according to claim 1, characterized in that the steps in S3 for constructing a counterfactual image under the constraint of the target region protection mask are as follows:
S3.1. Based on the target region protection mask, divide the optical remote sensing image into regions to fix the spatial extent of the target region and the background region; the target region is treated as the semantically preserved protected region, and the background region is treated as the region to which the non-causal imaging perturbation is applied; the purpose of the division is to ensure that counterfactual construction changes only imaging appearance factors that are unrelated to the target semantics, so that the counterfactual sample and the original sample agree at the target semantic level while differing controllably at the imaging condition level; wherein the quantities are defined as follows: the single optical remote sensing image of each sample; the target region protection mask; the background region mask; an all-ones matrix of the same size as the image; and element-wise multiplication; these are used to divide the image into a target region part and a background region part according to the mask; the background region mask and the target region protection mask are complementary, which provides the constraint basis for applying the perturbation only to the background region;
S3.2. Invoke the perturbation operator instance corresponding to the selected non-causal imaging perturbation type, and apply the non-causal imaging perturbation to the background region under the control of the perturbation degree to obtain the image content after background perturbation; the perturbation acts only on pixel positions covered by the background region mask, which avoids damaging the key structures and semantic cues of the target region and allows the perturbation effect to be reused consistently to construct comparable counterfactual sample pairs; wherein the quantities are defined as follows: the background perturbation result obtained for each sample after the non-causal imaging perturbation is applied; and the executable perturbation operator instance for each sample, which is formed by the perturbation operator family of the selected perturbation type under the control of the perturbation degree and transforms the input image into the background perturbation result; the background perturbation result is merged with the target region content in subsequent steps to construct the counterfactual image;
S3.3. While keeping the target region unchanged, fuse the original image content of the target region with the perturbed image content of the background region to obtain the counterfactual image; the fusion follows the combination rule "preserve the target region, perturb the background region" at the pixel or feature level, so that the counterfactual image stays consistent with the target semantics while explicitly introducing changes in imaging conditions to support counterfactual contrast training and causally enhanced learning; wherein the quantities are defined as follows: the counterfactual image corresponding to each sample; and the image content after background perturbation; the counterfactual image obtained through this fusion matches the original image in the target region, while its appearance in the background region varies with the perturbation type and degree, thereby forming counterfactual sample pairs that can be used for consistency training;
S3.4. Pair the counterfactual image with the original image to form a sample pair, and establish their correspondence under the same instruction condition, so that in the subsequent description generation and consistency constraint calculation the conditional inputs of the original branch and the counterfactual branch are identical and model training can proceed around the goal that the description semantics remain unchanged under non-causal perturbations; the perturbation type identifier and control quantity of each sample pair can also be recorded for training replay and traceability analysis; wherein the quantities are defined as follows: the set of counterfactual sample pairs; and each pair of original image and counterfactual image under the same instruction text condition; the set of sample pairs is used for dual-path description generation and counterfactual consistency constraint calculation in subsequent steps.
5. The method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement according to claim 1, characterized in that the steps in S4 for generating paragraph-style descriptions of the original image and the counterfactual image under the same instruction text condition are as follows:
S4.1. Generate structural planning information for the paragraph description under the constraint of the instruction text; the structural planning information explicitly represents how the sentences of the paragraph are organized and what content they focus on, so that the subsequent generation process stays consistent in sentence structure, sentence connection, focus, and expression style, which makes the outputs of the original branch and the counterfactual branch comparable and provides a stable semantic alignment basis for the counterfactual consistency constraint; wherein the quantities are defined as follows: the paragraph structure planning information for each sample; the planning generation operator determined by the parameters of the remote sensing multimodal large model; the single optical remote sensing image; and the instruction text; the planning information can be used to limit the inter-sentence organization, content coverage, or expression form of the paragraph, thereby improving the structural stability and conditional consistency of paragraph generation;
S4.2. Extract and fuse multimodal representations of the image and the instruction text, mapping the image content and the instruction constraints into a unified conditional semantic space so that the generator can use image semantic cues and instruction constraints simultaneously in the decoding stage; the fused conditional representation serves as the conditional input for paragraph-style description generation, ensuring that the generated content is controllable and consistent under the same instruction condition; wherein the quantities are defined as follows: the visual encoding operator; the instruction text encoding operator; the image representation; the instruction representation; the cross-modal fusion operator; the fused conditional semantic representation; and the paragraph structure planning information; the fused representation is used in the decoding stage to constrain both the generated content and the way the paragraph is organized, thereby improving the stability of paragraph description output;
S4.3. Under the constraints of the fused conditional representation and the paragraph structure planning information, output paragraph-style description text by sequence generation; the paragraph-style description text consists of multiple sentences and can further be represented as the hierarchy "sentence sequence - token sequence", so that paragraph-level semantics and inter-sentence structure can be supervised uniformly during training; wherein the quantities are defined as follows: the text decoding generation operator; the paragraph-style description generated for each sample; the number of sentences in the generated paragraph; each sentence; each token within a sentence; and the token length of each sentence; the hierarchy characterizes the inter-sentence organization and intra-sentence expression of the paragraph output, so that subsequent consistency constraints can be aligned stably at the paragraph semantic level;
S4.4. Under the same instruction text condition, run the above paragraph generation process on the original image and on the counterfactual image to obtain the original-branch description result and the counterfactual-branch description result, and establish a one-to-one correspondence between the two under the same condition constraint, for use in the subsequent computation of the counterfactual consistency constraint and in joint optimization; wherein the quantities are defined as follows: the conditional paragraph generation mapping of the remote sensing multimodal large model; the paragraph description result for the original image under the instruction text condition; and the paragraph description result for the counterfactual image under the same instruction text condition; generating both description results under the same input conditions ensures that any difference between them comes mainly from the change in imaging conditions rather than from a difference in instructions, which provides reliable input for the subsequent counterfactual consistency training.
6. The method for generating remote sensing multimodal image descriptions with counterfactual causal enhancement according to claim 1, characterized in that the steps in S5 for supervising the paragraph description generation process based on the target description text and introducing counterfactual consistency constraints to jointly optimize the remote sensing multimodal large model are as follows:
S5.1. Construct the description generation supervision constraint from the target description text and the generation result of the original image branch, so that under the instruction text condition the model generates description text that is consistent with the target semantics and has a reasonable paragraph structure; the supervision constraint takes the token sequence of the paragraph text as its supervision object and, under the autoregressive generation framework, constrains the conditional probability of each generation step so that the generated content is aligned with the target description at the semantic level; wherein the quantities are defined as follows: the description generation supervised loss; the trainable parameters of the remote sensing multimodal large model; the training sample set; the conditional probability that the model generates the target paragraph description given the input image and the instruction text; and the mathematical expectation over the training sample set; the supervised loss constrains the model output to be consistent with the target description text and provides a stable generation baseline for the subsequent introduction of counterfactual consistency;
S5.2. Construct the counterfactual consistency constraint from the two description results generated for the original image and the counterfactual image under the same instruction text condition, so that the model stays semantically invariant, in a causal sense, to non-causal imaging perturbations; the constraint is implemented by mapping the two descriptions into a unified semantic space and constraining their semantic embedding representations to agree, which reduces the model's sensitivity to changes in imaging appearance and keeps it from using non-causal perturbations as a basis for description decisions; wherein the quantities are defined as follows: the semantic embedding encoder and its parameters; the paragraph description result of the original image branch; the paragraph description result of the counterfactual image branch; and the semantic embedding vectors corresponding to the two descriptions; the semantic embedding encoder can be implemented by the text encoding module of the model so as to provide comparable semantic representations; further, the semantic embedding consistency constraint loss is defined over the set of counterfactual sample pairs using the cosine similarity function and the mathematical expectation over that set; by penalizing inconsistency between the semantic embeddings, this loss lets the model keep its paragraph descriptions semantically stable when background imaging conditions change, thereby achieving the training objective of counterfactual causal enhancement;
S5.3. Combine the description generation supervision constraint and the counterfactual consistency constraint into a joint objective, and update the model parameters based on that objective so that both "generation accuracy" and "causal robustness" are taken into account; the joint objective controls the relative contribution of the two constraints through a trade-off term, so that the model learns a reliable image-text mapping while keeping its semantic output invariant in the presence of non-causal imaging perturbations; wherein the quantities are defined as follows: the joint training objective and the consistency weight coefficient; minimizing the joint objective jointly updates the model parameters and, optionally, the semantic embedding encoder parameters, thereby achieving joint optimization of the remote sensing multimodal large model; the parameter solution obtained after joint optimization is used in the subsequent inference stage to generate paragraph-style description text under the instruction text condition and to maintain semantic consistency when imaging conditions change.