Deep learning CO2 near real-time inversion method and system fusing multi-scale features

By integrating deep learning methods with multi-scale features, combined with rotational position encoding and learnable attention pooling, an efficient CO2 inversion model was constructed, which solved the problem of limited accuracy in CO2 inversion under complex atmospheric conditions and achieved high-precision and robust CO2 inversion.

CN121543465B · Active · Publication Date: 2026-06-23 · HEFEI INSTITUTE OF PHYSICAL SCIENCE CHINESE ACADEMY OF SCIENCES

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HEFEI INSTITUTE OF PHYSICAL SCIENCE CHINESE ACADEMY OF SCIENCES
Filing Date
2026-01-20
Publication Date
2026-06-23


Abstract

The application discloses a deep learning CO2 near real-time inversion method and system fusing multi-scale features, wherein observed spectral data is input into an inversion model, near real-time calculation can be performed in orbit, and a satellite data inversion result is output; the inversion model comprises a two-level fully connected projection, a Transformer encoder, a learnable attention pooling and a regression decoder which are sequentially connected; wherein each layer of the encoder comprises two sequentially connected sub-layers, namely a multi-head self-attention layer and a feedforward network layer, a residual connection layer and a normalization layer are introduced after each sub-layer, and in the multi-head self-attention layer, rotational position encoding is applied to Q and K before entering dot product attention, so that relative displacement is implicitly injected into the attention weight calculation through phase rotation; the inversion method adopts a data-driven CO2 inversion approach that uses satellite observations as features and ground-based in-situ observations as supervision, maintaining calculation efficiency while improving the modeling capability for nonlinear relationships between features.

Description

Technical Field

[0001] This invention relates to the field of environmental monitoring technology, and in particular to a deep learning method and system for near real-time CO2 inversion that integrates multi-scale features.

Background Technology

[0002] The sustained rise in atmospheric carbon dioxide (CO2) concentration is considered a key driver of global warming, and long-term, stable, and comparable monitoring is fundamental for understanding climate change processes and developing mitigation strategies. While ground-based observations offer advantages in accuracy and source tracing, their global coverage is limited by station distribution and representativeness. Therefore, satellite remote sensing—especially short-wave infrared (SWIR) observations—provides the possibility of obtaining near-global column-average CO2 concentrations, demonstrating unique value in terms of sensitivity and spatiotemporal coverage.

[0003] Early work on CO2 retrieval primarily relied on the physical framework of solving the radiative transfer equation, demonstrating good interpretability and consistency under clean atmospheric conditions. However, under complex conditions such as aerosol loading and temperature profile variations, computational costs and uncertainties increase significantly. In recent years, data-driven methods have been gradually introduced into CO2 retrieval: models based on satellite data such as OCO-2, including Deep Neural Networks (DNNs), Long Short-Term Memory Networks (LSTMs), and Transformers, have achieved empirical progress in accuracy and efficiency, indicating that deep learning can alleviate the computational bottlenecks and non-uniqueness issues of traditional retrieval to some extent. However, existing research still faces several limitations: firstly, the observational signal-to-noise ratio and spectral resolution constrain retrieval accuracy under complex atmospheric conditions; secondly, the spatiotemporal sparsity of high-quality labels limits model generalization; and thirdly, spectral / format differences between different tasks and payloads increase the difficulty of method transfer.

Summary of the Invention

[0004] Based on the technical problems existing in the background technology, this invention proposes a deep learning CO2 near real-time inversion method and system that integrates multi-scale features, which shows good accuracy and generalization ability in CO2 prediction tasks.

[0005] The deep learning CO2 near real-time inversion method proposed in this invention integrates multi-scale features. It inputs the observed spectral data into the inversion model, which can perform near real-time calculations in orbit and output satellite data inversion results.

[0006] The inversion model includes a two-level fully connected projection, a Transformer encoder, a learnable attention pooling, and a regression decoder connected in sequence.

[0007] Each layer of the encoder includes two sequentially connected sub-layers: a multi-head self-attention layer and a feedforward network layer. Each sub-layer is followed by a residual connection layer and a normalization layer. In the multi-head self-attention layer, rotational position encoding is applied to Q and K before entering the dot product attention. The relative displacement is implicitly injected into the attention weight calculation through phase rotation.

[0008] Furthermore, in the encoder, rotational position encoding is applied to Q and K before entering the dot product attention, specifically as follows:

[0009] After applying rotational position encoding to Q and K respectively, matrix multiplication is performed. The output features after passing through the scaling layer and activation layer are then multiplied by V to complete the dot product attention calculation.

[0010] Furthermore, the learnable attention pooling is used to adaptively aggregate the encoded features of the encoder output in the sequence dimension to obtain aggregated information, specifically as follows:

[0011] Denote the hidden state at the $t$-th time step as $h_t$. A scalar score $e_t$ is obtained through nonlinear extraction and data compression. The weights $\alpha_t$ are obtained by softmax over $e_t$ along the sequence dimension, and the context vector $c = \sum_t \alpha_t h_t$ is then obtained by weighted summation and serves as the aggregated information.

[0012] Furthermore, the training process of the inversion model is as follows:

[0013] A training set was constructed using the spectral data of OCO-2 as input features and in-situ ground observations as labels.

[0014] The training set is projected through a two-level fully connected layer and then input into the Transformer encoder. After applying rotational position encoding to Q and K, dot product attention is calculated. The encoded features output by the encoder are adaptively aggregated in the sequence dimension through a learnable attention pooling layer, and then passed through a regression decoder and the output head to obtain the prediction result.

[0015] Construct a loss function to adjust the trainable parameters of the inversion model.

[0016] Furthermore, the construction of the training set specifically includes:

[0017] A dataset was constructed using the OCO-2 spectrum as input features and ground-based in-situ observations as labels in a cross-validation framework.

[0018] A dataset is constructed by setting co-localization conditions with "dual threshold" constraints on input features and labels. The co-localization conditions include a window interval of no more than ±A minutes and a spatial distance of no more than B degrees, where A and B are positive numbers.

[0019] The dataset is divided into a training set, a validation set, and a test set, with the test set isolated from the training set and the validation set on the time axis, respectively.

[0020] The original spectral channels in the dataset are screened based on radiative transfer interpretability and signal-to-noise ratio;

[0021] A set of auxiliary scalar features is introduced: solar zenith angle and azimuth angle, sensor zenith angle and azimuth angle, surface air pressure, total aerosol optical thickness, cloud markers, and prior values of XCO2.

[0022] Independent Gaussian noise with a mean of 0 and a standard deviation of 0.02 is injected into the observed features in the training set, but no noise is added to the labels; no noise is injected into the validation set and the test set.

[0023] The labels participate in the loss at the physical quantity scale during training, and are denormalized back to the original physical units during the validation and visualization phases.

[0024] The mean $\mu$ and standard deviation $\sigma$ are fitted on the training set, and the same $(\mu, \sigma)$ are applied without refitting to the validation and test sets.

[0025] Furthermore, the trained inversion model is fine-tuned, specifically as follows:

[0026] Construct a dataset for model fine-tuning, and fix the stable linear relationship between the prior value of XCO2 and the target variable as an interpretable principal term;

[0027] The inversion model learns only the nonlinear residuals not covered by the principal term and performs lightweight, strictly leak-free adaptive calibration on the target domain.

[0028] Furthermore, the fixed use of a stable linear relationship between the prior value of XCO2 and the target variable as the interpretable principal term specifically refers to:

[0029] Let the source domain dataset be $D_S = \{(x_i^S, p_i^S, y_i^S)\}$ and the target domain dataset be $D_T = \{(x_j^T, p_j^T, y_j^T)\}$, where $x_i^S$ and $x_j^T$ are the $i$-th and $j$-th data items, which include the solar zenith angle and azimuth angle, sensor zenith angle and azimuth angle, surface air pressure, total aerosol optical thickness, and cloud markers; $p_i^S$ and $p_j^T$ are the prior values of XCO2 corresponding to the $i$-th and $j$-th data items; and $y_i^S$ and $y_j^T$ are the labels of the $i$-th and $j$-th data items;

[0030] The linear principal term of the XCO2 prior is fitted in the source domain using least squares: $y \approx a\,p + b$, which gives a first-order interpretation of the label $y$ through a fixed term characterizing the XCO2 prior, where $y$ is the label, $a$ and $b$ are parameters, and $p$ is the prior value of XCO2;

[0031] Define the residual $r$: $r = y - (a\,p + b)$;

[0032] Estimate its standardization parameters $(\mu_r, \sigma_r)$ in the source domain, and from them obtain the standardized target $z$: $z = (r - \mu_r) / \sigma_r$;

[0033] The training phase learns only the nonlinear mapping $x \mapsto z$; the prediction is then linearly restored during the inference stage to obtain the predicted value $\hat{y}$: $\hat{y} = a\,p + b + \mu_r + \sigma_r \hat{z}$, where $\hat{z}$ is the predicted value of the standardized residual;

[0034] This decomposition simplifies the global regression problem, which is susceptible to domain shift, into "robust first-order principal term + transferable second-order correction".
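The "principal term + residual" decomposition can be illustrated with a short NumPy sketch. The symbols are assumptions introduced for illustration, since the source formulas did not survive extraction: `prior` is the XCO2 prior value, `a` and `b` are the least-squares slope and intercept, `r` is the residual, and `z` is the standardized residual that the network would learn. The function names are hypothetical.

```python
import numpy as np

def fit_principal_term(prior, y):
    """Least-squares fit of the linear principal term y ~ a * prior + b (source domain)."""
    a, b = np.polyfit(prior, y, deg=1)   # coefficients come highest degree first
    return a, b

def make_residual_target(prior, y, a, b):
    """Residual left after the principal term, standardized with source-domain statistics."""
    r = y - (a * prior + b)
    mu_r, sigma_r = r.mean(), r.std()
    z = (r - mu_r) / sigma_r             # the only quantity the network is trained on
    return z, mu_r, sigma_r

def restore_prediction(prior, z_hat, a, b, mu_r, sigma_r):
    """Inference: undo the standardization and add the fixed principal term back."""
    return a * prior + b + mu_r + sigma_r * z_hat
```

With a perfect residual prediction (`z_hat == z`) the restoration reproduces the label exactly, so any remaining error comes from the learned residual model rather than from the fixed linear term.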

[0035] Furthermore, the implementation of lightweight, strictly leak-free adaptive calibration in the target domain specifically includes:

[0036] The first stage is to freeze the Transformer encoder and only update the output head to complete the coarse calibration of the residual distribution in the target domain with a large learning rate;

[0037] The second stage unfreezes the entire inversion model, refines the parameters with a learning rate lower than the set learning rate threshold, and uses early stopping to suppress overfitting.

[0038] Two-layer nonlinear calibration: a second-layer calibrator is introduced on a subset of the target domain training data, and a weak learner based on histogram gradient boosting regression is used to learn the error correction term. The calibrator input is $[\hat{y}^{(1)}, \tilde{x}]$, where $\hat{y}^{(1)}$ is the first-layer prediction after two-stage fine-tuning and $\tilde{x}$ denotes the input features after imputation and standardization.

[0039] Furthermore, during the fine-tuning process, a mechanism is used to inject random relative perturbations or randomly replace the prior value of XCO2 with the sample median, while ensuring that the trend of the main term remains unchanged.
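One way to read the perturbation mechanism above is sketched below. The relative noise scale `rel_scale` and the replacement probability `replace_prob` are hypothetical values, since the source does not state them, and the function name `perturb_prior` is illustrative. Keeping the "trend of the main term unchanged" is interpreted here as leaving the fitted principal-term parameters untouched and perturbing only the prior values fed to the model.

```python
import numpy as np

def perturb_prior(prior, rng, rel_scale=0.001, replace_prob=0.1):
    """Randomly jitter the XCO2 prior or replace it with the sample median."""
    prior = np.asarray(prior, dtype=float)
    # Random relative perturbation: prior * (1 + eps), eps ~ N(0, rel_scale^2)
    noisy = prior * (1.0 + rng.normal(0.0, rel_scale, size=prior.shape))
    # Random replacement with the median of the unperturbed sample
    mask = rng.random(prior.shape) < replace_prob
    return np.where(mask, np.median(prior), noisy), mask
```

The returned mask records which entries were replaced, which makes it easy to check afterwards how strongly the model depended on the prior.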

[0040] A computer system includes a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method described above.

[0041] The advantages of the deep learning-based near real-time CO2 inversion method and system integrating multi-scale features provided by this invention are as follows: It employs a data-driven CO2 inversion approach, using satellite observations as features and ground-based in-situ observations as supervision, and combines this with a set "dual threshold" constraint to obtain high-quality paired samples. The inversion model consists of modules such as input projection, an enhanced Transformer encoder, attention pooling, and a regression decoder. Based on this, the modules of the inversion model work synergistically in high-dimensional data scenarios, maintaining computational efficiency while further enhancing the ability to model nonlinear relationships between features.

Attached Figure Description

[0042] Figure 1 is a schematic diagram of the structure of the present invention;

[0043] Figure 2 is a schematic overview of the error structure and consistency (validation set), where (a) shows the point cloud highly clustered along the 1:1 line, with a slope close to 1, an intercept close to 0, and no systematic bends or "funnel-shaped" dispersion, and (b) shows the median absolute error (MdAE), the 90th percentile absolute error (P90), and the 95th percentile absolute error (P95);

[0044] Figure 3 shows the parallel validation comparison of the inversion model, the standard Transformer baseline model, and the general FT-Transformer regression model in this embodiment, where (a) shows the validation results of the inversion model and the standard Transformer baseline model, and (b) shows the validation results of the inversion model and the general FT-Transformer regression model;

[0045] Figure 4 is a schematic diagram of the impact of the split ratio;

[0046] Figure 5 shows the partitioning effects of different seeds before fine-tuning on the target domain, where (a) is the partitioning effect of random seed 2025, (b) is the partitioning effect of random seed 2026, and (c) is the partitioning effect of random seed 2027;

[0047] Figure 6 shows the partitioning effects of different seeds after fine-tuning on the target domain, where (a) is the partitioning effect of random seed 2025, (b) is the partitioning effect of random seed 2026, and (c) is the partitioning effect of random seed 2027;

[0048] Figure 7 shows the results of the traditional method and of the inversion model in this embodiment, where (a) is the correlation analysis of the OCO-2 satellite data inversion results against TCCON when using the traditional method, and (b) is the correlation analysis of the inversion model results against TCCON;

[0049] Figure 8 is a schematic diagram of the analysis results for a sudden event.

Detailed Implementation

[0050] The technical solution of the present invention will now be described in detail through specific embodiments. Many specific details are set forth in the following description to provide a thorough understanding of the invention. However, the present invention can be implemented in many other ways different from those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the invention. Therefore, the present invention is not limited to the specific embodiments disclosed below.

[0051] As shown in Figures 1 to 8, the deep learning CO2 near real-time inversion method proposed in this invention, which integrates multi-scale features, inputs the observed spectral data into the inversion model, which can perform near real-time calculations in orbit and output satellite data inversion results.

[0052] The inversion model includes two levels of fully connected projection, a Transformer encoder, a learnable attention pooling, and a regression decoder connected in sequence. Each layer of the encoder includes two sub-layers connected in sequence: a multi-head self-attention layer and a feedforward network layer. Each sub-layer is followed by a residual connection layer and a normalization layer. In the multi-head self-attention layer, rotational position encoding is applied to Q and K before entering the dot product attention. The relative displacement is implicitly injected into the attention weight calculation through phase rotation.

[0053] The enhanced Transformer regression model proposed in this embodiment combines rotational position encoding and attention pooling: first, the temporal representation is extracted through a two-layer fully connected input projection and a multi-layer Transformer encoder, then key information is focused through attention pooling, and finally the CO2 prediction is output by the deep regression head.

[0054] This embodiment, through rigorous co-location data construction and an enhanced Transformer driven by "primary term-residual" decomposition, can significantly improve the accuracy and robustness of satellite CO2 inversion while maintaining physical constraints. Compared with existing studies that mainly rely on physical solutions or do not explicitly handle distribution drift, this embodiment achieves a balance in computational efficiency, error control, and interpretability, providing a feasible path for building a high spatiotemporal resolution CO2 monitoring process for business applications.

[0055] To improve regression performance and prediction accuracy, this embodiment proposes an enhanced Transformer-based inversion model within the standard Transformer framework. This inversion model retains the core architecture while introducing several improvements, enabling it to more fully represent the complex dependencies of the input in a multi-dimensional feature space while maintaining inference efficiency. Specifically, the inversion model consists of modules such as input projection, an enhanced Transformer encoder, learnable attention pooling, and a regression decoder. Furthermore, these enhancements work synergistically in high-dimensional data scenarios: maintaining computational efficiency while further improving the ability to model nonlinear relationships between features.

[0056] I. The inversion model as a whole adopts an end-to-end framework of "input projection - Transformer encoder - attention pooling - regression decoder" to estimate CO2 (ppm). As shown in Figure 2, all inputs are first processed in a standardized space; then, in the inference and evaluation phase, the predictions and labels are uniformly de-standardized back to the physical quantity scale to ensure the physical readability of the indicators and the comparability between different datasets.

[0057] (a1) Input projection;

[0058] To alleviate the representational bottleneck caused by the low dimensionality of the original features, a two-stage fully connected projection is implemented at the input: the first layer consists of a batch normalization layer (BatchNorm), an activation function layer (ReLU), and a regularization layer (Dropout) to improve numerical stability and suppress overfitting; the second layer linearly reduces the dimensionality to the hidden dimension d. Weights are initialized using Xavier Uniform, thereby expanding the effective representation capacity and mitigating the optimization difficulties caused by different dimensions / scales without significantly increasing computational burden. Xavier Uniform initialization is a method for initializing neural network weights, designed to balance the gradient distribution during forward and backward propagation, thus alleviating the vanishing or exploding gradient problem.
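The two-level projection can be sketched as a NumPy forward pass. Several details are assumptions made for illustration: a linear expansion is placed before BatchNorm (the text lists only BatchNorm, ReLU and Dropout for the first level), batch normalization uses batch statistics without learned scale and shift, the layer widths are hypothetical, and the function names are not from the source. The Xavier Uniform bound sqrt(6 / (fan_in + fan_out)) is the standard definition.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: U(-bound, bound) with bound = sqrt(6 / (fan_in + fan_out))."""
    bound = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-bound, bound, size=(fan_in, fan_out))

def input_projection(x, w1, w2, rng=None, drop_p=0.1, eps=1e-5):
    """x: (batch, n_features) -> (batch, d). Pass rng to enable training-mode dropout."""
    h = x @ w1                                                  # assumed linear expansion
    h = (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)     # BatchNorm (batch statistics)
    h = np.maximum(h, 0.0)                                      # ReLU
    if rng is not None:                                         # inverted dropout, training only
        keep = rng.random(h.shape) >= drop_p
        h = h * keep / (1.0 - drop_p)
    return h @ w2                                               # linear map to hidden dimension d
```

For example, with 12 input features, a hypothetical intermediate width of 64 and d = 32, `input_projection(x, xavier_uniform(12, 64, rng), xavier_uniform(64, 32, rng))` maps a `(batch, 12)` array to `(batch, 32)`.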

[0059] Additionally, it should be noted that a StandardScaler is set at the input of the two-level fully connected projection. StandardScaler is a data preprocessing tool that standardizes feature values: it rescales each feature to a mean of 0 and a standard deviation of 1.

[0060] (a2) Transformer encoder;

[0061] The encoding part consists of a stack of Transformer encoder layers, each containing a multi-head attention layer and a feedforward network (FFN). Each sub-layer is followed by an add-and-LayerNorm step to stabilize deep training. Specifically, the multi-head attention layer is followed in sequence by a dropout layer and a residual connection layer; after layer normalization, the result is passed to the feedforward network layer. The feedforward network layer is likewise followed in sequence by a regularization (dropout) layer and a residual connection layer; after layer normalization, the output of the last encoder layer is passed to the learnable attention pooling layer.

[0062] In the multi-head self-attention layer, to characterize the relative position information of the sequence (the feature sequence obtained after input projection), rotational position encoding (RoPE) is applied to Query and Key before entering the dot product attention layer. Phase rotation implicitly injects the relative displacement into the attention weight calculation, avoiding the extra parameters and extrapolation limitations caused by explicit position vectors. The implementation uniformly adopts a sequence-first layout $(S, B, d)$, where $d$ is the feature dimension, $B$ is the batch size, and $S$ is the sequence length, which ensures consistency between the attention and normalization dimensions. A learnable attention pooling is built on top of the Transformer encoder output to adaptively aggregate information along the sequence dimension.

[0063] Specifically, the rotation position encoding (RoPE) involves applying rotation position encoding to Q and K respectively, followed by matrix multiplication. The output features after scaling and activation layers are then multiplied by V to complete the dot product attention calculation.

[0064] (a3) Learnable attention pooling layer;

[0065] Denote the hidden state at the $t$-th time step / sequence position as $h_t$. A scalar score $e_t$ is obtained through nonlinear extraction and data compression (i.e., a linear-Tanh-linear mapping). The weights $\alpha_t$ are obtained by softmax over $e_t$ along the sequence dimension, and the context vector $c = \sum_t \alpha_t h_t$ is then obtained by weighted summation and serves as the aggregated information. When the sequence length $S = 1$, the attention pooling layer naturally degenerates into an identity mapping, thereby maintaining a uniform forward path under short sequence settings such as spectral channel sequences.
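The pooling step can be sketched in NumPy as follows. The parameter names `w1`, `b1`, `w2`, `b2` and the width of the intermediate Tanh layer are assumptions; only the linear-Tanh-linear scoring, the softmax over the sequence dimension and the weighted sum come from the text.

```python
import numpy as np

def attention_pool(h, w1, b1, w2, b2):
    """h: (seq_len, d) encoder outputs. Returns the context vector (d,) and weights (seq_len,)."""
    scores = (np.tanh(h @ w1 + b1) @ w2 + b2)[:, 0]   # linear-Tanh-linear -> one score per position
    scores = scores - scores.max()                    # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()     # softmax along the sequence dimension
    return alpha @ h, alpha                           # weighted sum of hidden states
```

With a single sequence position the softmax returns a weight of exactly 1, so the context vector equals the lone hidden state; this is the degenerate identity case mentioned above.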

[0066] (a4) Regression decoder;

[0067] The regression decoder employs a multilayer perceptron (each hidden layer is equipped with BatchNorm, ReLU activation function, and Dropout regularization), with the final layer providing linear output scalar prediction. The evaluation is denormalized at the same scale as the labels, and the error and consistency index are calculated in ppm to ensure the physical readability and cross-dataset comparability of the results.

[0068] II. Training the inversion model;

[0069] The training process of the inversion model is as follows: using the OCO-2 spectrum as input features and ground-based in-situ observations as labels, a training set is constructed; the training set is projected through two levels of fully connected layers and then input into a Transformer encoder; rotational position encoding is applied to Q and K, and dot product attention is calculated; the encoded features output by the encoder are adaptively aggregated in the sequence dimension through a learnable attention pooling layer, and the prediction results are obtained after passing through a regression decoder and the output head; a loss function is constructed to adjust the trainable parameters of the inversion model.

[0070] (b1) Construction of training, validation, and test sets;

[0071] First, OCO-2 operates in a sun-synchronous orbit (revisiting approximately every 16 days) and carries a coaxial three-channel high-resolution grating spectrometer covering O2-A (0.765 μm), the weak CO2 band (1.61 μm), and the strong CO2 band (2.06 μm). It provides on-orbit "eight-footprint" parallel imaging with approximately 1016 discrete channels per spectral band. Based on this, OCO-2 can provide near-global hyperspectral observations in orbit.

[0072] This embodiment takes satellite-ground "strong constraint" co-location as its starting point and proposes a data-driven CO2 inversion approach that uses satellite observations as features and ground-based in-situ observations as supervision. Specifically, it selects OCO-2 SWIR spectra and auxiliary variables such as geometry and aerosols as inputs, and directly uses TCCON observations with sufficient temporal and spatial overlap with satellite transits as the target output. TCCON (Total Carbon Column Observing Network) is a ground-based high-resolution infrared spectroscopy observation network mainly used to monitor the concentration and distribution of greenhouse gases in the atmosphere. The co-location condition adopts "dual threshold" constraints (a time window not exceeding ±A minutes, e.g., A=30, and a spatial distance not exceeding B degrees, e.g., B=0.5, where A and B are positive numbers) to ensure, as far as possible, that the two measurements sample the same atmospheric volume. After unified quality control and cloud screening, 22,888 high-quality paired samples from 2015 to 2018 were obtained, providing a reliable benchmark for model training and validation.

[0073] Specifically, the supervised data construction employs a cross-validation framework using OCO-2 products as input and TCCON ground observations as labels to build the dataset. To enhance data coverage and generalization capabilities, labels are derived from over twenty TCCON sites distributed globally. A co-location principle of "time ±30 minutes + spatial radius ≤ 0.5°" is used: samples whose footprint centers fall within a 0.5° circle within ±30 minutes of the satellite transit time, centered on the TCCON site's latitude and longitude. Compared to common wider windows (e.g., ±1 hour with latitude ±2° / longitude ±2.5°, or even wider ±2 hours, ±5°), the criteria in this embodiment are more stringent; if multiple candidate footprints exist, the sample with the "closest great circle distance and the most stringent cloud screening" is retained. If the same satellite footprint corresponds to multiple TCCON records, the TCCON data is linearly interpolated to the transit time to reduce label bias caused by time mismatch.
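The "dual threshold" co-location rule can be sketched in plain Python. The record layout (dictionaries with `lat`, `lon`, `time_min`, and a list of `(time_min, xco2)` site records) and all function names are hypothetical, and the 0.5° radius is interpreted as a great-circle angular separation; the ±30 minute window, the 0.5° threshold and the linear interpolation of TCCON records to the transit time come from the text.

```python
import math

def angular_distance_deg(lat1, lon1, lat2, lon2):
    """Great-circle separation in degrees (haversine formula)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))

def interpolate_label(t, records, max_minutes=30.0):
    """Interpolate (time, value) records to time t, using only records inside the time window."""
    near = sorted(r for r in records if abs(r[0] - t) <= max_minutes)
    if not near:
        return None                                   # time threshold not met
    before = [r for r in near if r[0] <= t]
    after = [r for r in near if r[0] >= t]
    if before and after:
        (t0, y0), (t1, y1) = before[-1], after[0]
        return y0 if t1 == t0 else y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    return before[-1][1] if before else after[0][1]   # one-sided: nearest record

def colocate(footprints, site, max_minutes=30.0, max_degrees=0.5):
    """Keep footprints that satisfy both the spatial and the temporal threshold."""
    pairs = []
    for fp in footprints:
        dist = angular_distance_deg(fp["lat"], fp["lon"], site["lat"], site["lon"])
        if dist > max_degrees:
            continue                                  # space threshold not met
        label = interpolate_label(fp["time_min"], site["records"], max_minutes)
        if label is not None:
            pairs.append({"footprint": fp, "label": label, "distance_deg": dist})
    return pairs
```

Selecting the single best footprint when several candidates survive (closest distance, strictest cloud screening) would be an additional sort on `distance_deg` and a cloud flag, which is omitted here for brevity.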

[0074] After co-location, quality control is performed to remove samples that fail the cloud / scattering screening or have low / suspicious spectral quality. Further exclusion is made for records with extreme geometric conditions (such as excessively large solar zenith angles) or high missing measurement rates. Considering that topographic and gauge pressure errors are almost "1:1" transferred to CO2, special attention is paid to suspicious samples related to digital elevation models (DEMs) and air pressure to reduce the risk of systematic bias. At the same time, the differences in quality indicators and screening processes across different retrieval systems (such as RemoTeC) are referenced to ensure consistency between labels and input.

[0075] After the above screening, a total of N=22,888 samples were obtained from 2015 to 2018 for training / internal validation. In 2019, an independent test set of N=6,728 was built and completely isolated on the timeline to closely match the deployment scenario of "training with history - applying in the future".
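A time-isolated split of this kind can be sketched as below. The `year` field, the validation fraction and the function name are hypothetical, and the random shuffle used for the internal training / validation split is an assumption; the text only states that the 2019 test set is isolated from the 2015 to 2018 data on the timeline.

```python
import random

def temporal_split(samples, test_year=2019, val_fraction=0.2, seed=2025):
    """Hold out one later year as the test set; split the earlier years into train / validation."""
    history = [s for s in samples if s["year"] < test_year]
    test = [s for s in samples if s["year"] == test_year]
    rng = random.Random(seed)
    rng.shuffle(history)                       # assumed: random internal split
    n_val = int(round(len(history) * val_fraction))
    return history[n_val:], history[:n_val], test
```

Because the test year is strictly later than every training and validation sample, the evaluation mirrors the "train on history, apply in the future" deployment scenario.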

[0076] Regarding feature construction, as shown in Table 1, the original spectral channels in the dataset were first screened based on radiative transfer interpretability and signal-to-noise ratio: channels at the ends of spectral bands susceptible to edge effects, channels with known instrument anomalies / saturation, and unstable channels with low SNR were removed, retaining only reliable channels for subsequent modeling. Following traditional physical retrieval practice, a set of auxiliary scalar features was introduced to enhance contextual characterization and interpretability: solar zenith angle and azimuth angle, sensor zenith angle and azimuth angle, surface air pressure, total aerosol optical thickness, cloud markers, and the a priori value of XCO2. Cloud markers were used to remove samples significantly contaminated by clouds during the training / validation phase, and the a priori value of XCO2 was used as a stable principal term or weak prior constraint. All numerical input features were standardized using StandardScaler: the mean $\mu$ and standard deviation $\sigma$ were fitted on the training set, and the same $(\mu, \sigma)$ were applied without refitting to the validation and test sets; the labels (TCCON CO2) participate in the loss at the physical quantity scale during training, and are denormalized back to the original physical units during the validation and visualization phases to ensure the physical readability of the metrics.
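The leak-free standardization described above amounts to estimating the statistics once, on the training split only. The NumPy sketch below mirrors what a StandardScaler does (population standard deviation, per feature); the function names are illustrative, and the guard for constant features is an added assumption.

```python
import numpy as np

def fit_standardizer(x_train):
    """Estimate per-feature mean and standard deviation on the training set only."""
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # assumed guard against constant features
    return mu, sigma

def standardize(x, mu, sigma):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return (x - mu) / sigma
```

Validation and test features are passed through `standardize` with the training `mu` and `sigma`, so no information from those splits influences the preprocessing.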

[0077] To enhance the model's robustness to measurement disturbances and unmodeled noise, independent Gaussian noise with a mean of 0 and a standard deviation of 0.02 is injected into the input features only during the training phase, without adding noise to the labels; no noise is injected during the validation and testing phases.
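A minimal sketch of this training-only augmentation, assuming the noise is added to the already standardized features (the function name is illustrative):

```python
import numpy as np

def add_training_noise(x_std, rng, sigma=0.02):
    """Add independent zero-mean Gaussian noise (std 0.02) to the input features only."""
    return x_std + rng.normal(0.0, sigma, size=x_std.shape)
```

The function would be called on each training batch of inputs; labels, validation inputs and test inputs bypass it entirely.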

[0078] Missing values (if present) are first removed at the feature level or handled by simple imputation based on training set statistics (e.g., mean imputation grouped by station / track), and a missing value indicator is retained in the data record for error diagnosis. The entire preprocessing procedure, the standardization statistics $(\mu, \sigma)$, the filtered channel indices, and the missing-value mask rules are all fixed and saved along with the experimental configuration to ensure complete reproduction of the experiment and consistency during cross-set migration.

[0079] Table 1 Feature Types

[0080]

[0081] (b2) For details on the input projection, Transformer encoder, learnable attention pooling layer and regression decoder set in the inversion model, please refer to (a1) to (a4) above.

[0082] (b3) The training process uses mean squared error (MSE) as the objective function so that the optimization objective is consistent with evaluation metrics such as RMSE. The optimizer is AdamW with a set learning rate and weight decay. Furthermore, ReduceLROnPlateau (a strategy for dynamically adjusting the learning rate in deep learning) is used to monitor the validation loss and adaptively reduce the learning rate to improve convergence quality and efficiency.
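The plateau-based schedule can be illustrated with a small framework-free class. This is a simplified stand-in for a scheduler such as PyTorch's `ReduceLROnPlateau`, not that library's code; the initial learning rate, reduction factor, patience and floor below are hypothetical values, since the source does not give them.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss              # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

Calling `step(val_loss)` once per epoch returns the learning rate to use for the next epoch; the rate is only lowered after more than `patience` consecutive epochs without improvement.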

[0083] III. Experimentation and Optimization;

[0084] (b1) Experiment:

[0085] To monitor the training process, the convergence trajectories of the training / validation loss and R² were tracked synchronously. As iterations progressed, the training loss decreased monotonically and plateaued after several rounds, while the validation loss decreased in step with it at a small gap, indicating that the model did not exhibit significant overfitting. On the independent validation set, the reported values were MSE=0.3790, RMSE=0.6156, MAE=0.4407, MAPE=0.1088%, and R²=0.9675. Among these, RMSE² ≈ MSE (0.6156² ≈ 0.3790), indicating good internal consistency; an R² of approximately 0.968 means the model explains nearly 96.8% of the variance, indicating a high overall goodness of fit. In terms of error scales, RMSE / MAE ≈ 1.40 is close to the value for a symmetric Laplace distribution (√2 ≈ 1.41) and above the Gaussian value (≈ 1.25), suggesting that the residual distribution is heavier-tailed than Gaussian but generally under control.
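The consistency checks on the reported metrics can be reproduced with a few lines of arithmetic. The metric values are those reported above; the two reference ratios are the standard RMSE / MAE values for a zero-mean Gaussian, sqrt(pi / 2), and for a symmetric Laplace distribution, sqrt(2), and are included only for comparison.

```python
import math

mse, rmse, mae = 0.3790, 0.6156, 0.4407     # values reported for the validation set

rmse_squared = rmse ** 2                    # should reproduce the MSE
ratio = rmse / mae                          # tail-weight indicator

gaussian_ratio = math.sqrt(math.pi / 2)     # RMSE / MAE for a zero-mean Gaussian
laplace_ratio = math.sqrt(2)                # RMSE / MAE for a symmetric Laplace

print(f"RMSE^2 = {rmse_squared:.4f}, RMSE/MAE = {ratio:.3f}")
print(f"Gaussian reference = {gaussian_ratio:.3f}, Laplace reference = {laplace_ratio:.3f}")
```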

[0086] As shown in Figure 2, the point cloud in Figure 2(a) is highly clustered along the 1:1 line, with a slope close to 1 and an intercept close to 0, and no systematic bends or "funnel-shaped" dispersion are observed, indicating that the model maintains good scale consistency and homoscedasticity across the entire range; a few outliers suggest that heavy-tailed errors still exist under extreme observation geometry or aerosol loading. Figure 2(b) gives the distribution of the absolute error $|\hat{y} - y|$, which is right-skewed: the median absolute error (MdAE) ≈ 0.333 ppm is significantly lower than the mean absolute error (MAE) (0.4407 ppm), indicating that most samples have small errors and that the mean is driven up mainly by a few larger errors. P90 / P95 indicate that 90% and 95% of the absolute errors are less than or equal to 0.9777 ppm and 1.2515 ppm, respectively.
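The summary statistics of the absolute error distribution can be computed as in the NumPy sketch below; the function name and the dictionary layout are illustrative, and NumPy's default linear interpolation is assumed for the percentiles.

```python
import numpy as np

def error_summary(y_true, y_pred):
    """Mean, median, 90th and 95th percentile of the absolute error."""
    abs_err = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    return {
        "MAE": abs_err.mean(),
        "MdAE": np.median(abs_err),
        "P90": np.percentile(abs_err, 90),
        "P95": np.percentile(abs_err, 95),
    }
```

For a right-skewed error distribution the median sits below the mean, because a handful of large errors pull the mean upward while leaving the median almost unchanged.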

[0087] In summary, the convergence of the training-validation curve, the low median absolute error MdAE, and the controlled P90 / P95 all demonstrate that the enhanced Transformer regression model effectively reduces prediction error without sacrificing stability, and possesses good generalization ability and engineering usability.

[0088] To further compare the inversion model of this embodiment with the general FT-Transformer regression model, a side-by-side validation against the standard Transformer baseline model and the general FT-Transformer regression model was carried out. All models were trained with the same data partitioning and preprocessing procedures (Figure 3). As shown in Figure 3(a), the standard Transformer baseline model has a root mean square error (RMSE) of 1.3916 ppm, a mean squared error (MSE) of 1.9366, and an R² of 0.8309. The general FT-Transformer regression model, a representative method for unordered tabular data, is designed to treat input tokens as approximately permutation-invariant and typically uses CLS-style pooling to aggregate feature representations. For a fair comparison, as shown in Figure 3(b), the general FT-Transformer regression model trained under the same conditions has an RMSE of 0.7363 ppm, an MSE of 0.5421, and an R² of 0.9525, providing a quantitative benchmark for evaluating the impact of self-attention mechanisms within the standard aggregation framework. In contrast, the inversion model in this embodiment explicitly incorporates the relative wavelength structure into the self-attention mechanism through Rotary Position Embedding (RoPE) and employs attention-based pooling to achieve content-adaptive representation aggregation. These design choices are more consistent with the ordered nature of the spectral tokens and the screened channel subset. On the same validation set, the improved Transformer (i.e., the inversion model) of this embodiment achieves the best performance, with an MSE of 0.3790, an RMSE of 0.6156 ppm, and an R² of 0.9675 (see Figure 2(a)). Overall, Figure 2(a) provides an intuitive visual impression, while the quantitative metrics provide a rigorous basis for evaluating how the inversion model of this embodiment outperforms the general FT-Transformer regression model and the standard Transformer baseline model under a uniform experimental setting.
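The property that makes RoPE suitable for ordered spectral channels, namely that the attention score depends only on the relative offset between two positions, can be checked with a small pure-Python sketch. The function names, the frequency base of 10000, and the toy query and key vectors are assumptions for illustration and are not specified in this embodiment.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (hypothetical values, even dimension).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))     # positions 5 and 2, offset 3
score_b = dot(rope(q, 12), rope(k, 9))    # positions 12 and 9, offset 3
score_c = dot(rope(q, 5), rope(k, 4))     # positions 5 and 4, offset 1
```

Shifting both positions by the same amount leaves the score unchanged (`score_a` equals `score_b` up to rounding), whereas changing the offset changes the score.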

[0089] (b2) Ablation comparison experiment;

[0090] Table 2 compares, under consistent data and training settings, the model proposed in this embodiment (RoPE + attention pooling) with four ablated variants. In Table 2, "removing position encoding" means removing the position encoding from the model of this embodiment; "removing attention pooling" means removing the attention pooling from the model of this embodiment; "removing attention pooling and position encoding" means removing both; and "using learnable encoding" means replacing the RoPE rotary encoding of this embodiment with a learnable position encoding while keeping everything else unchanged.

[0091] The model proposed in this embodiment achieves the best performance on the metrics consistent with the optimization objective, with MSE and RMSE of 0.3790 and 0.6156, respectively, a coefficient of determination R² of 0.9675, and MAE and MAPE of 0.4407 and 0.1088%, respectively. When positional encoding is removed, the model's ability to represent sequence position information weakens, with RMSE rising to 0.6547 and R² dropping to 0.9631, while MSE and MAE also increase (to 0.4287 and 0.4497, respectively), indicating that relative position information plays a key role in characterizing dependencies along the sequence in this embodiment.

[0092] Removing attention pooling results in the largest performance degradation: MSE and RMSE increase to 0.5100 and 0.7141, respectively, and R² decreases to 0.9551, demonstrating that globally weighted aggregation is indispensable for extracting key information from the encoder output. Notably, when attention pooling and positional encoding are removed simultaneously, MAE and MAPE decrease slightly to 0.4357 and 0.1076%, respectively, but the variance-sensitive MSE and RMSE (0.4051 and 0.6364, respectively), along with the explanatory power R², remain worse than in the full configuration, indicating that this phenomenon reflects a smoothing of the error distribution rather than an improvement in the overall fitting quality. Replacing RoPE with a learnable positional encoding also leads to a performance decline (RMSE of 0.6964, R² of 0.9581). The comparison shows that explicit encoding of relative position information outperforms implicit capture of position signals through parameter learning alone. In summary, the complete design incorporating RoPE and attention pooling achieves the best balance between error magnitude and explanatory power, demonstrating the complementarity and necessity of the two inductive biases in this regression task.
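The globally weighted aggregation examined in this ablation can be sketched as follows. The scoring form `v · tanh(W h + b)` is an assumption chosen for illustration; the embodiment only states that a scalar score is obtained from each hidden state through a nonlinear transformation, followed by softmax along the sequence dimension and a weighted sum. All parameter values below are hypothetical.

```python
import math


def attention_pool(hidden_states, w, b, v):
    """Pool a sequence of hidden vectors into one context vector.

    score_t   = v . tanh(W h_t + b)     (assumed scoring form)
    weight_t  = softmax over t of score_t
    context   = sum_t weight_t * h_t
    """
    scores = []
    for h in hidden_states:
        proj = [math.tanh(sum(wij * hj for wij, hj in zip(row, h)) + bi)
                for row, bi in zip(w, b)]
        scores.append(sum(vi * pi for vi, pi in zip(v, proj)))
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(wt * h[j] for wt, h in zip(weights, hidden_states))
               for j in range(dim)]
    return context, weights


# Toy encoder output: 3 tokens with 2 features each (hypothetical values).
states = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
context, weights = attention_pool(states, identity, [0.0, 0.0], [1.0, 0.0])

# With a zero scoring vector every token gets the same weight (mean pooling).
mean_context, uniform = attention_pool(states, identity, [0.0, 0.0], [0.0, 0.0])
```

Setting the scoring vector to zero reduces the layer to plain mean pooling, which gives one way to view the "removing attention pooling" ablation as a special case.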

[0093] Table 2 Comparison of ablation test indicators

[0094]

[0095] (b3) Cross-domain adaptation and fine-tuning;

[0096] To test the model's temporal generalization ability under conditions closer to real operational scenarios, this embodiment adopts a strict out-of-time independent evaluation design: the co-located OCO-2 and TCCON samples from 2015 to 2018 are used for model training and internal validation, following the process detailed above, while the entire year of 2019 is reserved as an independent test set, which is used here for fine-tuning.

[0097] The complete isolation of training and testing along the timeline effectively avoids the leakage of implicit information caused by seasonal co-phase, duplicate observation sites, or overlapping samples, thus better reflecting the actual deployment scenario of "training with history for future application." This setup also jointly examines covariate drift caused by interannual background, emissions, and circulation pattern changes, as well as concept drift induced by differences in system calibration, meteorological conditions, and data coverage. Within this framework, if the inversion model can maintain stable accuracy and consistent calibration in 2019, it can be considered to have stronger robustness and transferability to out-of-time disturbances.
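The out-of-time isolation described above amounts to splitting samples by calendar year rather than at random. A minimal sketch follows; the record layout (a dictionary with a `date` field) and the function name `out_of_time_split` are hypothetical.

```python
from datetime import date


def out_of_time_split(samples, train_years, test_year):
    """Split samples by calendar year so training and test never overlap in time."""
    train = [s for s in samples if s["date"].year in train_years]
    test = [s for s in samples if s["date"].year == test_year]
    return train, test


# Hypothetical co-located records; only the observation date matters here.
samples = [
    {"date": date(2015, 3, 1)},
    {"date": date(2018, 12, 31)},
    {"date": date(2019, 1, 1)},
    {"date": date(2019, 7, 15)},
]
train, test = out_of_time_split(samples, range(2015, 2019), 2019)
latest_train = max(s["date"] for s in train)
earliest_test = min(s["date"] for s in test)
```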

[0098] The construction of the independent test set during fine-tuning strictly followed the spatiotemporal co-location and quality control process of the training data: satellite-ground pairing required a time window not exceeding ±30 minutes and a spatial radius not exceeding 0.5°, and was implemented in conjunction with cloud flags, spectral quality, and geometric constraints; records that failed any of the conditions or contained missing or anomalous values were removed. Through this screening, 6728 high-quality paired samples were obtained for 2019, serving as the independent evaluation set used only for the final report. It is worth noting that, compared with the 2015–2018 training set, the 2019 label distribution shows a significant interannual shift (the mean during training is approximately 405.12 ppm, while the mean in 2019 is approximately 409.69 ppm, and the standard deviation decreases from 3.40 ppm to 2.16 ppm). This further increases the testing difficulty and allows a more objective measurement of the model's adaptability to interannual background changes. All model selection and hyperparameter determination were limited to the training period (including internal cross-validation or subset reservation). No hyperparameter tuning was conducted on the 2019 data, thus ensuring the independence and reproducibility of the evaluation.
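The dual-threshold pairing rule (time window of ±30 minutes, spatial radius of 0.5°) can be sketched as a simple predicate. Measuring the spatial distance as a Euclidean distance in latitude / longitude degrees is an assumption made here for illustration, as are the function name and the coordinates in the example; the embodiment does not state how the 0.5° radius is computed.

```python
import math
from datetime import datetime, timedelta


def is_colocated(sat_time, sat_lat, sat_lon, site_time, site_lat, site_lon,
                 max_minutes=30.0, max_degrees=0.5):
    """Return True when a satellite sounding and a ground record pass both thresholds."""
    time_ok = abs(sat_time - site_time) <= timedelta(minutes=max_minutes)
    # Wrap the longitude difference into [-180, 180) before measuring distance.
    dlon = (sat_lon - site_lon + 180.0) % 360.0 - 180.0
    distance = math.hypot(sat_lat - site_lat, dlon)   # assumed metric: degrees
    return time_ok and distance <= max_degrees


# Hypothetical sounding and site records.
t0 = datetime(2019, 6, 1, 12, 0)
near = is_colocated(t0, 36.60, -97.49, t0 + timedelta(minutes=20), 36.80, -97.30)
late = is_colocated(t0, 36.60, -97.49, t0 + timedelta(minutes=45), 36.80, -97.30)
far = is_colocated(t0, 36.60, -97.49, t0, 37.40, -97.49)
```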

[0099] To address the significant distribution shift in cross-time OCO-2 / OCO-3 data, this embodiment proposes a modeling and adaptive fine-tuning framework based on a "prior principal term - nonlinear residual" decomposition. The core idea is to fix the relatively stable linear relationship between the prior (a priori) value of XCO2 and the target variable as the interpretable principal term, to have the neural network learn only the nonlinear residuals not covered by the principal term, and to perform lightweight, strictly leak-free adaptive calibration on the target domain, thereby significantly improving cross-domain generalization while maintaining physical prior consistency.

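[0100] Let the source domain dataset be $\mathcal{D}_s = \{(x_i^s, a_i^s, y_i^s)\}_{i=1}^{N_s}$ and the target domain dataset be $\mathcal{D}_t = \{(x_j^t, a_j^t, y_j^t)\}_{j=1}^{N_t}$. Here $x_i^s$ and $x_j^t$ are the $i$-th source and $j$-th target data items, which include the solar zenith angle and azimuth angle, the sensor zenith angle and azimuth angle, surface air pressure, total aerosol optical thickness, and cloud flags; $a_i^s$ and $a_j^t$ are the prior values of XCO2 corresponding to the $i$-th and $j$-th data items; and $y_i^s$ and $y_j^t$ are the labels of the $i$-th and $j$-th data items.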

[0101] where the feature vector $x$ includes physical quantities such as radiance, solar-sensor geometry, albedo, surface air pressure, aerosols, and cloud flags; $a$ is the prior value of XCO2; and $y$ is the label. Analysis shows that, in the target domain, the mean and variance of the features in $x$ exhibit a systematic shift relative to the source domain; the radiance components show heavy-tailed and extreme-value characteristics, and the geometric angles exhibit significant periodicity. Direct regression on the original features often leads to normalization failure and optimization instability, resulting in significant performance degradation during transfer.

[0102] First, a linear principal term of the prior value of XCO2 is fitted in the source domain using least squares: $y \approx \alpha a + \beta$; the fixed term $\alpha a + \beta$ gives a first-order explanation of the label $y$ in terms of the prior value of XCO2,

[0103] where $y$ is the label, $a$ is the prior value of XCO2, and $\alpha$ and $\beta$ are parameters. Specifically, $\alpha$ represents the linear scaling relationship between the prior value $a$ and the label $y$, reflecting the average change in true XCO2 for every unit change in the prior value of XCO2. The parameter $\alpha$ is obtained by least-squares fitting in the source domain and is assumed to remain stable across different observation conditions or geographical regions; it is the interpretable, cross-domain invariant deterministic part of the model. $\beta$ indicates the baseline value of the true XCO2 when the prior value $a = 0$. In practical problems the prior value $a$ is usually not zero, so the main function of $\beta$ is to correct the systematic offset; $\alpha$ and $\beta$ together constitute a first-order linear explanation of the label.

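[0104] Define the residual $r$ of the label $y$ with respect to the linear principal term $\alpha a + \beta$: $r = y - (\alpha a + \beta)$;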

[0105] Estimate the standardization parameters $(\mu_r, \sigma_r)$ of the residual $r$ in the source domain, and from them obtain the standardized target $z$: $z = (r - \mu_r) / \sigma_r$; where $\mu_r$ is the mean of the linear-fit residual $r$ in the source domain, reflecting the degree to which the linear principal term systematically overestimates or underestimates in the source domain, i.e., the overall bias that the linear model fails to capture. Removing $\mu_r$ during standardization allows the subsequent nonlinear correction to focus on the fluctuating part of the residual. $\sigma_r$ is the standard deviation of the residual $r$ in the source domain; it measures the magnitude of the residual variation left after the linear principal term and is used to scale the residuals so that the standardized residual $z$ has zero mean and unit variance, which helps the model learn nonlinear patterns that are stable across domains. The standardized residual $z$ thus represents the normalized residual signal that cannot be explained by the linear principal term.

[0106] The training phase learns only the nonlinear mapping from the features $x$ to the standardized residual $z$; in the inference stage the prediction is restored linearly to obtain the predicted value $\hat{y}$: $\hat{y} = \alpha a + \beta + \mu_r + \sigma_r \hat{z}$, where $\alpha a + \beta$ is the linear principal term of the prior value $a$, $\mu_r$ and $\sigma_r$ are the source-domain mean and standard deviation of the residual, and $\hat{z}$ is the predicted value of the standardized residual, obtained by a nonlinear model (such as a neural network) from the other physical quantities (e.g., solar / sensor geometry, aerosols, cloud flags). The predicted standardized residual represents the complex nonlinear effects that the linear principal term fails to capture; because it has been standardized, the distribution of $z$ is closer to a stable distribution with a mean of 0 and a variance of 1, which reduces the distributional differences between domains and makes it easier to transfer from the source domain to the target domain.

[0107] The decomposition $\hat{y} = (\alpha a + \beta) + \sigma_r \hat{z} + \mu_r$ simplifies a global regression problem that is susceptible to domain shift into a "robust first-order principal term + transferable second-order correction".

[0108] where $\alpha a + \beta$ (with $a$ the prior value of XCO2) corresponds to the robust first-order principal term: it reflects a stable global relationship between the prior and actual values of XCO2 and is characterized by being fixed across domains, strongly interpretable, and resistant to domain shift. $\sigma_r \hat{z}$ (the predicted standardized residual $\hat{z}$ scaled by the source-domain residual standard deviation $\sigma_r$) corresponds to the transferable second-order correction: a nonlinear residual driven by the other physical quantities (solar / sensor geometry, aerosols, cloud flags, etc.) that captures local complex effects; its key feature is that standardization enhances transferability, which is crucial for the model to adapt to different domains. $\mu_r$ (the source-domain residual mean) corresponds to the systematic bias restoration: it maps the standardized residual back to the average level of the original residual to keep the prediction unbiased, and it is a constant offset specific to the source domain.
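The decomposition described in the preceding paragraphs (least-squares principal term, residual standardization, linear restoration at inference) can be sketched in pure Python as follows. The function names, the use of the population standard deviation, and the toy prior / label values are assumptions for illustration; the nonlinear model that predicts the standardized residual is not included, and the true standardized residuals are fed back in only to show that the restoration inverts the standardization.

```python
import statistics


def fit_principal_term(prior, label):
    """Ordinary least squares for label ~ alpha * prior + beta."""
    mean_a, mean_y = statistics.fmean(prior), statistics.fmean(label)
    var_a = sum((a - mean_a) ** 2 for a in prior)
    cov = sum((a - mean_a) * (y - mean_y) for a, y in zip(prior, label))
    alpha = cov / var_a
    beta = mean_y - alpha * mean_a
    return alpha, beta


def standardize_residuals(prior, label, alpha, beta):
    """Residuals of the principal term, standardized with source-domain statistics."""
    r = [y - (alpha * a + beta) for a, y in zip(prior, label)]
    mu = statistics.fmean(r)          # numerically ~0 when beta is fitted on the same data
    sigma = statistics.pstdev(r)      # assumed: population standard deviation
    z = [(ri - mu) / sigma for ri in r]
    return z, mu, sigma


def restore(prior_value, z_hat, alpha, beta, mu, sigma):
    """Inference-time restoration: principal term + offset + scaled residual."""
    return alpha * prior_value + beta + mu + sigma * z_hat


# Toy source-domain data (hypothetical prior and label values in ppm).
prior = [400.0, 402.0, 405.0, 408.0, 410.0]
label = [401.0, 402.5, 406.2, 408.1, 411.0]

alpha, beta = fit_principal_term(prior, label)
z, mu, sigma = standardize_residuals(prior, label, alpha, beta)
restored = [restore(a, zi, alpha, beta, mu, sigma) for a, zi in zip(prior, z)]
```

Because the slope, intercept, mean, and standard deviation are all estimated once on the source domain, the same four numbers can be reused unchanged during fine-tuning and inference.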

[0109] Feature engineering and consistency constraints: To mitigate numerical problems caused by scale and periodicity, the feature transformation employs reversible transformations consistent with the physical mechanism and ensures training-inference consistency. A logarithmic transform is applied to the three radiance channels to suppress heavy tails and improve standardization; sine and cosine expansions are used for the azimuth and zenith angles of the sun and the sensor, so that angles differing by a full period are encoded as one equivalence class; other physical quantities remain numerical and are uniformly imputed with the median and standardized. The order of the feature columns is fixed and the prior value of XCO2 is placed in the last column to eliminate implicit bias caused by column alignment.
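The three transformations named above can be sketched as small helper functions. The exact form of the logarithmic transform is not recoverable from the text, so `log1p` with clipping at zero is used here purely as an assumption; the function names and example values are likewise hypothetical.

```python
import math
import statistics


def encode_angle(degrees):
    """Sine / cosine expansion so that angles one full period apart coincide."""
    radians = math.radians(degrees)
    return math.sin(radians), math.cos(radians)


def log_radiance(value):
    """Heavy-tail suppression; log1p with clipping at zero is an assumed form."""
    return math.log1p(max(value, 0.0))


def impute_median(column):
    """Replace missing entries (None or NaN) with the median of observed values."""
    def missing(v):
        return v is None or math.isnan(v)

    median = statistics.median([v for v in column if not missing(v)])
    return [median if missing(v) else v for v in column]


# Hypothetical example values.
sin0, cos0 = encode_angle(0.0)
sin360, cos360 = encode_angle(360.0)
filled = impute_median([1.0, None, 3.0, float("nan")])
compressed = [log_radiance(v) for v in [0.0, 10.0, 1000.0]]
```

In practice the median (and any normalization statistics) would be fitted on the source domain and then frozen, as the later implementation notes require.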

[0110] Residual regression submodel and adaptive fine-tuning: Residual regression uses a lightweight tabular Transformer whose input is the imputed and standardized features and whose output is the predicted standardized residual $\hat{z}$. Mean squared error is used as the optimization objective, and the model learns only the standardized residual $z$, which avoids directly fitting the mean and scale that change with domain shift.

[0111] To address statistical drift in the target domain, a two-stage fine-tuning strategy is employed: the first stage freezes the Transformer encoder and updates only the output head, using a relatively large learning rate to coarsely calibrate the residual distribution in the target domain. The second stage unfreezes the entire inversion model, refining the parameters with a learning rate lower than the set lower threshold, and incorporating early stopping to suppress overfitting. Considering the differences in scale and noise of the XCO2 prior value across different time periods, a small-amplitude random relative perturbation or random replacement with the sample median is injected into the XCO2 prior value to enhance robustness to the uncertainty of the XCO2 prior value while ensuring the main term trend remains unchanged.
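The prior-value augmentation mentioned at the end of this paragraph (small relative perturbation or random replacement with the sample median) might look like the sketch below. The perturbation amplitude `rel_sigma`, the replacement probability `replace_prob`, the Gaussian form of the perturbation, and the toy prior values are all hypothetical; the embodiment does not state these settings.

```python
import random
import statistics


def perturb_prior(prior_values, rel_sigma=0.001, replace_prob=0.05, seed=0):
    """Augment XCO2 prior values for robustness to prior uncertainty.

    Each value is either replaced by the sample median (with probability
    `replace_prob`) or multiplied by (1 + eps), eps ~ N(0, rel_sigma^2).
    """
    rng = random.Random(seed)
    median = statistics.median(prior_values)
    augmented = []
    for a in prior_values:
        if rng.random() < replace_prob:
            augmented.append(median)
        else:
            augmented.append(a * (1.0 + rng.gauss(0.0, rel_sigma)))
    return augmented


# Hypothetical prior values in ppm.
priors = [404.8, 405.1, 405.6, 406.0, 406.3]
augmented = perturb_prior(priors, seed=2025)
unchanged = perturb_prior(priors, rel_sigma=0.0, replace_prob=0.0)
all_median = perturb_prior(priors, replace_prob=1.0)
```

Keeping the perturbation small relative to the prior value leaves the main-term trend essentially unchanged, which is the stated intent of the augmentation.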

[0112] Two-layer nonlinear calibration: Even after fine-tuning, systematic biases related to space-time or observation geometry may remain. To further correct these biases, a second-layer calibrator is introduced on the target-domain training subset. A weak learner based on histogram gradient boosting regression is used to learn the error correction term. The calibrator input is $[\tilde{x}, \hat{y}^{(1)}]$, where $\tilde{x}$ denotes the standardized features and $\hat{y}^{(1)}$ is the first-layer prediction output after two-stage fine-tuning. The final prediction is:

[0113] $\hat{y}^{\text{final}} = \hat{y}^{(1)} + g([\tilde{x}, \hat{y}^{(1)}])$;

[0114] where $g$ is the correction learned by the calibrator and $\hat{y}^{\text{final}}$ is the prediction after two-layer nonlinear calibration.

[0115] Without compromising physical consistency, this fine-grained data-driven correction of the target-domain residuals can significantly reduce RMSE and improve goodness of fit.

[0116] From the start of the experiment, the target domain samples were divided into mutually exclusive training and held-out (retained) subsets using a fixed random seed. As shown in Figure 4, the horizontal axis runs from 95% down to 5% from left to right and represents the percentage of data used for fine-tuning; blue represents the baseline before fine-tuning, and red represents the result after fine-tuning. Overall, as the percentage of fine-tuning samples increases, the model performance initially improves rapidly and then reaches an "inflection point" at around 30%: beyond 30%, the marginal benefit of adding more fine-tuning data becomes very limited, while training time and computational cost increase approximately linearly with the sample size. Weighing accuracy gains against computational cost, selecting approximately 30% of the data for fine-tuning achieves a better performance-efficiency balance, that is, it obtains robust improvements while avoiding unnecessary resource consumption.

[0117] All training and model selection for the two-stage fine-tuning and the second-layer calibration (including the 5-fold out-of-fold validation of the second layer, where an out-of-fold prediction is a prediction made during cross-validation on data that the model has not seen in training) are limited to the target-domain training subset; the final performance is reported only on the held-out (retained) subset. To improve evaluation stability, the entire process is repeated for multiple random seeds, and the mean and standard deviation of the held-out metrics are reported. The implementation follows the principles of "source domain determination, target domain adaptation, and consistent preprocessing": the imputer and normalizer are fitted on the source domain and then frozen; the logarithm of radiance and the sine and cosine of angles are applied identically in the training and inference phases; the parameters of the linear principal term and of the residual standardization are estimated from the source domain and reused in the fine-tuning and inference stages. Fine-tuning uses Adam (with a larger learning rate in the frozen stage and a smaller learning rate in the full stage, combined with early stopping). The second-layer calibration uses a gradient boosting regressor with early-stopping validation. Its input contains only the standardized features and the first-layer prediction, to avoid explicit or implicit label leakage.
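The out-of-fold mechanism defined above can be illustrated with a short pure-Python sketch. To keep the example dependency-free, the weak learner is a constant-mean predictor, which is a deliberate stand-in and not the histogram gradient boosting regressor used in this embodiment; the interleaved fold assignment, function names, and toy data are also assumptions.

```python
def out_of_fold_predictions(features, targets, fit, n_folds=5):
    """Predict every sample with a model that never saw it during fitting."""
    n = len(targets)
    oof = [None] * n
    for fold in range(n_folds):
        held = [i for i in range(n) if i % n_folds == fold]
        kept = [i for i in range(n) if i % n_folds != fold]
        model = fit([features[i] for i in kept], [targets[i] for i in kept])
        for i in held:
            oof[i] = model(features[i])
    return oof


def fit_mean_correction(features, targets):
    """Stand-in weak learner: predicts the mean target of its training fold."""
    mean = sum(targets) / len(targets)
    return lambda x: mean


# Toy calibration targets (hypothetical first-layer errors).
toy_features = [[float(i)] for i in range(10)]
toy_targets = [float(i) for i in range(10)]
oof = out_of_fold_predictions(toy_features, toy_targets, fit_mean_correction)
```

Each entry of `oof` comes from a model fitted without that sample, so metrics computed on these predictions are free of label leakage within the training subset.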

[0118] Figures 5 and 6 show density scatter plots (hexbin) of observed versus predicted values on the target-domain held-out set, with the three columns corresponding to random seeds 2025, 2026, and 2027, respectively. Figure 5 shows the base model (BASE) before fine-tuning. Figure 6 shows the stacked model after applying "logarithmic radiance transformation + residual modeling + two-stage fine-tuning + second-layer calibration on the target domain". The color scale reflects local sample density, and the red dashed line is the 1:1 reference line. The information box in each panel gives the number of held-out samples N and the key statistics for that column: coefficient of determination (R²), mean error (ME), and root mean square error (RMSE).

[0119] From an overall morphological perspective, before fine-tuning (top row), all three groups exhibited a "fan-shaped / band-shaped" distribution: the point cloud density was high near the mid-to-high value region (approximately 408–412 ppm), but there was a significant systematic deviation relative to the 1:1 line; "tailing" appeared at the low and high value ends, indicating that the model's fit to the dynamic range edges was unstable and that there was heteroscedasticity varying with the value range. After fine-tuning (bottom row), the density peak region significantly converged towards the 1:1 line, and the point cloud as a whole was compressed from a "band-shaped" distribution to a "narrow band" along the diagonal. The tailing at both the high and low ends was suppressed, showing consistent calibration across the entire range. This convergence showed a highly consistent trend across all three seed groups, indicating that the improvement did not depend on specific random partitioning or initialization.

[0120] (b4) Emergency Analysis;

[0121] As a stress test of the model's real-time predictive capabilities under extreme conditions, a sudden CO2 disturbance event was selected for evaluation. The focus was on analyzing the transit observations of OCO-2 as it passed near the fire area, and comparing the model's inversion results with measurements from a reference tower.

[0122] Figure 8 summarizes the analysis results of this event. At the transit point on May 10, 2023, after all training and fine-tuning had been completed (and without any additional parameter tuning on 2023 data, to realistically simulate the forward prediction scenario), the model's predicted column-averaged CO2 (XCO2) was compared with the CO2 measured synchronously at the WGC tower. Denote the model's predicted XCO2 as X_mod (ppm) and the value measured at WGC as X_WGC (ppm). The absolute difference between the two is |X_mod – X_WGC| ≈ 3.3 ppm, corresponding to a relative error of approximately 0.8%. In other words, even under anomalous perturbations caused by wildfires, the model's instantaneous error remains below 1%. This indicates that the model maintains its sensitivity to anomalous signals without significant drift when encountering situations outside the training distribution.

[0123] The model's predicted value was further examined against the distribution of WGC measurements on that day. The WGC record shows a certain diurnal variation on May 10; the model's predicted value falls at the relatively low end of that tower's daytime concentration distribution (this is reasonable, as satellite-observed XCO2 includes free-tropospheric information, while ground-based tower measurements represent boundary-layer concentrations, which may be higher due to surface emissions). As an additional reference, the daily average CO2 concentration at the Mauna Loa site was also included (the site had resumed normal operations by this time). The difference between the model's prediction and the Mauna Loa daily average is approximately –3.38 ppm (≈0.80%), similar in magnitude to its deviation from the WGC value. Considering regional and sampling differences (Mauna Loa represents global background values far from the fire, while WGC is a regional, near-surface observation), the model's result shows reasonable agreement with both.

[0124] To further analyze the model's response behavior, the model was applied, without changing its weights or retraining, to OCO-2 satellite data for the days before and after the wildfire event. The predictions at the WGC transit location were compared for May 9, May 10 (the event day), and May 11, 2023. The increment of the event day relative to the previous day is denoted Δ_{-24h} = XCO2(05/10) – XCO2(05/09), and the increment relative to the next day is denoted Δ_{+24h} = XCO2(05/10) – XCO2(05/11). Both increments are positive and similar in magnitude (on the order of 1–2 ppm). The increment of the event day relative to the average of the two adjacent days was also calculated. The results show that although the background CO2 was basically stable on the days before and after (concentrations were almost the same), there was a clear upward jump on the event day itself. The model captured this jump, producing a higher value on May 10 than on the adjacent days, reflecting a significant anomalous enhancement. The direction of this increase is correct and the magnitude is reasonable, indicating that the model has an interpretable and physically reasonable response to the atmospheric CO2 anomaly caused by the wildfire plume.

[0125] In summary, this wildfire case study demonstrates that even without any retraining using 2023 data, the model can still detect and quantify anomalous CO2 increases with high fidelity. The instantaneous relative error of approximately 0.8% at the event peak, and the accurate characterization of CO2 enhancement on the event day relative to adjacent dates, demonstrate the model's good transferability and stability under extreme, out-of-sample distribution conditions. This further enhances the model's reliability for real-time monitoring of sudden emission events in practical scenarios.

[0126] IV. Comparison of this embodiment with existing methods;

[0127] This embodiment uses OCO-2 satellite data and takes ground-based data as the inversion target. Compared with traditional algorithms, such as inversion algorithms based on nonlinear iterative optimization theory (reference: Xie F, Ren T, Zhao C, Wen Y, Gu Y, Zhou M, Wang P. Fast retrieval of XCO2 over East Asia based on the OCO-2 spectral measurements. Atmosp Meas Techniq Discuss. 2023;17(13):3949–3967), this embodiment can significantly improve both the speed and the correlation with ground data. As shown in Figure 7, Figures 7(a) and 7(b) are based on screened OCO-2 satellite data from 2018–2019: the baseline correlation without learning yields only R² = 0.265, while the coefficient of determination of the fine-tuned model against TCCON increases to R² = 0.963, indicating that the proposed method significantly enhances cross-platform consistency and effectively suppresses systematic and random errors.

[0128] Regarding inference efficiency, a batch prediction experiment on 6728 samples took a total of 3.463 s on an RTX 5070 graphics card (≈1,942.8 samples / s, ≈0.515 ms / sample) and 2.444 s on an R7-9700X CPU (≈2,752.9 samples / s, ≈0.363 ms / sample). This result demonstrates that, under the same model and inference settings, the CPU alone achieves sub-millisecond per-sample latency and high throughput, so the model does not rely on a GPU to meet the requirements of lightweight, near-real-time applications, which enables near-real-time inversion computation on board the satellite.
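The throughput and per-sample latency figures quoted above follow directly from the sample count and the total run times. The short sketch below reproduces that arithmetic using only the numbers stated in this paragraph; the helper name `throughput` is hypothetical.

```python
def throughput(n_samples, total_seconds):
    """Return (samples per second, milliseconds per sample)."""
    return n_samples / total_seconds, 1000.0 * total_seconds / n_samples


# Batch prediction of 6728 samples, timings as reported.
gpu_rate, gpu_ms = throughput(6728, 3.463)   # RTX 5070
cpu_rate, cpu_ms = throughput(6728, 2.444)   # R7-9700X

print(f"GPU: {gpu_rate:.1f} samples/s, {gpu_ms:.3f} ms/sample")
print(f"CPU: {cpu_rate:.1f} samples/s, {cpu_ms:.3f} ms/sample")
```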

[0129] Therefore, this embodiment systematically evaluates the CO2 inversion capability under complex atmospheric backgrounds, focusing on data construction based on strict satellite-ground co-location and on an enhanced Transformer regression framework. By pairing OCO-2 satellite features (spectral and geometric characteristics) with TCCON ground observations within a strict spatiotemporal window (±30 minutes, ≤0.5°), and training the inversion model with consistent preprocessing and rigorous leak-free evaluation, the results show that the proposed combination of the "prior principal term - nonlinear residual" decomposition with RoPE position encoding and attention pooling can stably reduce errors and improve interpretability. On metrics consistent with the optimization objective, the inversion model achieves a sub-ppm RMSE and a high R² in validation / testing (e.g., RMSE ≈ 0.62 ppm, R² ≈ 0.97), indicating that this embodiment can effectively capture nonlinear residual structures while maintaining physical prior consistency, thereby improving the prediction accuracy of column-averaged CO2.

[0130] Based on the above description of the embodiments, those skilled in the art will understand that the deep learning CO2 near real-time inversion method and system integrating multi-scale features described in this embodiment can be implemented in pure software or deployed and run on a general-purpose or spaceborne computer. Based on this essence, the technical solution of this embodiment can be specifically implemented in the form of a software product containing program instructions. This software product can be stored on various payload storage devices or directly deployed as a local or cloud service. The program instructions are used to cause computer devices with processing capabilities—including but not limited to spaceborne computers, personal computers, server clusters, mobile terminals, or other network devices—to execute the steps described in this embodiment.

[0131] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A deep learning-based near real-time CO2 inversion method integrating multi-scale features, characterized in that observed spectral data are input into an inversion model, near real-time calculation is performed in orbit, and a satellite data inversion result is output; the inversion model comprises a two-level fully connected projection, a Transformer encoder, a learnable attention pooling, and a regression decoder connected in sequence; each layer of the encoder comprises two sequentially connected sub-layers, namely a multi-head self-attention layer and a feedforward network layer, and a residual connection layer and a normalization layer are introduced after each sub-layer; in the multi-head self-attention layer, rotary position encoding is applied to the query Q and the key K before they enter the dot-product attention, so that relative displacement is implicitly injected into the attention weight calculation through phase rotation; the learnable attention pooling is used to adaptively aggregate the encoded features output by the encoder along the sequence dimension, the aggregated information being obtained as follows: denoting the hidden state at the $t$-th time step as $h_t$, a scalar score $e_t$ is obtained from $h_t$ through nonlinear extraction and data compression, the weights $w_t$ are obtained by applying softmax to the scores along the sequence dimension, and the context vector $c = \sum_t w_t h_t$ is then obtained by weighted summation as the aggregated information; the training process of the inversion model is as follows: a training set is constructed using the spectral data of OCO-2 as input features and in-situ ground observations as labels; the training set is projected through the two-level fully connected layers and then input into the Transformer encoder; after rotary position encoding is applied to the query Q and the key K, dot-product attention is calculated; the encoded features output by the encoder are adaptively aggregated along the sequence dimension by the learnable attention pooling layer and then passed through the regression decoder and the output head to obtain the prediction result; a loss function is constructed to adjust the trainable parameters of the inversion model; the trained inversion model is fine-tuned as follows: a dataset for model fine-tuning is constructed, and the stable linear relationship between the prior value of XCO2 and the target variable is fixed as an interpretable principal term; the inversion model learns only the nonlinear residuals not covered by the principal term and performs adaptive calibration on the target domain; fixing the stable linear relationship between the prior value of XCO2 and the target variable as the interpretable principal term specifically comprises: letting the source domain dataset be $\mathcal{D}_s = \{(x_i^s, a_i^s, y_i^s)\}_{i=1}^{N_s}$ and the target domain dataset be $\mathcal{D}_t = \{(x_j^t, a_j^t, y_j^t)\}_{j=1}^{N_t}$, where $x_i^s$ and $x_j^t$ are the $i$-th and $j$-th data items, which include the solar zenith angle and azimuth angle, sensor zenith angle and azimuth angle, surface air pressure, total aerosol optical thickness, and cloud flags, $a_i^s$ and $a_j^t$ are the prior values of XCO2 corresponding to the $i$-th and $j$-th data items, and $y_i^s$ and $y_j^t$ are the labels of the $i$-th and $j$-th data items; fitting a linear principal term of the prior value of XCO2 in the source domain using least squares: $y \approx \alpha a + \beta$, the fixed term $\alpha a + \beta$ giving a first-order explanation of the label $y$ in terms of the prior value of XCO2, where $y$ is the label, $\alpha$ and $\beta$ are parameters, and $a$ is the prior value of XCO2; defining the residual $r$: $r = y - (\alpha a + \beta)$; estimating the standardization parameters $(\mu_r, \sigma_r)$ of the residual in the source domain and obtaining from them the standardized residual $z$: $z = (r - \mu_r) / \sigma_r$, where $\mu_r$ and $\sigma_r$ are the mean and standard deviation of the linear-fit residual $r$ in the source domain; learning only the nonlinear mapping to $z$ in the training phase, and linearly restoring it in the inference stage to obtain the predicted value $\hat{y}$: $\hat{y} = \alpha a + \beta + \mu_r + \sigma_r \hat{z}$, where $\hat{z}$ is the predicted value of the standardized residual.

2. The method according to claim 1, characterized in that, in the encoder, applying rotational position encoding to Q and K before they enter dot-product attention specifically comprises: after rotational position encoding is applied to Q and K respectively, the two are matrix-multiplied; the result is passed through a scaling layer and an activation layer, and the output features are then multiplied with V to complete the dot-product attention calculation.
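
The order of operations in claim 2 can be sketched for a single attention head as follows. The sketch assumes the common rotary formulation (pairs of channels rotated by an angle proportional to the position, with a frequency base of 10000), assumes the "activation layer" is a softmax, and uses plain Python lists; none of these details is stated in the claim.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each channel pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in xs]
    total = sum(e)
    return [v / total for v in e]

def rope_attention(Q, K, V):
    """Rotate Q and K, multiply, scale, apply softmax, then multiply with V."""
    d = len(Q[0])
    Qr = [rope(q, m) for m, q in enumerate(Q)]
    Kr = [rope(k, n) for n, k in enumerate(K)]
    out = []
    for q in Qr:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kr]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out
```

Since both vectors are rotated by angles proportional to their own positions, the score between positions m and n depends only on the offset between them, which is the "relative displacement through phase rotation" property referred to in claim 1.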

3. The method according to claim 1, characterized in that the construction of the training set specifically comprises: constructing a cross-validation framework using the OCO-2 spectrum as input features and ground-based in-situ observations as labels; constructing a dataset by applying collocation conditions with a "dual threshold" constraint to the input features and labels, the collocation conditions comprising a time window of no more than ±A minutes and a spatial distance of no more than B degrees, where A and B are positive numbers; dividing the dataset into a training set, a validation set, and a test set, with the test set isolated on the time axis from both the training set and the validation set; screening the original spectral channels in the dataset on the basis of radiative transfer interpretability and signal-to-noise ratio; introducing a set of auxiliary scalar features: solar zenith angle and azimuth angle, sensor zenith angle and azimuth angle, surface air pressure, total aerosol optical thickness, cloud flag, and the prior value of XCO2; injecting independent Gaussian noise with a mean of 0 and a standard deviation of 0.02 into the observed features of the training set, with no noise added to the labels and no noise injected into the validation set or the test set. The labels participate in the loss at the physical quantity scale during training, and are denormalized back to the original physical units during the validation and visualization phases. The mean $\mu$ and standard deviation $\sigma$ are fitted on the training set, and the same pair $(\mu, \sigma)$ is applied unchanged to the validation and test sets.
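
A minimal sketch of three of the data-preparation steps in claim 3 is given here: dual-threshold collocation, standardization fitted on the training set only, and noise injection into training features. The record layout (`t_min`, `lat`, `lon`), the flat-degree distance, and the choice to add noise after standardization are assumptions made for illustration; the claim does not fix them, and the thresholds A and B are left as parameters.

```python
import math
import random
import statistics

def collocate(soundings, ground, max_minutes, max_degrees):
    """Keep (sounding, ground) pairs that pass BOTH the time and distance thresholds."""
    pairs = []
    for s in soundings:
        for g in ground:
            dt = abs(s["t_min"] - g["t_min"])
            # Assumption: plain Euclidean distance in degrees, ignoring the
            # shrinking of longitude spacing with latitude and the date line.
            dist = math.hypot(s["lat"] - g["lat"], s["lon"] - g["lon"])
            if dt <= max_minutes and dist <= max_degrees:
                pairs.append((s, g))
    return pairs

def fit_scaler(train_column):
    """Mean and standard deviation are estimated on the training set only."""
    return statistics.fmean(train_column), statistics.pstdev(train_column)

def standardize(column, mu, sigma):
    """Apply an already fitted (mu, sigma) to any split."""
    return [(v - mu) / sigma for v in column]

def add_noise(column, rng, std=0.02):
    """Independent zero-mean Gaussian noise; used on training features only."""
    return [v + rng.gauss(0.0, std) for v in column]

rng = random.Random(0)
train_feature = [1.0, 2.0, 3.0, 4.0]
test_feature = [2.5, 5.0]

mu, sigma = fit_scaler(train_feature)
train_z = add_noise(standardize(train_feature, mu, sigma), rng)  # noisy
test_z = standardize(test_feature, mu, sigma)                    # clean, same (mu, sigma)
```

Reusing the training-set `(mu, sigma)` on the other splits keeps validation and test statistics out of the fit, which is consistent with the time-isolated test set described in the claim.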

4. The method according to claim 1, characterized in that the adaptive calibration performed on the target domain specifically comprises: a first stage, in which the Transformer encoder is frozen and only the output head is updated, with a learning rate below a set upper limit threshold, to complete the coarse calibration of the residual distribution in the target domain; a second stage, in which the entire inversion model is unfrozen, the parameters are refined with a learning rate below a set learning rate threshold, and early stopping is used to suppress overfitting; and a two-layer nonlinear calibration, in which a second-layer calibrator is introduced on a subset of the target domain training data, and a weak learner based on histogram gradient boosting regression is used to learn the error correction term. The calibrator input is $[\hat{y}^{(1)}, \tilde{x}]$, where $\hat{y}^{(1)}$ is the first-layer prediction after the two-stage fine-tuning and $\tilde{x}$ denotes the input features after interpolation and standardization.
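
The two fine-tuning stages of claim 4 can be illustrated on a deliberately tiny stand-in model. In this sketch a single scalar weight `enc_w` plays the role of the Transformer encoder and `head_w`, `head_b` play the role of the output head; the learning rates, epoch counts, and patience are hypothetical values, and the second-layer gradient boosting calibrator is not covered.

```python
def predict(params, x):
    return params["head_w"] * (params["enc_w"] * x) + params["head_b"]

def mse(params, data):
    return sum((predict(params, x) - y) ** 2 for x, y in data) / len(data)

def sgd_epoch(params, data, lr, trainable):
    """One pass of per-sample gradient descent, updating only the `trainable` keys."""
    for x, y in data:
        err = predict(params, x) - y
        grads = {
            "enc_w": err * params["head_w"] * x,
            "head_w": err * params["enc_w"] * x,
            "head_b": err,
        }
        for name in trainable:
            params[name] -= lr * grads[name]

def stage_one(params, train, lr, epochs):
    """Coarse calibration: the encoder is frozen, only the head moves."""
    for _ in range(epochs):
        sgd_epoch(params, train, lr, ("head_w", "head_b"))

def stage_two(params, train, val, lr, max_epochs, patience):
    """Refinement: everything is unfrozen, with early stopping on validation loss."""
    best, best_loss, stale = dict(params), mse(params, val), 0
    for _ in range(max_epochs):
        sgd_epoch(params, train, lr, ("enc_w", "head_w", "head_b"))
        loss = mse(params, val)
        if loss < best_loss:
            best, best_loss, stale = dict(params), loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    params.update(best)  # roll back to the best validation checkpoint

# Toy target-domain data (made up): y = 1.5 * x + 0.3.
train = [(x, 1.5 * x + 0.3) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
val = [(x, 1.5 * x + 0.3) for x in (0.1, 0.6, 0.9)]
params = {"enc_w": 1.0, "head_w": 1.0, "head_b": 0.0}

loss_start = mse(params, val)
stage_one(params, train, lr=0.05, epochs=50)
enc_after_stage_one = params["enc_w"]
loss_coarse = mse(params, val)
stage_two(params, train, val, lr=0.01, max_epochs=200, patience=5)
loss_final = mse(params, val)
```

Restoring the best checkpoint at the end of the second stage means the refinement can never leave the model worse on the validation set than the coarse calibration did.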

5. The method according to claim 4, characterized in that, during the fine-tuning process, a mechanism is used that injects random relative perturbations into the prior value of XCO2 or randomly replaces it with the sample median, while ensuring that the trend of the principal term remains unchanged.
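
One possible reading of the perturbation mechanism in claim 5 is sketched here. It assumes that the perturbation is multiplicative Gaussian noise on the prior, that replacement happens per sample with a fixed probability, and that the principal-term parameters are fitted beforehand on unperturbed priors and left fixed; the function name and the default magnitudes are hypothetical.

```python
import random
import statistics

def augment_prior(prior, rng, rel_std=0.002, p_replace=0.1):
    """Return a perturbed copy of the XCO2 priors for one fine-tuning pass.

    Each value is either replaced by the sample median (probability p_replace)
    or scaled by (1 + eps) with eps drawn from N(0, rel_std**2).
    """
    median = statistics.median(prior)
    out = []
    for p in prior:
        if rng.random() < p_replace:
            out.append(median)
        else:
            out.append(p * (1.0 + rng.gauss(0.0, rel_std)))
    return out

rng = random.Random(0)
prior = [410.0, 412.0, 414.0]
perturbed = augment_prior(prior, rng)
```

Under this reading the network sees slightly unreliable priors during fine-tuning, which discourages it from leaning on the prior beyond the fixed linear term.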

6. A computer system comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the method according to any one of claims 1-5.