Method for reconstructing high spatial resolution seawater carbon dioxide partial pressure based on multi-source data

By preprocessing multi-source data and parallel processing using the XGBoost model, the problem of low spatiotemporal resolution of seawater surface carbon dioxide partial pressure data was solved, enabling efficient reconstruction and accurate monitoring of marine carbon sinks.

CN115758074B · Active · Publication Date: 2026-06-19 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2022-11-28
Publication Date: 2026-06-19


Abstract

This invention discloses a parallel reconstruction method, apparatus, and medium for high spatial resolution monthly sea surface carbon dioxide partial pressure data based on multi-source data and machine learning. The method employs different parallel strategies for multi-source large datasets with high spatial resolution. First, the time-series observation data is weighted and spatiotemporally averaged and gridded to obtain a monthly gridded sample set. Then, feature data from remote sensing, modeling, and reanalysis are preprocessed to construct a multi-source feature dataset. A sample weight model is constructed using sample frequency, and an XGBoost tree model is established and optimized for feature selection and learning. Finally, the trained model is used to reconstruct the sea surface carbon dioxide partial pressure gridded data. The advantages of this invention lie in leveraging the high spatial resolution of remote sensing data, deeply integrating multi-source data and modern machine learning methods, and developing flexible parallel and learning strategies for imbalanced large datasets. This invention has significant practical application value for understanding the spatiotemporal distribution and trends of sea surface carbon dioxide partial pressure, and for understanding and developing marine carbon sinks.

Description

Technical Field

[0001] This invention belongs to the field of marine remote sensing, specifically relating to a parallel reconstruction method for monthly high spatial resolution sea surface carbon dioxide partial pressure data based on multi-source data and machine learning.

Background Technology

[0002] With increasing fossil fuel emissions and deforestation, human activities have significantly altered the carbon cycle, profoundly affecting the global climate and the biochemical environment. The ocean is a vital natural carbon sink, and its capacity to absorb carbon dioxide from the atmosphere has become a major topic in marine science research. The net flux of carbon dioxide from the atmosphere into the ocean through gas exchange depends on wind speed and on the difference between the partial pressure of carbon dioxide in the atmosphere and that at the sea surface. However, measurements of sea surface carbon dioxide partial pressure rely largely on ship-based and in-situ observations, which are constrained by fixed routes, voyages, and locations. The resulting data have low spatiotemporal resolution, uneven distribution, and weak temporal and spatial consistency, which leads to significant uncertainty in estimates of marine carbon sinks. With advances in software and hardware and the advent of the big data era, high spatial resolution remote sensing data have made it possible to reconstruct sea surface carbon dioxide partial pressure at fine scales, enabling more accurate monitoring and analysis of marine carbon sinks. At the same time, high spatiotemporal resolution has caused the volume of satellite remote sensing data to surge, making it difficult for traditional analysis methods to extract value commensurate with that volume. For machine learning, by contrast, more data is more likely to improve model accuracy; machine learning is therefore indispensable for large datasets. Furthermore, large datasets require flexible parallel processing strategies to reduce time costs.

[0003] However, how to apply machine learning to reconstruct, in parallel, the sea surface carbon dioxide partial pressure data obtained from ocean observations is a technical problem that urgently needs to be solved.

Summary of the Invention

[0004] The purpose of this invention is to overcome the shortcomings of existing methods and to provide a parallel reconstruction method for high spatial resolution monthly sea surface carbon dioxide partial pressure data based on multi-source data and machine learning.

[0005] To achieve the above-mentioned objectives, the present invention provides the following specific technical solution:

[0006] A method for reconstructing high spatial resolution seawater carbon dioxide partial pressure based on multi-source data, comprising the following steps:

[0007] S1: Acquire reanalysis and model data and remote sensing data related to seawater carbon dioxide partial pressure, and preprocess the acquired multi-source feature data. First, upsample the reanalysis and model data to the spatial resolution of the remote sensing data. Second, fill in missing values in the remote sensing data. Finally, add spatial and temporal features and apply numerical transformations to some features to make them spatially continuous, completing the preprocessing and forming a feature dataset. During preprocessing, divide the work into parallel sub-tasks based on monthly data patches and remove invalid values in advance to reduce the data volume, so that different sub-tasks are processed in parallel.

[0008] S2: Based on the measured observation data of sea surface carbon dioxide partial pressure, grid all collection cruises using a cruise-weighted average method, with the grid resolution matching the spatial resolution of the remote sensing data, to obtain a monthly observation sample set of sea surface carbon dioxide partial pressure. Then spatially match the samples in the monthly observation sample set with the feature dataset to obtain a sample feature set. When constructing the monthly observation sample set and the sample feature set, divide the work into parallel sub-tasks on a monthly basis according to observation time, so that different sub-tasks are processed in parallel.

[0009] S3: Based on spatiotemporal scale and statistical analysis, establish mixed inverse-frequency sample weights for XGBoost model training, and select features in the feature dataset according to the feature importance of the pre-trained model; then, based on the selected features, further optimize the XGBoost model using hyperparameter search and ten-fold cross-validation.

[0010] S4: Using the trained XGBoost model, predict and reconstruct the monthly surface carbon dioxide partial pressure distribution of seawater at high spatial resolution; and divide the prediction process into parallel sub-tasks based on patches divided by month and meridian, so as to achieve parallel processing between different sub-tasks.

[0011] Based on the above technical solution, each step is preferably implemented in the following specific ways. The preferred implementation methods for each step can be combined appropriately without conflict, and this does not constitute a limitation.

[0012] Preferably, the specific method for forming the feature dataset through preprocessing in step S1 is as follows:

[0013] S11: Model and reanalysis data comprising sea surface salinity, chlorophyll a, seawater mixed layer depth, dry-air carbon dioxide mole fraction, and wind speed at 10 m above the sea surface are used as feature data, together with remote sensing data comprising sea surface temperature. The model and reanalysis data are upsampled to the spatial resolution of the remote sensing data, and the spatial coordinate system is unified. At the same time, the temporal resolution of all feature data is unified to monthly, with daily data converted by calculating the monthly average.

[0014] S12: Locate missing values (NaN) in the sea surface temperature remote sensing data and fill them in using the corresponding values from the full-coverage reanalysis data;

[0015] S13: Add latitude and longitude as spatial features and month as a temporal feature to the feature dataset; apply a trigonometric transformation to longitude and a logarithmic transformation to seawater mixed layer depth and chlorophyll a concentration, so that these features are spatially smooth and continuous.
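The transformations in S13 can be illustrated with a minimal sketch, assuming NumPy. The text specifies only "trigonometric" and "logarithmic" transformations, so the sine/cosine pair for longitude, the base-10 logarithm, and the function name `transform_features` are assumptions made for illustration.

```python
import numpy as np

def transform_features(lon_deg, mld_m, chl_mg_m3):
    """Return (sin_lon, cos_lon, log_mld, log_chl) as float arrays."""
    lon_rad = np.deg2rad(np.asarray(lon_deg, dtype=float))
    # A sine/cosine pair maps -180 and +180 degrees to the same point,
    # removing the artificial jump at the date line (assumed encoding).
    sin_lon = np.sin(lon_rad)
    cos_lon = np.cos(lon_rad)
    # Log transform compresses the long right tail of both variables
    # (base 10 is an assumption; inputs must be strictly positive).
    log_mld = np.log10(np.asarray(mld_m, dtype=float))
    log_chl = np.log10(np.asarray(chl_mg_m3, dtype=float))
    return sin_lon, cos_lon, log_mld, log_chl

sin_lon, cos_lon, log_mld, log_chl = transform_features(
    lon_deg=[-180.0, 180.0, 0.0],
    mld_m=[10.0, 100.0, 1000.0],
    chl_mg_m3=[0.01, 0.1, 1.0],
)
```

With this encoding, grid cells on either side of the 180° meridian receive identical longitude features, which is one way to obtain the spatial continuity the step asks for.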

[0016] Preferably, during the construction of the feature dataset, the single-time-step data of each month are divided evenly into patches, and each patch serves as the smallest unit of the parallel sub-tasks. A process pool is used to remove land and other invalid grid cells in parallel; the valid cells are converted from matrix to vector form, and the row and column numbers of each cell in the original matrix are recorded.
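The patch-wise removal of invalid cells can be sketched as follows, assuming NumPy and assuming that invalid cells (land, missing data) are marked as NaN. The helper names `split_columns` and `vectorize_patch` are hypothetical, and the patches are processed serially here; in the described method each patch would be a sub-task submitted to a process pool.

```python
import numpy as np

def split_columns(n_cols, n_patches):
    """Evenly split column indices into contiguous (start, stop) ranges."""
    edges = np.linspace(0, n_cols, n_patches + 1).astype(int)
    return list(zip(edges[:-1], edges[1:]))

def vectorize_patch(grid, col_start, col_stop):
    """Drop NaN cells of one patch; return values with their grid indices."""
    patch = grid[:, col_start:col_stop]
    rows, cols = np.nonzero(~np.isnan(patch))
    values = patch[rows, cols]
    # Shift patch-local column numbers back to full-grid column numbers.
    return values, rows, cols + col_start

grid = np.array([[1.0, np.nan, 3.0, 4.0],
                 [np.nan, 6.0, np.nan, 8.0]])
patches = [vectorize_patch(grid, start, stop)
           for start, stop in split_columns(grid.shape[1], n_patches=2)]
```

Keeping the row and column numbers alongside each value vector is what later allows a prediction vector to be scattered back into its original matrix position.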

[0017] Preferably, the specific method of step S2 is as follows:

[0018] S21: The measured sea surface carbon dioxide partial pressure dataset is divided into a series of subsets according to three dimensions: year, month, and spatial grid, where the resolution of the spatial grid is the same as that of the remote sensing data. Let $D_{y,m,r,c}$ denote the subset of data observed in year $y$ and month $m$ that falls in the grid cell in the $r$-th row and $c$-th column of the spatial grid. For a subset $D_{y,m,r,c}$ containing a total of $C$ cruises $\{T_j \mid j = 1, 2, \ldots, C\}$, calculate the average value of the observation samples from each cruise:

[0019] $$\overline{p}_{j} = \frac{1}{N_j} \sum_{i=1}^{N_j} p_{j,i}$$

[0020] In the formula, $N_j$ indicates the total number of observation points on cruise $T_j$, and $p_{j,i}$ indicates the $i$-th observation point on cruise $T_j$;

[0021] S22: Perform a second average over the averages of all cruises in the subset, and take the resulting average $\overline{p}_{y,m,r,c}$ as the sample value of the spatial grid cell corresponding to this subset:

[0022] $$\overline{p}_{y,m,r,c} = \frac{1}{C} \sum_{j=1}^{C} \overline{p}_{j}$$

[0023] After dividing all the subsets of the measured carbon dioxide partial pressure data at the sea surface into sub-tasks on a monthly basis, the calculations in S21 and S22 were completed through parallel processing to obtain a monthly observation sample set of carbon dioxide partial pressure at the sea surface with the same grid accuracy and spatial resolution as the remote sensing data.

[0024] S23: Divide the data in the monthly observation sample set into subsets on a monthly basis, and use a process pool to process the spatial matching process between each subset and the feature dataset in parallel to obtain a sample feature set; each sample in the obtained sample feature set contains multiple feature data and observation data of sea surface carbon dioxide partial pressure, which are used to train the XGBoost model.
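The two-stage averaging of S21 and S22 can be sketched in plain Python. The tuple layout of the observations and the function name `cruise_weighted_grid` are assumptions for illustration; the logic is simply a mean of per-cruise means within each (year, month, row, column) cell.

```python
from collections import defaultdict

def cruise_weighted_grid(observations):
    """observations: iterable of (year, month, row, col, cruise_id, pco2).

    Returns {(year, month, row, col): mean of the per-cruise means}.
    """
    # Stage 1 (S21): collect the observations of each cruise in each cell.
    per_cruise = defaultdict(list)
    for year, month, row, col, cruise_id, pco2 in observations:
        per_cruise[(year, month, row, col, cruise_id)].append(pco2)

    # Per-cruise mean, grouped by cell.
    cruise_means = defaultdict(list)
    for (year, month, row, col, _cruise), values in per_cruise.items():
        cruise_means[(year, month, row, col)].append(sum(values) / len(values))

    # Stage 2 (S22): average the cruise means so each cruise counts once.
    return {cell: sum(means) / len(means) for cell, means in cruise_means.items()}

obs = [
    (2010, 1, 5, 7, "A", 350.0),
    (2010, 1, 5, 7, "A", 360.0),
    (2010, 1, 5, 7, "A", 370.0),
    (2010, 1, 5, 7, "B", 400.0),
]
gridded = cruise_weighted_grid(obs)
```

In this toy input the plain mean of the four observations would be 370, whereas the cruise-weighted value is 380, because cruise A's three closely spaced points are not allowed to outweigh cruise B's single point.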

[0025] Preferably, the specific method of step S3 is as follows:

[0026] S31: Based on spatiotemporal scales and statistical analysis, calculate the reciprocal of the sample density within the block to which each sample belongs in the three dimensions of time, space, and value range, and thereby establish the mixed inverse-frequency sample weight, which is defined as:

[0027]

[0028] where the three component terms represent the spatial weight, temporal weight, and statistical-value weight of the sample, respectively, and δ represents the weight scaling factor.

[0029] The spatial weight of a sample is calculated as follows: the monthly observation dataset is divided into subsets according to year, month, and spatial grid, and the spatial weight of the samples in a subset is determined by the number of samples in that subset, using the formula:

[0030]

[0031] In the formula, the operator applied to the subset is a counting function;

[0032] The temporal weight of a sample is calculated as follows: the monthly observation dataset is divided into subsets according to month and spatial grid, and the temporal weight is determined by the number of grid cells in the corresponding month's subset, using the formula:

[0033]

[0034] The statistical-value weight of a sample is calculated as follows: define the bias d to describe the interval in which the sample value falls within the value range of the overall sample set, and divide the monthly observation dataset into subsets based on d; the statistical-value weight is determined by the number of samples in the corresponding subset, using the formula:

[0035]

[0036] S32: Using the mixed inverse-frequency sample weights established above together with the feature data and observation data in the samples, the XGBoost model is pre-trained multiple times. Based on the averaged result of the pre-training runs, the number of times each feature other than the spatiotemporal features is used as a split node in the trees serves as the evaluation criterion; features with lower importance are filtered out, and the key features are retained as model input.

[0037] S33: Ten-fold cross-validation is used to perform grid search on the hyperparameters of the XGBoost model. Taking into account hardware and computational costs, the model hyperparameters that enable the model to achieve the target accuracy are selected.

[0038] S34: Combine the sample feature set obtained in S23, the mixed inverse-frequency sample weights from S31, the features selected in S32, and the model hyperparameters from S33 to train the XGBoost model under multiple random seeds, resulting in multiple XGBoost models for prediction.
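The inverse-frequency idea behind S31 can be illustrated for the value-range dimension alone. This is only a rough sketch, assuming NumPy: the fixed-width binning, the `bin_width` parameter, and the function name are assumptions, and the spatial and temporal terms, their combination, and the scaling factor δ are deliberately not implemented.

```python
import numpy as np

def inverse_frequency_weights(values, bin_width):
    """Weight each sample by 1 / (number of samples in its value bin)."""
    values = np.asarray(values, dtype=float)
    # Assign each sample to a fixed-width interval of the value range.
    bins = np.floor((values - values.min()) / bin_width).astype(int)
    counts = np.bincount(bins)
    # Samples in crowded intervals get small weights, rare ones large weights.
    return 1.0 / counts[bins]

pco2 = np.array([350.0, 352.0, 355.0, 358.0, 420.0])
weights = inverse_frequency_weights(pco2, bin_width=10.0)
```

Weights of this kind could then be supplied at training time, for example through the `sample_weight` argument of `xgboost.XGBRegressor.fit`, so that rare high or low values are not swamped by the densely sampled middle of the distribution.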

[0039] Preferably, in step S33, the hyperparameters of the model subjected to grid search include the maximum tree depth and the minimum leaf weight.
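A grid search with ten-fold cross-validation, as in S33, can be sketched generically. To keep the example runnable with NumPy alone, the learner is a stand-in polynomial fit rather than XGBoost; `kfold_indices`, `grid_search`, and `poly_fit_predict` are hypothetical names. With XGBoost, the grid would presumably range over `max_depth` and `min_child_weight`, and `fit_predict` would wrap the training and prediction calls.

```python
import itertools
import numpy as np

def kfold_indices(n_samples, n_splits=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled k-fold cross-validation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for k in range(n_splits):
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train, folds[k]

def grid_search(X, y, param_grid, fit_predict, n_splits=10):
    """Return (best_params, best_rmse) over the Cartesian product of the grid."""
    best_params, best_rmse = None, np.inf
    names = sorted(param_grid)
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        sq_err = 0.0
        # Pool the squared errors of all held-out folds for this setting.
        for train, valid in kfold_indices(len(y), n_splits):
            pred = fit_predict(X[train], y[train], X[valid], **params)
            sq_err += np.sum((pred - y[valid]) ** 2)
        rmse = float(np.sqrt(sq_err / len(y)))
        if rmse < best_rmse:
            best_params, best_rmse = params, rmse
    return best_params, best_rmse

def poly_fit_predict(X_train, y_train, X_valid, degree):
    """Stand-in learner (NOT XGBoost): least-squares polynomial fit."""
    coef = np.polyfit(X_train[:, 0], y_train, degree)
    return np.polyval(coef, X_valid[:, 0])

X = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = X[:, 0] ** 2
best_params, best_rmse = grid_search(X, y, {"degree": [0, 1, 2]}, poly_fit_predict)
```

The same folds are reused for every parameter combination, so the comparison between settings is not affected by how the data happened to be split.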

[0040] Preferably, the specific method of step S4 is as follows:

[0041] For the model prediction task of monthly sea surface carbon dioxide partial pressure distribution, the task is divided into parallel subtasks based on monthly and meridional patches. The subtasks are processed in parallel. During the processing, each trained XGBoost model in S34 is used to predict the monthly sea surface carbon dioxide partial pressure distribution based on the feature data corresponding to each subtask. The average of the prediction results of all XGBoost models is taken as the final reconstruction result, thus reconstructing the monthly sea surface carbon dioxide partial pressure distribution with high spatial resolution.
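Averaging the predictions of the models trained under different random seeds can be sketched as below, assuming NumPy. The models are represented as plain callables so the example runs on its own; with XGBoost each callable would wrap a trained model's `predict` method. The name `ensemble_predict` is hypothetical.

```python
import numpy as np

def ensemble_predict(models, features):
    """Average the predictions of several models for one patch of features."""
    predictions = np.stack([model(features) for model in models])
    return predictions.mean(axis=0)

# Stand-in "models": the same rule with a different constant offset each,
# mimicking models that differ only through their random seed.
models = [lambda f, b=b: f.sum(axis=1) + b for b in (-1.0, 0.0, 1.0)]
patch_features = np.array([[1.0, 2.0], [3.0, 4.0]])
reconstructed = ensemble_predict(models, patch_features)
```

Because each patch is predicted independently, this function can be called once per (month, meridional patch) sub-task without any shared state between sub-tasks.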

[0042] Preferably, when the model prediction sub-tasks are executed in parallel, a counter records the prediction progress of the patches. When all patch tasks for a single time step are completed, the matrix restoration task is submitted automatically; that is, the prediction result vectors are restored into a matrix according to the recorded row and column numbers.
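The counter-driven restoration can be sketched as a small helper class, assuming NumPy. The class name `MonthAssembler` and its interface are hypothetical; the sketch shows only the bookkeeping, not the process pool itself.

```python
import numpy as np

class MonthAssembler:
    """Collect patch predictions for one month and rebuild the full grid."""

    def __init__(self, shape, n_patches):
        self.grid = np.full(shape, np.nan)  # invalid cells stay NaN
        self.remaining = n_patches          # counter of outstanding patches

    def add_patch(self, values, rows, cols):
        """Scatter one prediction vector; return True once the month is complete."""
        self.grid[rows, cols] = values
        self.remaining -= 1
        return self.remaining == 0

assembler = MonthAssembler(shape=(2, 4), n_patches=2)
done_first = assembler.add_patch(np.array([1.0, 6.0]),
                                 np.array([0, 1]), np.array([0, 1]))
done_second = assembler.add_patch(np.array([3.0, 4.0, 8.0]),
                                  np.array([0, 0, 1]), np.array([2, 3, 3]))
```

In a parallel run, `add_patch` would typically be called in the parent process as each sub-task finishes, for instance from a `concurrent.futures.Future.add_done_callback` handler; if callbacks can arrive from several threads, the counter update would need a lock.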

[0043] Compared with the prior art, the present invention has the following advantages:

[0044] This invention employs different parallel strategies for multi-source big data with high spatial resolution. First, the time-series observation data is weighted and spatiotemporally averaged and gridded to obtain a monthly gridded sample set. Then, feature data from remote sensing, modeling, and reanalysis are preprocessed to construct a multi-source feature dataset. A sample weight model is built using sample frequency, and an XGBoost tree model is established and optimized for feature selection and learning. Finally, the trained model is used to reconstruct the sea surface carbon dioxide partial pressure gridded data. The advantages of this invention lie in leveraging the high spatial resolution of remote sensing data, deeply integrating multi-source data and modern machine learning methods, and developing flexible parallel and learning strategies for imbalanced large datasets. This invention has significant practical application value for understanding the spatiotemporal distribution and trends of sea surface carbon dioxide partial pressure, and for understanding and developing marine carbon sinks.

Attached Figure Description

[0045] Figure 1 is a flowchart of the parallel reconstruction method for high spatial resolution monthly sea surface carbon dioxide partial pressure data based on multi-source data and machine learning;

[0046] Figure 2 is a schematic diagram of the nearest neighbor resampling method;

[0047] Figure 3 is a flowchart of the parallel gridding of observation samples;

[0048] Figure 4 is a flowchart of the parallel preprocessing of the feature set and the parallel prediction and reconstruction;

[0049] Figure 5 is a statistical chart of the average importance of each feature in the pre-trained model;

[0050] Figure 6 is a statistical chart of the average importance of each feature in the trained model;

[0051] Figure 7 is a statistical chart of the model's fitting accuracy in each year.

Detailed Implementation

[0052] The present invention will be further described and illustrated below with reference to the accompanying drawings and specific embodiments.

[0053] As shown in Figure 1, a preferred embodiment of the present invention provides a method for reconstructing high spatial resolution seawater carbon dioxide partial pressure based on multi-source data, the steps of which are as follows:

[0054] S1: Acquire reanalysis and model data and remote sensing data related to seawater carbon dioxide partial pressure, and preprocess the acquired multi-source feature data. First, upsample the reanalysis and model data to the spatial resolution of the remote sensing data. Second, fill in missing values in the remote sensing data. Finally, add spatial and temporal features and apply numerical transformations to some features to make them spatially continuous, completing the preprocessing and forming a feature dataset. During preprocessing, divide the work into parallel sub-tasks based on monthly data patches and remove invalid values in advance to reduce the data volume, so that different sub-tasks are processed in parallel.

[0055] In an embodiment of the present invention, the specific method for forming the feature dataset through preprocessing in step S1 above is as follows:

[0056] S11: Model and reanalysis data comprising sea surface salinity, chlorophyll a, seawater mixed layer depth, dry-air carbon dioxide mole fraction, and wind speed at 10 m above the sea surface are used as feature data, together with remote sensing data comprising sea surface temperature. The model and reanalysis data are upsampled to the spatial resolution of the remote sensing data, and the spatial coordinate system is unified. At the same time, the temporal resolution of all feature data is unified to monthly, with daily data converted by calculating the monthly average.

[0057] S12: Locate missing values (NaN) in the sea surface temperature remote sensing data and fill them in using the corresponding values from the full-coverage reanalysis data;

[0058] S13: Add latitude and longitude as spatial features and month as a temporal feature to the feature dataset; apply a trigonometric transformation to longitude and a logarithmic transformation to seawater mixed layer depth and chlorophyll a concentration, so that these features are spatially smooth and continuous.
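The gap filling of step S12 amounts to an element-wise substitution once both fields share the same grid. A minimal sketch, assuming NumPy and assuming the reanalysis field has already been upsampled to the remote sensing grid; the function name is hypothetical.

```python
import numpy as np

def fill_missing_sst(sst_remote, sst_reanalysis):
    """Replace NaN cells of the remote sensing SST with reanalysis values."""
    sst_remote = np.asarray(sst_remote, dtype=float)
    sst_reanalysis = np.asarray(sst_reanalysis, dtype=float)
    # Keep the satellite value wherever it exists; fall back otherwise.
    return np.where(np.isnan(sst_remote), sst_reanalysis, sst_remote)

sst_remote = np.array([[20.1, np.nan], [np.nan, 18.4]])
sst_reanalysis = np.array([[20.0, 19.5], [19.0, 18.0]])
sst_filled = fill_missing_sst(sst_remote, sst_reanalysis)
```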

[0059] In addition, to process large volumes of data more efficiently, during the construction of the feature dataset the single-time-step data of each month are divided evenly into patches, and each patch serves as the smallest unit of the parallel sub-tasks. A process pool is used to remove land and other invalid grid cells in parallel; the valid cells are converted from matrix to vector form, and the row and column numbers of each cell in the original matrix are recorded.

[0060] S2: Based on the measured observation data of sea surface carbon dioxide partial pressure, all collection cruises are gridded using a cruise-weighted average method, with the grid resolution matching the spatial resolution of the remote sensing data, to obtain a monthly observation sample set of sea surface carbon dioxide partial pressure. The samples in the monthly observation sample set are then spatially matched with the feature dataset to obtain a sample feature set. When constructing the monthly observation sample set and the sample feature set, the work is divided into parallel sub-tasks on a monthly basis according to observation time, so that different sub-tasks are processed in parallel.

[0061] In an embodiment of the present invention, the specific method of step S2 described above is as follows:

[0062] S21: The measured sea surface carbon dioxide partial pressure dataset is divided into a series of subsets according to three dimensions: year, month, and spatial grid, where the resolution of the spatial grid is the same as that of the remote sensing data. Let $D_{y,m,r,c}$ denote the subset of data observed in year $y$ and month $m$ that falls in the grid cell in the $r$-th row and $c$-th column of the spatial grid. For a subset $D_{y,m,r,c}$ containing a total of $C$ cruises $\{T_j \mid j = 1, 2, \ldots, C\}$, calculate the average value of the observation samples from each cruise:

[0063] $$\overline{p}_{j} = \frac{1}{N_j} \sum_{i=1}^{N_j} p_{j,i}$$

[0064] In the formula, $N_j$ indicates the total number of observation points on cruise $T_j$, and $p_{j,i}$ indicates the $i$-th observation point on cruise $T_j$;

[0065] S22: Perform a second average over the averages of all cruises in the subset, and take the resulting average $\overline{p}_{y,m,r,c}$ as the sample value of the spatial grid cell corresponding to this subset:

[0066] $$\overline{p}_{y,m,r,c} = \frac{1}{C} \sum_{j=1}^{C} \overline{p}_{j}$$

[0067] After dividing all the subsets of the measured carbon dioxide partial pressure data at the sea surface into sub-tasks on a monthly basis, the calculations in S21 and S22 were completed through parallel processing to obtain a monthly observation sample set of carbon dioxide partial pressure at the sea surface with the same grid accuracy and spatial resolution as the remote sensing data.

[0068] S23: Divide the data in the monthly observation sample set into subsets on a monthly basis, and use a process pool to process the spatial matching process between each subset and the feature dataset in parallel to obtain a sample feature set; each sample in the obtained sample feature set contains multiple feature data and observation data of sea surface carbon dioxide partial pressure, which are used to train the XGBoost model.
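The spatial matching of S23 can be sketched as an index lookup, assuming NumPy and assuming that each monthly sample carries the row and column numbers of its grid cell. The feature names and the function name `match_samples` are hypothetical.

```python
import numpy as np

def match_samples(rows, cols, feature_grids):
    """Look up every feature grid at the sample cells of one month.

    Returns the feature names (sorted) and a matrix with one row per sample.
    """
    names = sorted(feature_grids)
    X = np.column_stack([feature_grids[name][rows, cols] for name in names])
    return names, X

feature_grids = {
    "sst": np.array([[10.0, 11.0], [12.0, 13.0]]),
    "sss": np.array([[35.0, 35.1], [35.2, 35.3]]),
}
sample_rows = np.array([0, 1])
sample_cols = np.array([1, 0])
names, X = match_samples(sample_rows, sample_cols, feature_grids)
```

Since each month only touches its own feature grids, one such call per month is a natural unit of work for the process pool mentioned in S23.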

[0069] S3: Based on spatiotemporal scale and statistical analysis, mixed inverse-frequency sample weights are established for XGBoost model training, and features in the feature dataset are selected according to the feature importance of the pre-trained model; then, based on the selected features, the XGBoost model is further optimized using hyperparameter search and ten-fold cross-validation.

[0070] In an embodiment of the present invention, the specific method of step S3 above is as follows:

[0071] S31: Based on spatiotemporal scales and statistical analysis, calculate the reciprocal of the sample density within the block to which each sample belongs in the three dimensions of time, space, and value range, and thereby establish the mixed inverse-frequency sample weight, which is defined as:

[0072]

[0073] where the three component terms represent the spatial weight, temporal weight, and statistical-value weight of the sample, respectively, and δ represents the weight scaling factor.

[0074] The spatial weight of a sample is calculated as follows: the monthly observation dataset is divided into subsets according to year, month, and spatial grid, and the spatial weight of the samples in a subset is determined by the number of samples in that subset, using the formula:

[0075]

[0076] In the formula, the operator applied to the subset is a counting function;

[0077] The temporal weight of a sample is calculated as follows: the monthly observation dataset is divided into subsets according to month and spatial grid, and the temporal weight is determined by the number of grid cells in the corresponding month's subset, using the formula:

[0078]

[0079] The statistical-value weight of a sample is calculated as follows: define the bias d to describe the interval in which the sample value falls within the value range of the overall sample set, and divide the monthly observation dataset into subsets based on d; the statistical-value weight is determined by the number of samples in the corresponding subset, using the formula:

[0080]

[0081] S32: Using the mixed inverse-frequency sample weights established above together with the feature data and observation data in the samples, the XGBoost model is pre-trained multiple times. Based on the averaged result of the pre-training runs, the number of times each feature other than the spatiotemporal features is used as a split node in the trees serves as the evaluation criterion; features with lower importance are filtered out, and the key features are retained as model input.

[0082] S33: Ten-fold cross-validation is used to perform a grid search on the hyperparameters of the XGBoost model. Taking into account hardware and computational costs, the hyperparameters that enable the model to achieve the target accuracy are selected. In this embodiment, the hyperparameters used for grid search include the maximum tree depth and the minimum leaf weight.

[0083] S34: Combine the sample feature set obtained in S23, the mixed inverse-frequency sample weights from S31, the features selected in S32, and the model hyperparameters from S33 to train the XGBoost model under multiple random seeds, resulting in multiple XGBoost models for prediction.
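The split-count feature selection of S32 can be sketched in plain Python. The per-run dictionaries below are made-up stand-ins shaped like the output of XGBoost's `Booster.get_score(importance_type="weight")`, which maps each feature name to the number of times it is used to split; the feature names, the threshold `min_mean_splits`, and the function name are all hypothetical.

```python
from collections import defaultdict

def select_features(split_counts_per_run, always_keep, min_mean_splits):
    """Average per-feature split counts over runs and keep the important ones."""
    totals = defaultdict(float)
    for counts in split_counts_per_run:
        for name, n_splits in counts.items():
            totals[name] += n_splits
    n_runs = len(split_counts_per_run)
    mean_counts = {name: total / n_runs for name, total in totals.items()}
    # Spatiotemporal features are exempt from filtering and always retained.
    ranked = sorted(mean_counts.items(), key=lambda item: -item[1])
    kept = [name for name, mean in ranked
            if name not in always_keep and mean >= min_mean_splits]
    return list(always_keep) + kept, mean_counts

runs = [
    {"sst": 120, "sss": 80, "chl": 60, "rrs_412": 4, "lat": 150},
    {"sst": 110, "sss": 90, "chl": 50, "lat": 140},  # unused features are absent
]
selected, mean_counts = select_features(runs, always_keep=("lat",),
                                        min_mean_splits=10)
```

Averaging over several pre-training runs makes the ranking less sensitive to the random seed of any single run.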

[0084] S4: Using the trained XGBoost model, predict and reconstruct the monthly surface carbon dioxide partial pressure distribution of seawater at high spatial resolution; and divide the prediction process into parallel sub-tasks based on patches divided by month and meridian, so as to achieve parallel processing between different sub-tasks.

[0085] In an embodiment of the present invention, the specific method of step S4 above is as follows:

[0086] For the model prediction task of monthly sea surface carbon dioxide partial pressure distribution, the task is divided into parallel subtasks based on monthly and meridional patches. The subtasks are processed in parallel. During the processing, each trained XGBoost model in S34 is used to predict the monthly sea surface carbon dioxide partial pressure distribution based on the feature data corresponding to each subtask. The average of the prediction results of all XGBoost models is taken as the final reconstruction result, thus reconstructing the monthly sea surface carbon dioxide partial pressure distribution with high spatial resolution.

[0087] When the model prediction sub-tasks are executed in parallel, a counter records the prediction progress of the patches. When all patch tasks for a single time step are completed, the matrix restoration task is submitted automatically; that is, the prediction result vectors are restored into a matrix according to the recorded row and column numbers.

[0088] The following specific embodiment will demonstrate the implementation method, technical effects, and principles of the above method.

[0089] Example

[0090] In this embodiment, the parallel reconstruction method for high spatial resolution monthly sea surface carbon dioxide partial pressure data based on multi-source data and machine learning mainly includes four steps, namely steps 1 to 4:

[0091] Step 1: Preprocess the multi-source feature data. First, the reanalysis and model data are upsampled to the same spatial resolution as the remote sensing data. Second, missing values in the sea surface temperature remote sensing data are filled in. Finally, spatiotemporal features are added, and some features are numerically transformed. During this process, the work is divided into parallel sub-tasks based on monthly data patches, and invalid values are removed in advance to reduce the data volume.

[0092] In this embodiment, the multi-source data is preprocessed according to steps 11 to 14, and a multi-source feature dataset is constructed in parallel:

[0093] Step 11: Upsample model and reanalysis data such as sea surface salinity, chlorophyll a, seawater mixed layer depth, dry-air carbon dioxide mole fraction, and sea surface wind speed (10 m) to the spatial resolution of the sea surface temperature and optical properties remote sensing data, and unify the spatial coordinate system. At the same time, calculate monthly averages of the daily data to unify the temporal resolution. In this example, the nearest neighbor pixel method is used for upsampling, as shown in Figure 2; that is, the value of a grid cell is taken from whichever of the four adjacent grid cells has the closest center point;

[0094] Step 12: Fill in the missing values (NaN) in the sea surface temperature remote sensing data using the corresponding values from the full-coverage reanalysis data;

[0095] Step 13: In the feature dataset, add latitude and longitude as spatial features and month as a temporal feature; on a global scale, apply a trigonometric transformation to longitude and a logarithmic transformation to seawater mixed layer depth and chlorophyll a concentration.

[0096] Step 14: Divide the single-time-step data of each month evenly into 20 patches along the meridian. Using the patch as the smallest unit, remove land and other invalid grid cells in parallel with a process pool. Convert the valid cells from matrix to vector form and record the row and column numbers of each cell in the original matrix.
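The nearest neighbor upsampling of Step 11 can be sketched as follows, assuming NumPy. The sketch further assumes that the fine grid is aligned with the coarse grid and that the resolution ratio is an integer `factor`; real products with offset or non-integer ratios would need coordinate-based lookup instead. The function name is hypothetical.

```python
import numpy as np

def nearest_neighbor_upsample(coarse, factor):
    """Upsample a 2-D grid by an integer factor using nearest-neighbor lookup."""
    n_rows, n_cols = coarse.shape
    # Center of every fine cell, expressed in coarse-cell index units.
    fine_r = (np.arange(n_rows * factor) + 0.5) / factor
    fine_c = (np.arange(n_cols * factor) + 0.5) / factor
    # The coarse cell containing a fine center is also the one whose own
    # center is closest to it (true for aligned grids, as assumed here).
    src_r = np.floor(fine_r).astype(int)
    src_c = np.floor(fine_c).astype(int)
    return coarse[np.ix_(src_r, src_c)]

coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
fine = nearest_neighbor_upsample(coarse, factor=2)
```

Nearest neighbor lookup does not create new values, so the upsampled field keeps the exact values of the source product and introduces no smoothing across fronts or coastlines.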

[0097] Step 2: Correct the sea surface carbon dioxide fugacity data to obtain partial pressure values, divide the work into parallel sub-tasks on a monthly basis, and grid the data using the cruise-weighted average method to obtain the monthly observation sample set. Then, perform spatial matching with the feature dataset to obtain the sample feature set.

[0098] Since the actual measured data may be seawater surface carbon dioxide fugacity data, it can be converted into carbon dioxide partial pressure using the following formula:

[0099] $$pCO_2 = fCO_2 \cdot \exp\left(-\frac{P_{atm}\,(B + 2\delta)}{R\,T}\right)$$

[0100] where $pCO_2$ indicates the partial pressure of carbon dioxide in the sea surface layer; $fCO_2$ represents the carbon dioxide fugacity in the sea surface layer; $P_{atm}$ is the atmospheric pressure (Pa); $R$ is the ideal gas constant, with a value of 8.314 J·mol⁻¹·K⁻¹; and $B$ and $\delta$ are correction factors (m³·mol⁻¹) related to the temperature $T$ (K), calculated as follows:

[0101] $$B = \left(b_0 + b_1 T + b_2 T^2 + b_3 T^3\right) \times 10^{-6}$$

[0102] $$\delta = \left(57.7 - 0.118\,T\right) \times 10^{-6}$$

[0103] where $b_0 = -1636.75$, $b_1 = 12.0408$, $b_2 = -3.27957 \times 10^{-2}$, and $b_3 = 3.16528 \times 10^{-5}$.
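The fugacity correction can be sketched numerically under stated assumptions. The coefficients b0 to b3 and the value of R are taken from the text; the exponential form, the cross-virial term δ = 57.7 − 0.118 T (cm³·mol⁻¹), and the factor 10⁻⁶ converting cm³ to m³ follow the widely used Weiss (1974) correction and are assumed here rather than confirmed by the text. The function name and the default pressure of one standard atmosphere are also illustrative.

```python
import math

R = 8.314  # ideal gas constant, J mol^-1 K^-1 (value given in the text)
B_COEFFS = (-1636.75, 12.0408, -3.27957e-2, 3.16528e-5)  # b0..b3 from the text

def fco2_to_pco2(fco2, temp_k, pressure_pa=101325.0):
    """Convert CO2 fugacity to partial pressure (same unit as the input)."""
    b0, b1, b2, b3 = B_COEFFS
    # Virial coefficient B, converted from cm^3/mol to m^3/mol (assumed units).
    B = (b0 + b1 * temp_k + b2 * temp_k**2 + b3 * temp_k**3) * 1e-6
    # Cross-virial term for CO2 in air (assumed Weiss 1974 form).
    delta = (57.7 - 0.118 * temp_k) * 1e-6
    return fco2 * math.exp(-pressure_pa * (B + 2.0 * delta) / (R * temp_k))

pco2 = fco2_to_pco2(380.0, temp_k=298.15)
```

At 25 °C and one atmosphere the correction is small, a few tenths of a percent, with the partial pressure slightly higher than the fugacity.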

[0104] Based on this, the monthly sample feature set is constructed in parallel according to steps 21 to 23:

[0105] Step 21: Correct the sea surface carbon dioxide fugacity observation dataset to partial pressure values, and divide the dataset into subsets by year, month, and high spatial resolution grid cell. Then, according to the $C$ cruises on which the samples within a subset were collected, divide the subset into $\{T_j \mid j = 1, 2, \ldots, C\}$ and calculate the average value of the observation samples for each cruise, where $N_j$ is the number of observation points on cruise $T_j$ and $p_{j,i}$ is its $i$-th observation:

[0106] $$\overline{p}_{j} = \frac{1}{N_j} \sum_{i=1}^{N_j} p_{j,i}$$

[0107] Step 22: Perform a second average over the average values of all cruises; the resulting value is the sample value of the spatial cell for that month:

[0108] $$\overline{p} = \frac{1}{C} \sum_{j=1}^{C} \overline{p}_{j}$$

[0109] Step 23: Divide the monthly observation sample data from Step 22 into subsets by month, and use a process pool to perform the spatial matching between each subset and the feature set in parallel. The process is shown in Figure 3. Samples with missing optical features are retained.

[0110] Step 3: Sample mixing inverse frequency weights are established based on spatial grids and statistical analysis for XGBoost model training, and feature selection is performed according to the feature importance of the pre-trained model. Furthermore, hyperparameter search and 10-fold cross-validation are used to further optimize the model in terms of both cost and accuracy.

[0111] The partial pressure of carbon dioxide in the sea surface is a function of temperature, salinity, and alkalinity, and is closely related to marine physical, biological processes, and the chemical environment. Therefore, sea surface temperature, sea surface salinity, chlorophyll a, and the depth of the mixed layer are selected as the direct physical, chemical, and biological factors affecting the partial pressure of carbon dioxide in the sea surface. The mole fraction of carbon dioxide in the dry atmosphere and the 10-meter wind speed at the sea surface are considered as influencing factors on air-sea carbon dioxide exchange. Simultaneously, raw remote sensing reflectance and the optical properties of the ocean surface are also used as feature elements to provide more information about the physical, chemical, biological, and other processes at the sea surface, aiming to achieve better fitting results at high spatial resolution.

[0112] The XGBoost model combines ensemble learning with gradient boosting. It builds multiple tree-based models from multiple features and sums their predictions to improve predictive performance. Each base model fits the residual left by the preceding base models, that is, the loss between the current prediction and the actual value. The negative gradient of the loss function with respect to the current prediction is used as the basis for generating each new tree, which rapidly reduces systematic error. In addition, the XGBoost model integrates sample weights, sparsity-aware feature learning, and overfitting-prevention strategies. Through parallel optimization, caching, and out-of-core computation, it significantly improves training and prediction efficiency, making it suitable for large, imbalanced, high-spatial-resolution datasets of sea surface carbon dioxide partial pressure and able to handle cases where remote sensing data are partially missing.

[0113] For a dataset $D = \{(x_i, y_i)\}$ $(i = 1, 2, \ldots, n)$, where $x_i$ ($x_i \in \mathbb{R}^m$) is the feature vector of a sample and $y_i$ ($y_i \in \mathbb{R}$) is the sample label value, the XGBoost model makes predictions using an additive model of K base models:

[0114] $$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

[0115] where $\hat{y}_i$ is the predicted value of the model, $K$ is the number of tree models, $f_k$ is the k-th tree model, and $F$ is the hypothesis space of the base models. The objective function $L$ is defined as:

[0116] $$L = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

[0117] where $l(y_i, \hat{y}_i)$ is the prediction error for the i-th sample. The complexity of the k-th tree model is defined as:

[0118] $$\Omega(f_k) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2$$

[0119] where $T$ is the number of leaf nodes, $w$ is the vector of leaf weight scores, and $\gamma$ and $\lambda$ are the weight coefficients of the penalty terms.
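
To make the additive model concrete, the following toy sketch boosts two-leaf regression trees (stumps) on a single feature under a squared-error loss, for which the negative gradient equals the residual. It is plain gradient boosting written for illustration, not the XGBoost algorithm itself: it has no second-order terms, no regularised split gain, and no sparsity handling. The learning rate, the data, and all function names are assumptions.

```python
def fit_stump(xs, residuals):
    """Best single-split tree (two leaves) under squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both leaves non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        w_left, w_right = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - w_left) ** 2 for r in left) + sum((r - w_right) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, w_left, w_right)
    return best[1:]  # (threshold, left leaf weight, right leaf weight)


def predict_stump(stump, x):
    threshold, w_left, w_right = stump
    return w_left if x <= threshold else w_right


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Each new stump fits the residuals (negative gradient) of the current sum."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, trees, preds


def complexity(stump, gamma=0.0, lam=1.0):
    """Penalty gamma * (number of leaves) + 0.5 * lambda * sum of squared leaf weights."""
    _, w_left, w_right = stump
    return gamma * 2 + 0.5 * lam * (w_left**2 + w_right**2)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.0, 1.0, 5.0, 5.0, 9.0]
base, trees, fitted = boost(xs, ys)
mse_before = mse(ys, [base] * len(ys))
mse_after = mse(ys, fitted)
```

The training error falls as stumps are added, because each one removes part of the remaining residual. The `complexity` helper evaluates the penalty term for one stump.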

[0120] Based on the above theory, in this embodiment, the specific method for constructing the XGBoost model in step 3 is as follows:

[0121] Step 31: Based on a certain spatiotemporal scale and statistical analysis, calculate the reciprocal of the sample density within the block to which the sample belongs in the three dimensions of time, space, and value range to establish the sample mixture inverse frequency weights. Defined as:

[0122]

[0123] where the three component weights are the spatial weight, temporal weight, and statistical value weight of the sample, respectively, and δ is the weight scaling factor.

[0124] The monthly observation dataset is divided into sub-datasets by year, month, and 1° grid cell; the spatial weight of each sample in a sub-dataset is then determined by the number of samples in that sub-dataset, i.e.:

[0125]

[0126] The dataset is also divided into sub-datasets by month and 1° grid cell. The temporal weight is determined by the number of sub-datasets of the corresponding month for that grid cell, i.e.:

[0127]

[0128] The bias d is defined as the integer part of the difference between the sample value and the mean of all samples, divided by the standard deviation of all samples. The dataset is then divided into subsets according to d, and the statistical value weight is determined by the number of samples in the corresponding subset. In practice, d can be adjusted according to the value range of the data distribution, i.e.:

[0129]

[0130]

[0131]

[0132]
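
A rough sketch of the three inverse-frequency components is given below. The way they are combined here (their product raised to the power δ, then rescaled to a mean of 1) is an assumption made for illustration, since the defining formula is not reproduced in this text. Counting samples per (month, cell) group for the temporal weight is likewise only one possible reading. The field names and the function name are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def mixed_inverse_frequency_weights(samples, delta=1.0):
    """samples: list of dicts with keys 'year', 'month', 'cell', 'value',
    where 'cell' identifies the 1-degree block the sample falls in."""
    values = [s["value"] for s in samples]
    mu, sigma = mean(values), pstdev(values)

    def bias(value):
        # d: integer part of (value - mean) / standard deviation.
        return int((value - mu) / sigma) if sigma > 0 else 0

    n_space = Counter((s["year"], s["month"], s["cell"]) for s in samples)
    n_time = Counter((s["month"], s["cell"]) for s in samples)
    n_value = Counter(bias(s["value"]) for s in samples)

    raw = []
    for s in samples:
        w_space = 1.0 / n_space[(s["year"], s["month"], s["cell"])]
        w_time = 1.0 / n_time[(s["month"], s["cell"])]
        w_value = 1.0 / n_value[bias(s["value"])]
        # ASSUMED combination rule; the source formula is not available here.
        raw.append((w_space * w_time * w_value) ** delta)

    scale = len(raw) / sum(raw)  # rescale so that the weights average to 1
    return [w * scale for w in raw]


samples = [
    {"year": 2010, "month": 1, "cell": "a", "value": 360.0},
    {"year": 2010, "month": 1, "cell": "a", "value": 361.0},
    {"year": 2010, "month": 1, "cell": "a", "value": 362.0},
    {"year": 2010, "month": 1, "cell": "b", "value": 420.0},
]
weights = mixed_inverse_frequency_weights(samples)
```

The lone sample in the sparsely observed block "b", which also has an unusual value, receives a much larger weight than the three samples in block "a". Weights of this kind could be passed to a learner through its per-sample weight argument.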

[0133] Step 32: Take the average result of multiple pre-training sessions, and use the number of times all features except spatiotemporal features are used as split tree nodes in the model as the evaluation criterion to filter out features with lower importance.
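
The selection rule of Step 32 can be illustrated with a small helper that averages split counts over several pre-training runs. The input is assumed to be one `{feature: split count}` dictionary per run; in the xgboost Python package such counts are returned by `Booster.get_score(importance_type="weight")`. The threshold `min_avg_splits`, the feature names, and the numbers are hypothetical.

```python
def select_features(runs, spatiotemporal, min_avg_splits):
    """Average split counts over runs and drop low-importance features.

    runs: list of {feature_name: times used as a split node}, one per run.
    spatiotemporal: names that are always kept (they are not evaluated).
    min_avg_splits: hypothetical cut-off on the averaged split count.
    """
    names = set().union(*runs)
    avg = {name: sum(run.get(name, 0) for run in runs) / len(runs) for name in names}
    kept = sorted(
        name for name in names
        if name in spatiotemporal or avg[name] >= min_avg_splits
    )
    return kept, avg


# Two illustrative pre-training runs with made-up split counts.
runs = [
    {"sst": 900, "chl": 500, "rrs_670": 10, "lat": 300},
    {"sst": 1100, "chl": 700, "lat": 280},
]
kept, avg = select_features(runs, {"lat", "lon_sin", "month"}, min_avg_splits=100)
```

Here `rrs_670` averages five splits per run and is dropped, while `lat` is kept regardless of its count because it is a spatial feature.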

[0134] Step 33: Using the root mean square error (RMSE) as the metric, first search for the optimal number of iterations (n_estimator) within a small range. On this basis, use ten-fold cross-validation to perform a grid search over two parameters: maximum tree depth (max_depth) and minimum leaf weight (min_child_weight). Weighing hardware and computational costs, select the parameters that allow the model to achieve higher accuracy.
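
The search procedure of Step 33 can be sketched independently of the learner. The code below runs a ten-fold cross-validated grid search that scores each parameter combination by mean RMSE. To stay self-contained it uses a deliberately simple stand-in model (mean of y per bin of a one-dimensional x) instead of XGBoost; its parameters `n_bins` and `min_count` only loosely play the roles of max_depth and min_child_weight. All names and data are assumptions.

```python
import itertools
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def grid_search_cv(fit, X, y, grid, k=10):
    """Return (best parameters, {combination: mean RMSE over the k folds})."""
    folds = k_fold_indices(len(X), k)
    results = {}
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            scores.append(rmse([y[i] for i in held_out], [model(X[i]) for i in held_out]))
        results[combo] = sum(scores) / len(scores)
    best = min(results, key=results.get)
    return dict(zip(grid.keys(), best)), results


def fit_binned_mean(X, y, n_bins, min_count):
    """Stand-in model (NOT XGBoost): per-bin mean with a global-mean fallback."""
    lo, hi = min(X), max(X)
    width = (hi - lo) / n_bins or 1.0
    global_mean = sum(y) / len(y)
    sums, counts = [0.0] * n_bins, [0] * n_bins

    def bin_of(x):
        return min(n_bins - 1, max(0, int((x - lo) / width)))

    for xi, yi in zip(X, y):
        sums[bin_of(xi)] += yi
        counts[bin_of(xi)] += 1
    means = [s / c if c >= min_count else global_mean for s, c in zip(sums, counts)]
    return lambda x: means[bin_of(x)]


X = [i / 100 for i in range(200)]
y = [math.sin(3 * x) for x in X]
best_params, cv_results = grid_search_cv(fit_binned_mean, X, y, {"n_bins": [1, 8], "min_count": [1]})
```

With a real gradient-boosting learner, `fit` would wrap model training with the sample weights. The cost consideration mentioned in Step 33 would then be applied when choosing among combinations with similar scores.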

[0135] Step 34: Using the sample feature set from Step 23, the sample weight model from Step 31, the feature elements selected in Step 32, and the model hyperparameters from Step 33, train the model under multiple random seeds to obtain multiple XGBoost models.

[0136] Step 4: Use the trained model to predict and reconstruct the monthly surface carbon dioxide partial pressure distribution of seawater at high spatial resolution. In the process, parallel subtasks are divided into units based on patches defined by month and meridian.

[0137] Therefore, through steps 1 to 3 described above, a model and sample feature dataset describing the nonlinear relationship between features and the partial pressure of carbon dioxide in the sea surface have been obtained. The specific reconstruction method for step 4 is briefly described below:

[0138] The model prediction subtasks are executed in parallel, with the meridional-patch feature vector set of the single-time data from step 15 as the unit. When the tasks for a time step are submitted, a counter for that time step is set to 20. At the end of each subtask, the counter for the corresponding time step is decremented by 1. When the counter reaches zero, the last subtask for that time step has completed, and a matrix reconstruction task is automatically submitted to the process pool; this task rebuilds the prediction result vector into a matrix according to the row and column numbers. The overall parallel strategy is shown in Figure 4. In this invention, since there are multiple XGBoost models, the average of the prediction results of all XGBoost models from step 34 is used as the final reconstruction result.
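
The counter-driven scheduling can be sketched as below. For portability the example uses a thread pool from the Python standard library as a stand-in for the process pool described above, and simple callables in place of trained XGBoost models. The function names and data layout are assumptions.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N_PATCHES = 20  # counter start value per time step, as stated in the text


def predict_patch(models, patch):
    """Predict every cell of one patch and average over the model ensemble."""
    results = []
    for row, col, features in patch:
        preds = [model(features) for model in models]
        results.append((row, col, sum(preds) / len(preds)))
    return results


def rebuild_matrix(results, n_rows, n_cols):
    """Restore the prediction vector to a matrix using the recorded row/col numbers."""
    matrix = [[float("nan")] * n_cols for _ in range(n_rows)]
    for row, col, value in results:
        matrix[row][col] = value
    return matrix


def reconstruct(models, patches_by_month, n_rows, n_cols, workers=4):
    counters = {month: len(p) for month, p in patches_by_month.items()}
    collected = {month: [] for month in patches_by_month}
    rebuild_futures = {}
    lock = threading.Lock()
    month_done = threading.Semaphore(0)
    pool = ThreadPoolExecutor(max_workers=workers)

    def on_patch_done(month, future):
        with lock:
            collected[month].extend(future.result())
            counters[month] -= 1  # one fewer outstanding patch for this month
            last_one = counters[month] == 0
        if last_one:  # counter reached zero: submit the matrix rebuild task
            rebuild_futures[month] = pool.submit(
                rebuild_matrix, collected[month], n_rows, n_cols)
            month_done.release()

    for month, patches in patches_by_month.items():
        for patch in patches:
            future = pool.submit(predict_patch, models, patch)
            future.add_done_callback(lambda f, m=month: on_patch_done(m, f))

    for _ in patches_by_month:  # wait until every month has queued its rebuild
        month_done.acquire()
    matrices = {month: f.result() for month, f in rebuild_futures.items()}
    pool.shutdown()
    return matrices


# Two stand-in "models" and one month split into 20 single-column patches.
models = [lambda feats: feats[0], lambda feats: feats[0] + 2.0]
patches = [[(r, k, [float(k)]) for r in range(2)] for k in range(N_PATCHES)]
maps = reconstruct(models, {"2000-01": patches}, n_rows=2, n_cols=N_PATCHES)
```

Each month keeps its own counter, so the rebuild task for one month can start while patches of other months are still being predicted.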

[0139] The following example, using the reconstruction of global surface carbon dioxide partial pressure from 2000 to 2018, illustrates the specific data and results of each step described above:

[0140] 1) The SOCAT (Surface Ocean CO2 Atlas) sea surface carbon dioxide fugacity dataset, version 2020, was used as the source of sea surface carbon dioxide partial pressure observations. The ESA (European Space Agency) OC-CCI (Ocean Colour Climate Change Initiative) 4 km × 4 km merged product was used as the high-spatial-resolution satellite remote sensing data; its band products are listed in Table 1 below. Sea surface temperature was taken from the monthly average L3 product of the MODIS satellite; sea surface salinity and mixed layer depth from the daily data of ECCO2 (Estimating the Circulation and Climate of the Ocean, Phase II) Cube92; 10 m sea surface wind speed and spatiotemporally complete SST from the ERA5 single-level monthly averaged dataset; and dry atmospheric carbon dioxide mole fraction from the daily data of GML (Global Monitoring Laboratory) CarbonTracker CT2019B.

[0141] Table 1

[0142]

[0143] 2) Following step 1 above, the sea surface salinity, chlorophyll a, seawater mixed layer depth, dry atmospheric carbon dioxide mole fraction, and 10 m sea surface wind speed data are upsampled to a 4 km spatial resolution using nearest-neighbor linear interpolation. At the same time, the monthly averages of sea surface salinity, seawater mixed layer depth, and dry atmospheric carbon dioxide mole fraction are calculated. Gaps in the MODIS sea surface temperature remote sensing data are filled with ERA5 reanalysis data. Latitude, longitude, and month are added to the feature set; a trigonometric transformation is applied to the longitude, and a base-10 logarithmic transformation is applied to the seawater mixed layer depth and the chlorophyll a concentration.
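
The numerical transformations in this step can be illustrated as follows. Encoding longitude as a sine and cosine pair is an assumption, since the text only states that a trigonometric transformation is used. The feature names and units are hypothetical.

```python
import math


def transform_features(lon_deg, lat_deg, month, mld_m, chl_mg_m3):
    """Build the transformed spatial, temporal, and log-scaled features."""
    lon_rad = math.radians(lon_deg)
    return {
        # ASSUMPTION: sine/cosine pair, which is continuous across the date line.
        "lon_sin": math.sin(lon_rad),
        "lon_cos": math.cos(lon_rad),
        "lat": lat_deg,
        "month": month,
        # Base-10 logarithms, as stated in the text.
        "log10_mld": math.log10(mld_m),
        "log10_chl": math.log10(chl_mg_m3),
    }


west = transform_features(-180.0, 10.0, 6, 100.0, 0.1)
east = transform_features(180.0, 10.0, 6, 100.0, 0.1)
```

The two calls describe the same meridian. After the transformation their longitude features coincide up to floating-point rounding, whereas the raw values −180 and 180 would look maximally different to a tree model.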

[0144] 3) Following step 2 above, the sea surface carbon dioxide fugacity observation dataset was corrected to partial pressure values. Based on the EXPOCODE field of the dataset, a weighted average method was used to grid the dataset into a 4km × 4km monthly sample set. After weighted averaging, the average standard deviation of the observed samples within the grid cells of the monthly sea surface carbon dioxide partial pressure observation set decreased from 3.00 μatm to 1.78 μatm. The final matched sample feature set contained 3.68 million data points, which were randomly divided into training and prediction sets at an 8:2 ratio.

[0145] 4) Following step 3 above, spatiotemporal scale and statistical analysis are performed on a 1° grid. The reciprocal of the sample density within the block to which each sample belongs is calculated in the three dimensions of time, space, and value range to establish the sample mixing inverse frequency weights. The results of six pre-training runs were averaged (as shown in Figure 5), and 24 features (the darker parts) were retained based on the importance of chlorophyll a concentration. Using RMSE as the metric, and with a preliminary optimal n_estimator value of 100, a 10-fold cross-validation grid search was performed to determine max_depth and min_child_weight. Considering hardware and computational costs, n_estimator was set to 20,000, max_depth to 25, and min_child_weight to 5. The model was trained on the training set with the sample features, sample weights, and model parameters under six random seeds. On the prediction set, the average correlation coefficient was 0.953, the root mean square error was 10.162 μatm, and the average relative error was 1.343%. Figure 6 shows the overall importance ranking of features in the trained model. Overall, sea surface temperature and chlorophyll a concentration are more important than other features, indicating that thermodynamic and biological processes are the dominant factors influencing the partial pressure of carbon dioxide in the global sea surface. It can also be seen that, in addition to features directly related to marine biology, physics, and chemistry, sea surface optical properties such as the 412 nm backscattering coefficient, the 510 nm total absorption coefficient, and the 510 nm phytoplankton absorption coefficient also contribute significantly to the prediction of sea surface carbon dioxide partial pressure.
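
The three accuracy measures quoted above can be computed as in the sketch below. It assumes that the correlation coefficient is the Pearson coefficient and that the average relative error is the mean of |prediction − observation| / |observation|, expressed in percent. The text does not spell out either definition, and the sample values are invented.

```python
import math


def evaluate(y_true, y_pred):
    """Return (Pearson r, RMSE, mean relative error in percent)."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    r = cov / math.sqrt(var_t * var_p)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mre = 100.0 * sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / n
    return r, rmse, mre


# Invented values in µatm, for illustration only.
observed = [350.0, 360.0, 370.0, 380.0]
predicted = [352.0, 358.0, 372.0, 378.0]
r, rmse, mre = evaluate(observed, predicted)
```

When several models trained under different random seeds are evaluated, these measures would be computed per model and then averaged.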

[0146] 5) Following step 4 above, using the feature dataset obtained in step 1 and the six models obtained in step 3, the global sea surface carbon dioxide partial pressure from 2000 to 2018 was reconstructed in parallel, and the average value was taken as the final result. To further verify the model accuracy, error statistics were computed on the reconstruction results by year and by major ocean region; the results are shown in Table 2 and Figure 7. They show that the model has high accuracy and a good fit in both spatial distribution and time series, and that no local overfitting is observed.

[0147]

[0148] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the invention. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention. Therefore, all technical solutions obtained through equivalent substitution or transformation fall within the protection scope of the present invention.

Claims

1. A method for reconstructing the partial pressure of carbon dioxide in seawater with high spatial resolution based on multi-source data, characterized in that the steps are as follows: S1: Acquire reanalysis and model data and remote sensing data related to seawater carbon dioxide partial pressure. Preprocess the obtained multi-source feature data. First, upsample the reanalysis and model data to the spatial resolution of the remote sensing data. Second, fill in the missing values in the remote sensing data. Finally, add spatial and temporal features and perform numerical transformation on some features to make them spatially continuous, thereby completing the preprocessing to form a feature dataset. In the preprocessing of the feature dataset, parallel subtasks are divided by monthly data patches, and invalid values are removed in advance to reduce the amount of data, thereby enabling parallel processing between different subtasks. S2: Based on the measured observation data of carbon dioxide partial pressure at the sea surface, all collection cruises are gridded using a route-weighted average method, with the grid precision matching the spatial resolution of the remote sensing data, thus obtaining a monthly observation sample set of carbon dioxide partial pressure at the sea surface. Then, the samples in the monthly observation sample set are spatially matched with the feature dataset to obtain a sample feature set. In the process of constructing the monthly observation sample set and the sample feature set, parallel sub-tasks are divided on a monthly basis according to the observation time to achieve parallel processing between different sub-tasks. 
S3: Based on spatiotemporal scale and statistical analysis, sample mixing inverse frequency weights are established for XGBoost model training, and features in the feature dataset are selected according to the feature importance of the pre-trained model; then, based on the selected features, the XGBoost model is further optimized using hyperparameter search and 10-fold cross-validation. S4: Using the trained XGBoost model, predict and reconstruct the monthly surface carbon dioxide partial pressure distribution of seawater at high spatial resolution; and divide the prediction process into parallel sub-tasks based on patches divided by month and meridian, so as to achieve parallel processing between different sub-tasks.

2. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 1, characterized in that: The specific method for forming the feature dataset through preprocessing in step S1 is as follows: S11: Using model and reanalysis data that include sea surface salinity, chlorophyll a, seawater mixed layer depth, dry atmospheric carbon dioxide mole fraction, and 10 m sea surface wind speed as feature data, and remote sensing data that include sea surface temperature as feature data, the various model and reanalysis data are upsampled to the spatial resolution of the remote sensing data and the spatial coordinate system is unified. At the same time, the temporal resolution of each feature data is unified to monthly, and daily data are converted to this temporal resolution by calculating the monthly average. S12: Locate missing values in the sea surface temperature remote sensing data and fill them in using the corresponding values from full-coverage reanalysis data; S13: Add latitude and longitude as spatial features and month as a temporal feature to the feature dataset; perform a trigonometric transformation on longitude and a logarithmic transformation on seawater mixed layer depth and chlorophyll a concentration to make the features spatially smooth and continuous.

3. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 2, characterized in that: During the construction of the feature dataset, for monthly single-time data, evenly divided patches are used as the smallest unit for dividing parallel sub-tasks for parallel processing. Land and invalid grid cells are removed in parallel using a process pool. The valid cells are converted from matrices to vectors, and the row and column numbers of each cell in the original matrix are recorded.

4. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 2, characterized in that: The specific method for step S2 is as follows: S21: The measured sea surface carbon dioxide partial pressure dataset is divided into a series of data subsets along three dimensions: year, month, and spatial grid, where the precision of the spatial grid is the same as that of the remote sensing data. Each data subset contains the data of one observation year and one observation month that fall in the grid cell located in the r-th row and c-th column of the spatial grid. For a data subset containing a total of C cruises $\{S_j \mid j = 1, 2, \ldots, C\}$, calculate the average value of the observation samples from each cruise: $\bar{p}_j = \frac{1}{N_j} \sum_{i=1}^{N_j} p_{j,i}$. In the formula, $N_j$ is the total number of observation points on cruise $S_j$, and $p_{j,i}$ is the i-th observation point on cruise $S_j$; S22: Perform a second average over the cruise averages in the data subset, and take the resulting average as the sample value of the spatial grid cell corresponding to this data subset: $\bar{p} = \frac{1}{C} \sum_{j=1}^{C} \bar{p}_j$. After all data subsets of the measured sea surface carbon dioxide partial pressure dataset are divided into sub-tasks on a monthly basis, the calculations in S21 and S22 are completed through parallel processing to obtain a monthly observation sample set of sea surface carbon dioxide partial pressure whose grid precision matches the spatial resolution of the remote sensing data. 
S23: Divide the data in the monthly observation sample set into subsets on a monthly basis, and use a process pool to process the spatial matching process between each subset and the feature dataset in parallel to obtain a sample feature set; each sample in the obtained sample feature set contains multiple feature data and observation data of sea surface carbon dioxide partial pressure, which are used to train the XGBoost model.

5. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 4, characterized in that: The specific method for step S3 is as follows: S31: Based on spatiotemporal scales and statistical analysis, calculate the reciprocal of the sample density within the block to which the sample belongs in the three dimensions of time, space, and value range to establish the sample mixture inverse frequency weight, which is defined in terms of the spatial weight, temporal weight, and statistical value weight of the sample, together with the weight scaling factor δ. The spatial weight of a sample is calculated as follows: the monthly observation dataset is divided into data subsets by year, month, and spatial grid, and the spatial weight of each sample in a data subset is determined by the sample size of that subset, obtained with a counting function. The temporal weight of a sample is calculated as follows: the monthly observation dataset is divided into data subsets by month and spatial grid, and the temporal weight is determined by the number of grid cells in the corresponding month's subset. The statistical value weight of a sample is calculated as follows: define the bias d to describe the interval in which the sample value lies within the value range of the overall sample set, and divide the monthly observation dataset into subsets according to d; the statistical value weight is determined by the number of samples in the corresponding subset. S32: Based on the established sample mixing inverse frequency weights and the feature data and observation data in the samples, the XGBoost model is pre-trained multiple times. 
Based on the average result of multiple pre-training, the number of times all features except spatiotemporal features are used as split tree nodes in the model is used as the evaluation criterion to filter out feature elements with lower importance and retain key feature elements as model input. S33: Ten-fold cross-validation is used to perform grid search on the hyperparameters of the XGBoost model. Taking into account hardware and computational costs, the model hyperparameters that enable the model to achieve the target accuracy are selected. S34: Combine the sample feature set obtained in S23, the sample mixing inverse frequency weights in S31, the feature elements selected in S32, and the model hyperparameters in S33 to train the XGBoost model under multiple random seeds, resulting in multiple XGBoost models for prediction.

6. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 5, characterized in that: In step S33, the model hyperparameters subject to the grid search include the maximum tree depth and the minimum leaf weight.

7. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 5, characterized in that: The specific method for step S4 is as follows: For the model prediction task of monthly sea surface carbon dioxide partial pressure distribution, the task is divided into parallel subtasks based on monthly and meridional patches. The subtasks are processed in parallel. During the processing, each trained XGBoost model in S34 is used to predict the monthly sea surface carbon dioxide partial pressure distribution based on the feature data corresponding to each subtask. The average of the prediction results of all XGBoost models is taken as the final reconstruction result, thus reconstructing the monthly sea surface carbon dioxide partial pressure distribution with high spatial resolution.

8. The high spatial resolution seawater carbon dioxide partial pressure reconstruction method based on multi-source data according to claim 7, characterized in that: When executing the model prediction subtask in parallel, a counter is used to record the prediction progress of the patches. When all tasks are completed within a single time interval, the matrix restoration task is automatically submitted, that is, the prediction result vector is restored into a matrix according to the row and column numbers.