A method for data interpolation of eddy covariance net ecosystem exchange
By combining random forest models, time series additive models, and empirical mode decomposition methods, this method interpolates net ecosystem carbon dioxide exchange data based on eddy covariance, solving the problems of low interpolation accuracy and insufficient applicability in existing technologies, and achieving high-precision data recovery under different vegetation cover and geographical environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Filing Date
- 2023-05-18
- Publication Date
- 2026-06-12
AI Technical Summary
The existing eddy covariance method has low accuracy in imputing missing flux data and lacks a robust imputation scheme applicable to different vegetation cover and geographical environments of global sites, especially in imputing long-term missing data.
We used a random forest model and a time series additive model (TSA) combined with empirical mode decomposition (EMD) to impute net ecosystem carbon dioxide exchange (NEE) data based on eddy covariance. Through preprocessing, model training, decomposition of trend and fluctuation terms, and imputation of missing values using influencing factor data, we complemented the advantages of different decomposition methods.
It improves the accuracy and robustness of interpolation and is applicable to net ecosystem exchange interpolation under various geographical environments and vegetation covers, especially showing a significant improvement in long blank data interpolation.
Figure CN116756495B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of atmospheric science and carbon dioxide flux technology, specifically to a method for interpolating net ecosystem exchange data based on eddy covariance.
Background Technology
[0002] Eddy covariance (EC) is an internationally recognized method for observing the exchange of greenhouse gases between ecosystems and the atmosphere. Since the 20th century, tens of thousands of flux tower sites have been established worldwide. As an observational technique for directly observing the material and energy fluxes between terrestrial ecosystems and the atmosphere, it is an important observational tool for the International FluxNet and numerous meteorological, ecological, and hydrological observation stations, playing a crucial role in global change research. However, due to various factors such as power outages, instrument malfunctions and maintenance, and data quality checks, many long-term EC sites experience gaps in approximately 20% to 60% of half-hour data points annually, with some periods of continuous data loss lasting up to half a month or even a month.
[0003] There are dozens of commonly used imputation methods for missing flux data based on traditional eddy covariance, including linear and multivariate regression models, lookup tables, multiple imputation, marginal distribution sampling, and machine learning. However, to date there is no international consensus on imputation methods for missing flux data, and the existing methods have low imputation accuracy, which makes imputation of long gaps even more difficult.
[0004] Furthermore, current methods often address only one type of vegetation cover or apply only to specific environments, so a robust and effective carbon flux interpolation scheme for the different vegetation covers and geographical settings of global sites is lacking. A universal and robust NEE interpolation method therefore plays a crucial role in quantifying the interannual variability of carbon budgets and in research on material and energy exchange in terrestrial ecosystems.
Summary of the Invention
[0005] To address the aforementioned problems, this invention aims to provide a method for imputing net ecosystem carbon dioxide exchange (NEE) data based on eddy covariance, thereby obtaining complete and reliable flux data by imputing missing data.
[0006] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0007] A method for interpolating net ecosystem carbon dioxide exchange data based on eddy covariance, characterized by comprising the following steps:
[0008] Step 1: Acquire observed meteorological data and carbon dioxide flux data (NEE);
[0009] Step 2: Preprocess the obtained meteorological data and carbon dioxide flux data to obtain effective NEE data and corresponding impact factor data;
[0010] Step 3: Train a random forest model based on the data obtained in Step 2, and use the trained random forest model to pre-impute missing values to obtain complete NEE time series data;
[0011] Step 4: Decompose the complete NEE time series data with the time series additive (TSA) model to obtain the trend term and fluctuation term, remove the trend values at the originally missing points, divide the remaining trend term data into training and test sets, train the machine learning model, and perform imputation to obtain the imputed trend term data.
[0012] Step 5: Decompose the complete NEE time series data with the empirical mode decomposition (EMD) algorithm to obtain the trend term and fluctuation term, remove the trend values at the originally missing points, divide the remaining trend term data into training and test sets, train the machine learning model, and perform imputation to obtain the imputed trend term data.
[0013] Step 6: Compare the interpolated data obtained in Step 4 with the interpolated data obtained in Step 5 so that the advantages of the two decomposition methods complement each other: TSA effectively extracts the trend of NEE, while EMD compensates for the inaccurate or distorted trend extraction that TSA may produce under long gaps.
[0014] Step 7: Superimpose the trend term obtained from the complete interpolation in Step 4 or Step 5 with the fluctuation term obtained from the corresponding algorithm decomposition to obtain the final NEE interpolation result.
[0015] Furthermore, the specific steps for step 2 are as follows:
[0016] Step 21: Perform cubic spline interpolation on the meteorological data corresponding to NEE to obtain data with half-hour intervals;
[0017] Step 22: Divide the carbon dioxide flux data into three levels, conduct a quality assessment, and retain the high-quality carbon dioxide flux data with a score of 0.
[0018] Step 23: Remove the missing NEE data and corresponding impact factor data, and retain the valid NEE and corresponding factor data.
[0019] Furthermore, the specific operational steps of step 3 include:
[0020] Step 31: Divide the NEE valid data obtained in Step 2 into a training set of 75% and a test set of 25%.
[0021] Step 32: Train the random forest model on the training set and test the model performance on the test set. Use grid search to obtain the optimal combination of model parameters.
[0022] Step 33: Input the missing NEE impact factor data into the trained RF model to impute the missing values and obtain the complete NEE time series data.
[0023] Furthermore, the specific operational steps of step 4 include:
[0024] Step 41: Define T as the time series, P as the trend term, and R as the fluctuation term; then:
[0025] $T = P + R$;
[0026] Step 42: Denote the NEE time series data completed in Step 3 as $x_t$, $t = 1, 2, \ldots, N$;
[0027] Step 43: Decompose the complete NEE time series data with the moving average method to obtain the trend term:
[0028] $P_t = \frac{1}{m}\sum_{i=-m/2}^{m/2-1} x_{t+i}$;
[0029] where $P_t$ is the trend term after decomposition and $m$ is the period;
[0030] Step 44: Subtract the trend term from the original sequence to obtain the fluctuation term R: $R_t = x_t - P_t$;
[0031] Step 45: Remove missing trend terms from the original data, train and test the machine learning model based on the obtained trend terms, then input the missing value influence factor data into the trained machine learning model to impute the trend terms, and finally superimpose the imputed result with the corresponding value fluctuation term to complete the imputation of NEE.
[0032] Furthermore, the specific operational steps of step 5 include:
[0033] Step 51: Find all local maxima and minima of the completed NEE time series data $x(n)$, and fit these extreme points by curve interpolation to obtain the upper envelope $e_{\max}(n)$ and the lower envelope $e_{\min}(n)$ of the signal;
[0034] Step 52: Calculate the average of the upper and lower envelopes to obtain the mean envelope $\bar{e}(n)$:
[0035] $\bar{e}(n) = \frac{e_{\max}(n) + e_{\min}(n)}{2}$;
[0036] Step 53: Subtract the mean envelope $\bar{e}(n)$ from the original signal $x(n)$ to obtain the remaining signal $h(n)$: $h(n) = x(n) - \bar{e}(n)$;
[0037] Step 54: Repeat steps 51-53 on the remaining signal $h(n)$, denoting the result of the k-th repetition as $h_k(n)$, until SD is less than the threshold value, and then stop to obtain the first-order modal component $c_1(n)$. The formula for calculating SD is:
[0038] $SD = \sum_{n} \frac{\left| h_{k-1}(n) - h_k(n) \right|^2}{h_{k-1}^2(n)}$;
[0039] Step 55: Subtract $c_1(n)$ from the signal $x(n)$ to obtain the first-order residual $r_1(n) = x(n) - c_1(n)$, and use $r_1(n)$ in place of the original signal $x(n)$. After repeating steps 51-55 $J$ times, the $J$-th order mode function $c_J(n)$ and the final residual $r_J(n)$ that meets the stopping standard are obtained. The expression of $x(n)$ after EMD decomposition is:
[0040] $x(n) = \sum_{j=1}^{J} c_j(n) + r_J(n)$;
[0041] Step 56: Reconstruct the low-frequency components into annual and quarterly trend terms that are similar in proportion to the time series trend term, and use the sum of the other high-frequency components and residuals as the volatility term;
[0042] Step 57: Remove the trend values at the originally missing points, train and test the machine learning model on the trend terms obtained from EMD decomposition, then input the influencing factor data at the missing points into the trained machine learning model to impute the trend terms, and finally superimpose the imputation results with the corresponding fluctuation terms to complete the NEE imputation.
[0043] Furthermore, the influencing factor data includes: air temperature, shortwave radiation, precipitation, vapor pressure deficit, wind speed, soil temperature, soil moisture content, normalized difference vegetation index, enhanced vegetation index, and three fuzzy variables (the decimal day of year of each half-hourly record, and the sine and cosine functions of the decimal day).
[0044] Compared with other interpolation methods, the advantages of this invention are:
[0045] First, traditional methods often use marginal distribution sampling (MDS) and other traditional interpolation methods for interpolation. Although machine learning RF models in most recent studies have better performance than traditional schemes and can interpolate long gaps better, the method proposed in this invention has significantly improved the accuracy of interpolation and the interpolation effect on long gaps compared with the RF model. At the same time, it is applicable to the interpolation of net ecosystem exchange of different vegetation covers in various geographical environments.
[0046] Second, simulation results show that the present invention not only significantly improves the accuracy of interpolation, but also has good robustness and adaptability.
[0047] Third, this invention selected five sites with different surface cover, representing different vegetation types, and used RF alone to fill artificial gaps of four different lengths. The results were compared with the performance of the method proposed in this invention. The experimental results show that the interpolation accuracy of this invention is significantly higher than that of RF filling alone.
Attached Figure Description
[0048] Figure 1 is a flowchart of the method proposed in this invention.
[0049] Figure 2 shows box plots and interval plots of the test indicators for four different artificial gap lengths at 25 stations in a certain area.
[0050] Figure 3 is a schematic diagram of the results of trend term interpolation using the time series additive model (TSA) at the semi-arid site US-Hn1 according to the present invention.
[0051] Figure 4 is a schematic diagram of the results of trend term extraction and interpolation at the semi-arid site US-Hn1 using empirical mode decomposition (EMD).
Detailed Implementation
[0052] To enable those skilled in the art to better understand the technical solutions of the present invention, the technical solutions of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.
[0053] To obtain complete and reliable flux data, this invention provides a reasonable, reliable, and robust method for imputing missing data, which specifically includes the following steps:
[0054] Step 1: Acquire meteorological and flux data
[0055] Data can be obtained from FLUXNET or similar observation sites, including observed meteorological and flux data.
[0056] The meteorological data include: air temperature (TA), shortwave radiation (SW), precipitation (P), vapor pressure deficit (VPD), wind speed (WS), soil temperature (TS), and soil moisture content (SWC).
[0057] The input data also includes the Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) obtained from satellite observations, as well as three fuzzy variables (the decimal day of year of each half-hourly record, and the sine and cosine functions of the decimal day).
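The three fuzzy variables can be sketched as follows. This is a minimal illustration, not the invention's code: the function name `seasonal_features`, the 365-day year, and the scaling of the decimal day to one full sine and cosine cycle per year are assumptions, since the text only states that the sine and cosine of the decimal day are used.

```python
import numpy as np

def seasonal_features(n_records, records_per_day=48, days_per_year=365.0):
    """Return decimal day of year and its sine and cosine for half-hourly records."""
    record_index = np.arange(n_records)
    decimal_day = record_index / records_per_day        # 0, 1/48, 2/48, ...
    # Assumption: one full cycle per year, so the pair encodes the season.
    angle = 2.0 * np.pi * decimal_day / days_per_year
    return decimal_day, np.sin(angle), np.cos(angle)

# One non-leap year of half-hourly records.
day, day_sin, day_cos = seasonal_features(48 * 365)
```

Using both sine and cosine keeps the end of December numerically close to the start of January, which the decimal day alone does not.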
[0058] Step 2: Process the acquired data
[0059] Python was used to process the acquired meteorological and flux data. The main task was to fill gaps in the meteorological data corresponding to NEE (carbon dioxide flux); some meteorological records were stored at daily rather than half-hour intervals, which creates gaps on the half-hourly grid. Cubic spline interpolation was applied to the meteorological data to obtain half-hourly data, and invalid records were then removed. Next, the quality of the NEE data was evaluated so that only high-quality carbon dioxide flux data was used for model training, while low-quality data was treated as missing. Specifically, NEE data quality was divided into three levels: gaps caused by field instrument or electrical malfunctions or maintenance received a score of 2; other low-quality data received a score of 1; and high-quality data received a score of 0. Only high-quality data (score 0) was selected for model training and gap filling. After preprocessing, Python was used to remove missing NEE data and the corresponding impact factor data, retaining only valid NEE and the corresponding impact factor data for model training.
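The preprocessing described above might look like the following sketch. The function names, the array layout (one row per half-hourly record, one column per impact factor), and the use of `scipy.interpolate.CubicSpline` are assumptions made for illustration; the text only specifies cubic spline interpolation and the retention of records with a quality score of 0.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def upsample_daily_to_half_hourly(daily_values):
    """Cubic-spline a daily series onto a half-hourly grid (48 records per day)."""
    daily_values = np.asarray(daily_values, dtype=float)
    n_days = daily_values.size
    t_daily = np.arange(n_days, dtype=float)                   # day index
    t_half_hour = np.linspace(0.0, n_days - 1, (n_days - 1) * 48 + 1)
    return CubicSpline(t_daily, daily_values)(t_half_hour)

def select_valid(nee, qc, drivers):
    """Keep records with quality score 0, finite NEE, and finite impact factors."""
    nee = np.asarray(nee, dtype=float)
    qc = np.asarray(qc)
    drivers = np.asarray(drivers, dtype=float)
    valid = (qc == 0) & np.isfinite(nee) & np.isfinite(drivers).all(axis=1)
    return nee[valid], drivers[valid], valid

# Hypothetical values: four daily air temperatures and four flux records.
ta_half_hourly = upsample_daily_to_half_hourly([10.0, 12.0, 11.0, 9.0])
nee_valid, drivers_valid, valid_mask = select_valid(
    nee=[1.0, np.nan, 3.0, 4.0],
    qc=[0, 0, 1, 0],
    drivers=[[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [np.nan, 2.0]],
)
```

The spline passes through the daily values at the start of each day, so the original observations are preserved on the finer grid.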
[0060] Step 3: Obtain complete NEE time series data based on the RF model.
[0061] The valid NEE data was divided into a training set (75%) and a test set (25%). A random forest (RF) model was trained on the training set and evaluated on the test set, with grid search or randomized search used to find the optimal parameters: if the data volume was small and the number of hyperparameter combinations was limited, grid search (GridSearchCV) was used; if the data volume was large and the number of parameter combinations was high, randomized search (RandomizedSearchCV) was chosen. Five-fold cross-validation was used in the search for the best hyperparameters. After model training was complete, the impact factor data at the missing time points was input to the model to impute the missing values and obtain the complete NEE time series data.
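A minimal sketch of this pre-imputation step with scikit-learn is given below. The function name `rf_pre_impute`, the small default parameter grid, and the synthetic data are illustrative assumptions; the 75%/25% split, grid search, and five-fold cross-validation follow the description.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

def rf_pre_impute(drivers, nee, param_grid=None, seed=0):
    """Fill NaN entries of `nee` with random forest predictions from `drivers`."""
    drivers = np.asarray(drivers, dtype=float)
    nee = np.asarray(nee, dtype=float)
    observed = np.isfinite(nee)
    # 75% training set and 25% test set, as in the description.
    X_train, X_test, y_train, y_test = train_test_split(
        drivers[observed], nee[observed], test_size=0.25, random_state=seed)
    if param_grid is None:
        # Assumption: a deliberately tiny grid so the sketch runs quickly.
        param_grid = {"n_estimators": [50, 100], "min_samples_leaf": [1, 2]}
    search = GridSearchCV(RandomForestRegressor(random_state=seed), param_grid, cv=5)
    search.fit(X_train, y_train)
    test_r2 = search.score(X_test, y_test)      # R^2 of the refit best model
    filled = nee.copy()
    if (~observed).any():
        filled[~observed] = search.predict(drivers[~observed])
    return filled, search.best_params_, test_r2

# Synthetic stand-in for impact factors and NEE, with 20 artificial NaN gaps.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 3))
y_true = 2.0 * X_demo[:, 0] - X_demo[:, 1]
y_gappy = y_true.copy()
y_gappy[rng.choice(200, size=20, replace=False)] = np.nan
filled_demo, best_params, r2_demo = rf_pre_impute(
    X_demo, y_gappy, param_grid={"n_estimators": [20]})
```

For larger grids, `RandomizedSearchCV` can be substituted for `GridSearchCV` with the same `fit`, `score`, and `predict` calls.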
[0062] When performing data interpolation using machine learning models (employing publicly available machine learning models), to improve interpolation accuracy, a time-series additive model is used to extract the overall trend and fluctuation terms of the interpolated NEE. The specific steps include:
[0063] Let T be the time series, P be the trend term, and R be the volatility term. Since the data records after filling are at half-hour intervals, the trend term P is decomposed using the moving average method (i.e., the time series additive model) with a period of 48 intervals (i.e., one day). Then, the volatility term R is obtained by subtracting the trend term from the NEE time series data.
[0064] $T = P + R$
[0065] The data are the complete half-hourly NEE time series obtained after interpolation. Let the interpolated NEE time series data be:
[0066] $x_t,\ t = 1, 2, \ldots, N$;
[0067] Decomposing the NEE time series data using the moving average method:
[0068] $P_t = \frac{1}{m}\sum_{i=-m/2}^{m/2-1} x_{t+i}$
[0069] In the formula, $P_t$ is the trend term after decomposition and $m$ is the period. The period used in this invention is one day, i.e., $m = 48$. The fluctuation term is obtained by subtracting the trend term from the original sequence: $R_t = x_t - P_t$.
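The additive decomposition with a one-day moving average can be sketched as below. The window alignment (24 records before and 23 after each time point) and the shortened windows at the two ends of the series are assumptions of this sketch; the description only fixes the period at 48 half-hourly records.

```python
import numpy as np

def moving_average_trend(x, period=48):
    """Trend as the mean over a window of `period` records around each point."""
    x = np.asarray(x, dtype=float)
    n = x.size
    half = period // 2
    csum = np.concatenate(([0.0], np.cumsum(x)))
    # Window [t - half, t - half + period), clipped to the data range at the edges.
    start = np.clip(np.arange(n) - half, 0, n)
    stop = np.clip(np.arange(n) - half + period, 0, n)
    return (csum[stop] - csum[start]) / (stop - start)

def additive_decompose(x, period=48):
    """Split a series into trend P and fluctuation R so that x = P + R."""
    trend = moving_average_trend(x, period)
    return trend, np.asarray(x, dtype=float) - trend

# Synthetic example: slow linear drift plus a daily cycle of 48 records.
t = np.arange(48 * 10)
series = 0.01 * t + np.sin(2.0 * np.pi * t / 48.0)
trend_demo, fluct_demo = additive_decompose(series)
```

Because the window spans exactly one daily cycle, the daily oscillation averages out of the trend and remains entirely in the fluctuation term.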
[0070] To compensate for the errors that the moving average method may introduce under long gaps, this invention proposes using Empirical Mode Decomposition (EMD) for component reconstruction. Component reconstruction first decomposes the complete NEE time series data obtained after completion into multiple components, namely multiple IMFs and a residual (res), which together contain high-frequency and low-frequency components. Reconstruction then means superimposing the high-frequency components to obtain the volatility term, and superimposing the low-frequency components and res to obtain the trend term. In other words, the data processed by component reconstruction is the completed NEE time series data, and the resulting components are the multiple IMF components obtained after NEE decomposition and the residual. The time-frequency analysis method based on EMD is suitable for the analysis of nonlinear and non-stationary signals as well as linear and stationary signals.
[0071] The EMD method assumes that any signal is composed of different IMFs, each of which can be linear or nonlinear. IMF components must satisfy two conditions: (1) the number of their extrema and the number of their zero-crossings must be the same or differ by at most one; (2) their upper and lower envelopes must be locally symmetrical about the time axis. IMFs are generated by sifting the raw data, which is an iterative process. The signal can be decomposed by EMD into the sum of several IMFs and a residual function.
[0072] The basic steps of the EMD algorithm are as follows:
[0073] Step 1: Find all local maxima and minima of the completed NEE time series data $x(n)$, and fit these extreme points by curve interpolation to obtain the upper envelope $e_{\max}(n)$ and the lower envelope $e_{\min}(n)$ of the signal.
[0074] Step 2: Calculate the average of the upper and lower envelopes to obtain the mean envelope $\bar{e}(n)$:
[0075] $\bar{e}(n) = \frac{e_{\max}(n) + e_{\min}(n)}{2}$
[0076] Step 3: Subtract the mean envelope $\bar{e}(n)$ from the original signal $x(n)$ to obtain the remaining signal $h(n) = x(n) - \bar{e}(n)$. Generally speaking, for a stationary signal, $h(n)$ is the first mode function (IMF) of the original signal $x(n)$. A non-stationary signal, however, does not change monotonically within a given region (e.g., monotonically increasing) but exhibits multiple inflection points. Because of this complexity, if the inflection points that accurately reflect the features of the original signal $x(n)$ are not selected, the resulting first-order mode function is inaccurate, so further sifting is required.
[0077] Step 4: Repeat steps 1-3 on the remaining signal $h(n)$, denoting the result of the k-th repetition as $h_k(n)$, until SD falls below the screening threshold (typically 0.2-0.3), at which point the process stops, yielding a suitable first-order modal component $c_1(n)$, i.e., the first IMF. SD is calculated as follows:
[0078] $SD = \sum_{n} \frac{\left| h_{k-1}(n) - h_k(n) \right|^2}{h_{k-1}^2(n)}$
[0079] where $n$ is the index of the time series data and $k$ is the label of the remaining signal;
[0080] Step 5: Subtract $c_1(n)$ from the signal $x(n)$ to obtain the first-order residual $r_1(n) = x(n) - c_1(n)$, and use $r_1(n)$ in place of the original signal $x(n)$. Repeat steps 1-5 $J$ times to obtain the $J$-th order mode function $c_J(n)$ and the final residual $r_J(n)$ that meets the stopping standard. The final expression of the original signal $x(n)$ after EMD decomposition is:
[0081] $x(n) = \sum_{j=1}^{J} c_j(n) + r_J(n)$
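The sifting procedure in steps 1-5 can be sketched as follows. This is an illustrative simplification rather than the invention's implementation: the function names, the anchoring of both envelopes at the end points, the limits `max_imfs` and `max_sift`, and the stopping statistic are assumptions. In particular, the stopping statistic here is the ratio of summed squares, a commonly used and numerically safer variant, not the per-sample SD ratio given in the text.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def envelope_mean(h):
    """Mean of the cubic-spline envelopes through local maxima and minima."""
    n = np.arange(h.size)
    d = np.diff(h)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    if maxima.size < 2 or minima.size < 2:
        return None                     # too few extrema: treat h as a residual
    # Assumption: anchor both envelopes at the end points to limit edge swings.
    up_idx = np.concatenate(([0], maxima, [h.size - 1]))
    lo_idx = np.concatenate(([0], minima, [h.size - 1]))
    upper = CubicSpline(up_idx, h[up_idx])(n)
    lower = CubicSpline(lo_idx, h[lo_idx])(n)
    return 0.5 * (upper + lower)

def emd(x, sd_threshold=0.25, max_imfs=10, max_sift=50):
    """Decompose x into IMFs (high to low frequency) and a final residual."""
    residual = np.asarray(x, dtype=float).copy()
    imfs = []
    for _imf in range(max_imfs):
        h_prev = residual.copy()
        if envelope_mean(h_prev) is None:
            break                       # residual has too few extrema: stop
        for _sift in range(max_sift):
            mean_env = envelope_mean(h_prev)
            if mean_env is None:
                break
            h = h_prev - mean_env
            # Variant stopping statistic: ratio of sums instead of sum of ratios.
            sd = np.sum((h_prev - h) ** 2) / (np.sum(h_prev ** 2) + 1e-12)
            h_prev = h
            if sd < sd_threshold:
                break
        imfs.append(h_prev)
        residual = residual - h_prev    # replace the signal by the new residual
    return np.array(imfs), residual

# Synthetic signal: fast cycle, slow cycle, and a weak linear drift.
t_emd = np.arange(1000)
x_emd = (np.sin(2 * np.pi * t_emd / 20.0)
         + 0.5 * np.sin(2 * np.pi * t_emd / 200.0) + 0.001 * t_emd)
imfs_demo, residual_demo = emd(x_emd)
```

By construction the IMFs and the final residual sum back to the input, which mirrors the decomposition expression above.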
[0082] As shown above, by decomposing and reconstructing the NEE-filled sequence through EMD, the time series data is decomposed into multiple components c(n) and residuals r(n). The low-frequency components (IMF) after decomposition are summed to ensure that the difference between the summation of the low-frequency components and the trend term after time series decomposition is no more than 0.15 of the total data. The low-frequency components are reconstructed into annual and quarterly trend terms with a similar proportion to the trend term of the time series. The sum of the other high-frequency components and residuals is used as the fluctuation term.
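One possible reading of this reconstruction rule is sketched below. The function name, the use of a relative Euclidean distance to the TSA trend as the quantity compared with 0.15, and the flag controlling whether the residual is added to the trend or to the fluctuation term are all assumptions, since the text does not define the difference measure precisely.

```python
import numpy as np

def reconstruct_trend(imfs, residual, reference_trend, residual_in_trend=True):
    """Sum the lowest-frequency IMFs so the result best matches a reference trend."""
    imfs = np.asarray(imfs, dtype=float)            # shape (n_imfs, n_records)
    residual = np.asarray(residual, dtype=float)
    reference_trend = np.asarray(reference_trend, dtype=float)
    total = imfs.sum(axis=0) + residual
    best = None
    for k in range(imfs.shape[0] + 1):
        # IMFs are assumed ordered from high to low frequency; keep IMFs k..end.
        trend = imfs[k:].sum(axis=0)
        if residual_in_trend:
            trend = trend + residual
        err = (np.linalg.norm(trend - reference_trend)
               / (np.linalg.norm(reference_trend) + 1e-12))
        if best is None or err < best[0]:
            best = (err, k, trend)
    err, k, trend = best
    return trend, total - trend, k, err

# Hypothetical components: one fast IMF, one slow IMF, and a residual.
imfs_small = np.array([[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]])
res_small = np.array([1.0, 2.0, 3.0, 4.0])
ref_small = imfs_small[1] + res_small
trend_small, fluct_small, split_k, rel_err = reconstruct_trend(
    imfs_small, res_small, ref_small)
```

The returned relative error can then be checked against the 0.15 tolerance mentioned in the description.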
[0083] Obtain the trend and fluctuation terms after EMD decomposition, then remove the trend values at the originally missing points. Using the influence factor data, train and test the machine learning model on the remaining trend values, which contain no missing points. Either search and adjust the hyperparameters again or directly reuse the parameters of the first imputation model. After training, input the influence factor data at the missing points to impute the trend terms again. Finally, superimpose the corresponding fluctuation terms to complete the imputation of NEE.
[0084] This invention uses the time series additive model (TSA) and empirical mode decomposition (EMD) to decompose the completed NEE time series data. First, the NEE time series data is completed by RF, then TSA and EMD are used to decompose the completed NEE, and finally machine learning algorithms are used to complete the interpolation.
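The second imputation pass can be sketched as below, under the assumption that a scikit-learn random forest with fixed, illustrative settings stands in for whichever machine learning model is chosen; the function and argument names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def impute_trend_and_superimpose(drivers, trend, fluctuation, was_missing, seed=0):
    """Refit on trend values at observed records, predict the gaps, add fluctuation."""
    drivers = np.asarray(drivers, dtype=float)
    trend = np.asarray(trend, dtype=float)
    was_missing = np.asarray(was_missing, dtype=bool)
    # Assumption: fixed settings; the text allows re-tuning or reusing parameters.
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(drivers[~was_missing], trend[~was_missing])
    trend_filled = trend.copy()
    if was_missing.any():
        trend_filled[was_missing] = model.predict(drivers[was_missing])
    return trend_filled + np.asarray(fluctuation, dtype=float)

# Synthetic stand-in: the trend depends on the first driver, gaps at records 10-19.
rng_f = np.random.default_rng(0)
drivers_f = rng_f.uniform(size=(100, 2))
trend_f = drivers_f[:, 0]
fluct_f = np.full(100, 0.1)
missing_f = np.zeros(100, dtype=bool)
missing_f[10:20] = True
nee_final = impute_trend_and_superimpose(drivers_f, trend_f, fluct_f, missing_f)
```

Only the trend values at originally missing records are replaced; records that were observed keep their decomposed trend and fluctuation unchanged.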
[0085] This invention employs a parallel approach using both TSA and EMD decomposition methods to create complementarity and a comparative effect. TSA effectively extracts long-term NEE trends, capturing over 95% of the total NEE in interannual NEE and effectively smoothing the data, reducing its complexity. However, using TSA with a daily cycle may lead to inaccurate trend extraction or errors over long intervals. EMD effectively addresses nonlinearity issues, but the recombined components (IMFs) after EMD decomposition result in a slightly weaker capture of annual NEE compared to TSA. Therefore, EMD decomposition is used in parallel for comparison, providing a complementary and comparative effect.
[0086] Example
[0087] Since this invention addresses the imputation of missing data, the inherent data gaps limit the accuracy and performance verification results of the method. The common method currently is to create artificial gaps in data without missing gaps to verify the performance of the imputation method. Therefore, this invention also uses the method of creating four different lengths of artificial gaps to verify the performance of this method.
[0088] 1. Data source for the experiment: NEE data and meteorological data from the FLUXNET2015 site.
[0089] The interpolation results at the semi-arid site US-Hn1 were compared, under the same artificial gap lengths, with the RF schemes reported in the 2021 publication "Technical note: Uncertainties in eddy covariance CO2 fluxes in a semiarid sagebrush ecosystem caused by gap-filling approaches." The comparison results are shown in Table 1. The RF data are the error analysis data of the best RF scheme in that publication. TSA-RF (time series additive model decomposition combined with RF) and EMD-RF (empirical mode decomposition combined with RF) are the comparative experimental results obtained with the present invention under the same artificial gap lengths.
[0090] Table 1
[0091]
[0092] As shown in Table 1, this method significantly improves the accuracy of interpolation. Furthermore, this invention first interpolates the NEE time series and then decomposes it using EMD and the moving average method. The interpolation results were evaluated with four different machine learning algorithms (XGBoost, RF, SVR, and BP) at 25 stations in a certain region under four different artificial gap lengths. Experimental results show that the proposed method of first completing, then decomposing, and finally filling the NEE significantly improves on the previous direct interpolation method across all indicators. For a gap length of 1 hour (short), EMD decomposition combined with RF and XGBoost gives a mean R² of 0.98 over the 25 stations, with average RMSE values of 0.492 and 0.480, respectively; time series decomposition gives a mean R² of 0.99, with average RMSE values of 0.357 and 0.364, respectively. For a gap length of two months (very long), the mean R² of EMD decomposition combined with RF and XGBoost decreases to 0.88 and 0.85, and the average RMSE increases to 1.60 and 1.59, respectively; with time series decomposition, the mean R² decreases to 0.88 and the average RMSE increases to 1.365 and 1.376. This indicates that the method has good robustness and adaptability and greatly improves the imputation of data with long gaps.
[0093] To further verify the effectiveness of the method proposed in this invention, the following experiments were conducted:
[0094] The 25 eddy covariance flux towers used in the experiment were all located within a certain region. Each site had more than one calendar year of flux data, and the recorded data included the complete set of input factor data required for interpolation, including solar radiation, temperature, humidity, wind speed, and wind direction. This experiment selected data sampled at 1 Hz and recorded as 30-minute averages. The stations were distributed across the region and spanned multiple climate zones, mainly temperate continental, temperate maritime, and subtropical humid climates. They were diverse and representative, covering five types of vegetation: farmland, grassland, shrubland, evergreen coniferous forest, and deciduous broad-leaved forest.
[0095] In this experiment, only high-quality NEE data (i.e., scores of 0) was used for model training and gap filling. To evaluate the gap-filling effect, introducing artificial gaps into the data is essential. This invention generates four different lengths of artificial gaps in the decomposed dataset, uses four different methods to fill them, and finally superimposes the fluctuation term of the artificial gaps to obtain a comparison between the predicted and observed values under the NEE artificial gaps. It is difficult to perfectly replicate the size and location of artificial gaps in real-world scenarios. To control the quality of artificial gaps and eliminate the potential influence of sample size and gap location on performance evaluation, a training and test set of 10 independent samples of artificial gaps was created for each type of gap.
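Generating artificial gaps of a fixed length, with 10 independent samples per gap type, might be sketched as follows. The function name, the random block placement (blocks may overlap), and the example values (a 1-hour gap of 2 half-hourly records and a 10% target fraction) are assumptions for illustration.

```python
import numpy as np

def make_artificial_gaps(n_records, gap_length, gap_fraction, rng):
    """Return a boolean mask with random blocks of `gap_length` records set True."""
    if gap_length > n_records:
        raise ValueError("gap_length exceeds the number of records")
    mask = np.zeros(n_records, dtype=bool)
    target = int(round(gap_fraction * n_records))
    attempts = 0
    # Keep adding blocks until the target share of records is masked.
    while mask.sum() < target and attempts < 100000:
        start = rng.integers(0, n_records - gap_length + 1)
        mask[start:start + gap_length] = True
        attempts += 1
    return mask

# Ten independent samples of 1-hour gaps over one year of half-hourly records.
gap_masks = [
    make_artificial_gaps(17520, gap_length=2, gap_fraction=0.10,
                         rng=np.random.default_rng(seed))
    for seed in range(10)
]
```

Drawing each sample from a separately seeded generator keeps the ten gap layouts independent and reproducible.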
[0096] This invention uses meteorological data from some observation stations provided by the FLUXNET2015 database as model inputs, including air temperature (TA_F), shortwave radiation (SW_IN_F), precipitation (P_F), vapor pressure deficit (VPD_F), wind speed (WS_F), soil temperature (TS_F_MDS_1), and soil moisture content (SWC_F_MDS_1). In addition to the above meteorological variables, the input variables of the ML algorithm also include the Normalized Difference Vegetation Index (NDVI) and Enhanced Vegetation Index (EVI) from the Moderate Resolution Imaging Spectroradiometer (MODIS), as well as three fuzzy variables (i.e., the decimal day of year and its sine and cosine functions, which represent seasonal variation).
[0097] In the experiment, missing values were first removed from the NEE time series data with gaps, retaining only data with a score of 0. The data was then divided into training and test sets. The "randomForest" R package was used to create 400 regression trees for each case, with 3 variables tried at each node split. All influencing factors were used as inputs to train the model and complete the first imputation to obtain complete NEE time series data. Next, the time series additive model and EMD were used to decompose the imputed NEE time series into trend and fluctuation components. Missing data in the trend component was then removed, and the data without gaps was used as the training set. Four different lengths of artificial gaps (10%–15%) were randomly generated, and the remaining data (80% for training and 20% for testing) was used to train the model. Four machine learning algorithms (XGBoost, RF, SVR, and BP neural networks) were then applied. To ensure consistency in model comparison, all input data were normalized. The specific model parameters were designed as follows. In the RF experiment, the model for the decomposed NEE trend term was tuned using Python grid search and 5-fold cross-validation. The searched hyperparameters included the number of trees (50-5000), the maximum number of features selected when building the decision tree (0.2-0.8), the maximum depth of the decision tree (10-360), the minimum number of samples required for each leaf node (1, 2, 4, or 6), and the minimum number of samples required to split a node (2, 5, 10, 12, or 15). The XGBoost parameters included the learning rate (0.01, 0.1, or 0.02), the minimum loss reduction required for node splitting (0-0.5), the minimum sum of sample weights in child nodes (1, 2, 5, or 10), and the feature sampling ratio when building the tree (0.6-1). The SVR tuning parameters included the kernel function and the cost regularization parameter (C=1, 10, 100, or 100). In the BP experiment, the Keras library in Python was used for the structure design and parameter setting of the BP neural network. After the parameters were determined, the missing values of the artificial gaps in the trend term were predicted and then superimposed with the separated fluctuation term to obtain the NEE filling result and verify the accuracy and feasibility of the model.
[0098] The optimal parameters for the four algorithms in the experiment are:
[0099] RF: n_estimators=1636, min_samples_split=5, min_samples_leaf=2, max_features=0.5, max_depth=None, bootstrap=False, random_state=0
[0100] XGBoost: subsample=0.8, seed=0, reg_lambda=1, reg_alpha=0, n_jobs=-1, n_estimators=3333, min_child_weight=5, max_depth=298, learning_rate=0.01, gamma=0.0, colsample_bytree=0.7
[0101] SVR: kernel='rbf', gamma=0.1, C=100
[0102] Backpropagation Neural Network: Structure: Input layer - Intermediate layer - Output layer = 120 - 10 - 1, Activation function: sigmoid, Training 200 times.
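For reference, the RF and SVR settings listed above map directly onto scikit-learn estimators, assuming (as the parameter names suggest, though the text does not say so explicitly) that scikit-learn's `RandomForestRegressor` and `SVR` were the implementations used. The XGBoost and Keras models are omitted from this sketch.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Reported optimal RF settings; 1636 trees make fitting slow on large datasets.
rf_model = RandomForestRegressor(
    n_estimators=1636,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features=0.5,
    max_depth=None,
    bootstrap=False,
    random_state=0,
)

# Reported optimal SVR settings.
svr_model = SVR(kernel="rbf", gamma=0.1, C=100)
```

Both objects are unfitted; they are trained by calling `fit` with the normalized impact factors and the trend term values.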
[0103] This experiment uses four commonly used performance indicators: coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and bias to statistically compare the gap-filling value within the artificial gap and the original measurement value for NEE at each station. The formulas are as follows:
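[0104] $R^2 = \frac{\left[\sum_{i=1}^{N}(y_i-\bar{y})(\hat{y}_i-\bar{\hat{y}})\right]^2}{\sum_{i=1}^{N}(y_i-\bar{y})^2\,\sum_{i=1}^{N}(\hat{y}_i-\bar{\hat{y}})^2}$
[0105] $RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2}$
[0106] $MAE = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i-y_i\right|$
[0107] $Bias = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i-y_i)$
[0108] In the formulas, $y_i$ is the measured value, $\hat{y}_i$ is the predicted value, $\bar{y}$ and $\bar{\hat{y}}$ are the average values of the measured and predicted values, respectively, and $N$ is the number of values compared.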
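The four indicators can be computed as in the sketch below. It assumes that R² is the squared Pearson correlation between measured and predicted values (which is why both means appear in the definitions) and that bias is the mean of predicted minus measured values; the function name is hypothetical.

```python
import numpy as np

def gap_filling_metrics(measured, predicted):
    """R^2, RMSE, MAE and bias between measured and gap-filled values."""
    y = np.asarray(measured, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - y                               # assumption: predicted minus measured
    r = np.corrcoef(y, p)[0, 1]               # Pearson correlation coefficient
    return {
        "R2": r ** 2,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "Bias": float(np.mean(err)),
    }

# Predictions offset by a constant 0.5 from the measurements.
metrics_demo = gap_filling_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

A constant offset leaves R² at 1 while RMSE, MAE, and bias all equal the offset, which shows why the indicators are reported together.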
[0109] This invention uses a time series additive model and EMD decomposition algorithm to extract trend and fluctuation terms. Then, four machine learning algorithms (Xgboost, RF, SVR, BP) are used to pre-interpolate the extracted trend terms and superimpose the fluctuation terms. The results were then tested for four different gap lengths. The interpolation results of 25 stations in a certain region under four different artificial gap lengths were evaluated. The experimental results show that the method of first completing, then decomposing, and finally filling the NEE under different gap lengths significantly improves the performance of all indicators compared with the previous direct interpolation.
[0110] Overall, after EMD and moving average decomposition, the performance of each algorithm decreases as the gap length increases. Across all stations and all four gap conditions, RF and XGBoost outperform SVR and BP neural networks. Between the two decomposition methods, time series decomposition is slightly better than EMD. For a gap length of 1 hour (short), all algorithms achieve their highest R² and their lowest Bias, RMSE, and MAE. For this gap length, EMD decomposition combined with RF and XGBoost gives a mean R² of 0.98 over all sites, with average RMSE values of 0.492 and 0.480, respectively; time series decomposition gives a mean R² of 0.99, with average RMSE values of 0.357 and 0.364, respectively. However, for a very long gap of two months, the mean R² over all sites for EMD decomposition combined with RF and XGBoost decreases to 0.88 and 0.85, while the average RMSE increases to 1.60 and 1.59, respectively; with time series decomposition, the mean R² decreases to 0.88 and the average RMSE rises to 1.365 and 1.376, with RF slightly better than XGBoost overall. The same trend was observed for Bias and MAE, which increase as the gap lengthens. Notably, the time series decomposition method performed better in filling longer gaps.
[0111] To verify the performance improvement of the decomposition-based model, this invention selected five sites representing different vegetation types according to land cover: farmland (GZ1), grassland (AR1), shrubland (SK2), evergreen coniferous forest (Me6), and deciduous broad-leaved forest (oho). Random forest (RF) was then used alone (with all parameters kept consistent with the RF algorithm used after decomposition) to fill artificial gaps of four different lengths, and the results were compared with those of the decomposition-based algorithms. The experimental results show that filling via the decomposed trend term is significantly more effective than filling with RF alone. Under the four gap conditions, the median of the mean value of each index after EMD decomposition ranges from the BP neural network ($R^2$=0.930, RMSE=1.407, Bias=-0.006, MAE=1.046), to SVR ($R^2$=0.927, RMSE=1.39, Bias=-0.015, MAE=1.040), to XGBoost ($R^2$=0.938, RMSE=1.337, Bias=-0.068, MAE=0.966) and RF ($R^2$=0.939, RMSE=1.307, Bias=-0.016, MAE=0.942). After time series decomposition, the corresponding values range from the BP neural network ($R^2$=0.957, RMSE=1.168, Bias=0.0002, MAE=0.840), to SVR ($R^2$=0.959, RMSE=1.159, Bias=0.047, MAE=0.846), to XGBoost ($R^2$=0.964, RMSE=1.071, Bias=-0.1352, MAE=0.766) and RF ($R^2$=0.966, RMSE=1.041, Bias=-0.016, MAE=0.750). All of these are better than using RF alone ($R^2$=0.78, RMSE=2.58, Bias=-0.17, MAE=1.62).
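An artificial-gap test of this kind can be sketched as follows. How the gaps were positioned is not stated here, so the example assumes a single contiguous gap at a random position; the function names `insert_artificial_gap` and `held_out_pairs` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def insert_artificial_gap(
    series: List[float], gap_length: int, seed: int = 0
) -> Tuple[List[Optional[float]], int]:
    """Return a copy of the series with one contiguous run set to None.

    Assumption: one gap per run, placed uniformly at random. With
    half-hourly records, a 1-hour gap corresponds to gap_length = 2.
    """
    if not 0 < gap_length < len(series):
        raise ValueError("gap_length must be positive and shorter than the series")
    rng = random.Random(seed)
    start = rng.randrange(0, len(series) - gap_length + 1)
    gapped: List[Optional[float]] = list(series)
    for i in range(start, start + gap_length):
        gapped[i] = None
    return gapped, start


def held_out_pairs(
    original: List[float], gapped: List[Optional[float]], filled: List[float]
) -> List[Tuple[float, float]]:
    """(observed, filled) pairs at the artificially removed positions only."""
    return [(o, f) for o, g, f in zip(original, gapped, filled) if g is None]
```

Scoring only the held-out pairs keeps the comparison between RF alone and the decomposition-based variants restricted to values the models never saw.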
[0112] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely illustrative of the principles of the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of this invention is defined by the appended claims and their equivalents.
Claims
1. A method for interpolating net ecosystem exchange data based on eddy covariance, characterized in that it includes the following steps:
Step 1: Acquire observed meteorological data and carbon dioxide flux (NEE) data;
Step 2: Preprocess the acquired meteorological data and carbon dioxide flux data to obtain valid NEE data and the corresponding influencing factor data;
Step 3: Train a random forest model on the data obtained in Step 2, and use the trained random forest model to pre-impute the missing values to obtain complete NEE time series data;
Step 4: Based on the time series additive model (TSA), decompose the complete NEE time series data into a trend term and a fluctuation term, remove the trend terms corresponding to the missing values in the observed meteorological data and carbon dioxide flux (NEE) data, divide the dataset, train the machine learning model, and perform imputation to obtain the complete imputed trend term data;
Step 5: Based on the empirical mode decomposition (EMD) algorithm, decompose the complete NEE time series data into a trend term and a fluctuation term, remove the trend terms corresponding to the original missing values in the observed meteorological data and carbon dioxide flux (NEE) data, divide the dataset, train the machine learning model, and perform imputation to obtain the complete imputed trend term data;
Step 6: Compare the imputed trend term data obtained in Step 4 with that obtained in Step 5, so that the advantages of the different decomposition methods complement each other: TSA can effectively extract the trend of NEE, while EMD can compensate for the inaccuracy and distortion that TSA trend extraction may suffer under long gaps;
Step 7: Superimpose the trend term data obtained in Step 4 or Step 5 with the fluctuation term obtained from the corresponding decomposition to obtain the final NEE interpolation result;
The specific operational steps of the time series additive model TSA in Step 4 include:
Step 41: Define $T$ as the time series, $P$ as the trend term, and $R$ as the fluctuation term, so that $T = P + R$;
Step 42: Record the NEE time series data completed in Step 3 as the time series $T$;
Step 43: Apply the moving average method to the complete NEE time series data to obtain the trend term, where $P$ is the trend term after decomposition and the window length of the moving average is the cycle (period);
Step 44: Calculate the fluctuation term $R = T - P$;
Step 45: Remove the trend terms corresponding to the missing values in the observed meteorological data and carbon dioxide flux (NEE) data. Train and test the machine learning model on the remaining trend terms. Then input the influencing factor data corresponding to the missing values into the trained machine learning model to impute the trend terms. Finally, superimpose the imputed trend terms with the corresponding fluctuation terms to complete the imputation of NEE.
2. The method according to claim 1, characterized in that the specific steps of Step 2 are as follows:
Step 21: Perform cubic spline interpolation on the missing meteorological data corresponding to NEE to obtain data at half-hour intervals;
Step 22: Assess the quality of the carbon dioxide flux data on three levels, and retain the high-quality carbon dioxide flux data with a quality score of 0;
Step 23: Remove the missing NEE data and the corresponding influencing factor data, and retain the NEE data that are not missing together with the corresponding influencing factor data.
3. The method according to claim 2, characterized in that Step 3 includes the following specific steps:
Step 31: Divide the valid NEE data obtained in Step 2 into a training set (75%) and a test set (25%);
Step 32: Train the random forest model on the training set, test the model performance on the test set, and use grid search to obtain the optimal combination of model parameters;
Step 33: Input the influencing factor data corresponding to the missing NEE values into the trained RF model to impute the missing values and obtain the complete NEE time series data.
4. The method according to claim 3, characterized in that Step 5, empirical mode decomposition (EMD), includes the following specific operational steps:
Step 51: Find the local maximum and minimum points of the completed NEE time series data, and fit these extreme points by curve interpolation to obtain the upper envelope and the lower envelope of the signal;
Step 52: Calculate the average of the upper and lower envelopes to obtain the mean envelope;
Step 53: Subtract the mean envelope from the original signal to obtain the remaining signal;
Step 54: Repeat Steps 51 to 53 on the remaining signal until SD is less than the threshold value, then stop to obtain the first-order modal component;
Step 55: Subtract the first-order modal component from the original signal to obtain the first-order residual, and use this residual to replace the original signal. After Steps 51 to 55 have been repeated n times, the nth-order mode function and the final residual that meets the stopping standard are obtained, so that after EMD decomposition the signal is expressed as the sum of the n mode functions and the final residual;
Step 56: Reconstruct the low-frequency components after EMD decomposition into a trend term, and use the sum of the other high-frequency components and the residual as the fluctuation term;
Step 57: Remove the trend terms corresponding to the missing values in the observed meteorological data and carbon dioxide flux (NEE) data. Train and test the machine learning model on the remaining trend terms after EMD decomposition. Then input the influencing factor data corresponding to the missing values into the trained machine learning model to impute the trend terms. Finally, superimpose the imputed trend terms with the corresponding fluctuation terms to complete the imputation of NEE.
5. The method according to claim 1, characterized in that the influencing factor data include: air temperature, shortwave radiation, precipitation, vapor pressure deficit, wind speed, soil temperature, soil moisture content, normalized difference vegetation index, and enhanced vegetation index.
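For reference, the sifting procedure in Steps 51 to 55 of claim 4 can be written in conventional EMD notation as below. All symbols are assumptions chosen for illustration and are not taken from the claims: $x(t)$ is the completed NEE series, $u(t)$ and $l(t)$ are the upper and lower envelopes, $m(t)$ is the mean envelope, $h_k(t)$ is the remaining signal after the $k$-th sifting pass, $c_i(t)$ is the $i$-th mode function, and $r_n(t)$ is the final residual. The SD expression is the commonly used stopping criterion and is likewise an assumption about the form intended.

```latex
\begin{align}
m(t) &= \frac{u(t) + l(t)}{2} \\
h(t) &= x(t) - m(t) \\
\mathrm{SD} &= \sum_{t=0}^{T} \frac{\left| h_{k-1}(t) - h_{k}(t) \right|^{2}}{h_{k-1}^{2}(t)} \\
r_{1}(t) &= x(t) - c_{1}(t) \\
x(t) &= \sum_{i=1}^{n} c_{i}(t) + r_{n}(t)
\end{align}
```

Under this notation, Step 56 groups the low-frequency $c_i(t)$ into the trend term and assigns the remaining high-frequency $c_i(t)$ together with $r_n(t)$ to the fluctuation term.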