Multi-scale soil carbon storage prediction method and device based on multi-source heterogeneous data fusion

The multi-scale soil carbon storage prediction method, which integrates multi-source heterogeneous data, solves the problems of insufficient scale adaptability and inadequate data utilization in existing technologies, and achieves high-precision cross-scale carbon storage prediction, supporting accurate decision-making in climate change research and land management.

CN122241455A · Pending · Publication Date: 2026-06-19 · NORTHWEST UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: NORTHWEST UNIV
Filing Date: 2026-03-20
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing methods for predicting soil carbon storage suffer from insufficient scale adaptability and inadequate utilization of multi-scale data, resulting in high uncertainty in prediction results and making it difficult to meet the precise needs of climate change research and land management at different scales.

Method used

A multi-scale soil carbon storage prediction method based on the fusion of multi-source heterogeneous data includes scale adaptation and standardization processing. It constructs a micro-anchoring layer, an intermediate connecting layer, and a macro-constraint layer. Combining a linear mixed-effects model and a random forest model, a multi-scale coupled model is formed to achieve the synergistic utilization of multi-scale data.

Benefits of technology

It significantly reduces the uncertainty of prediction results, improves the accuracy and applicability of cross-scale carbon storage prediction, and meets the precision needs of climate change research and land management.


Abstract

This application discloses a multi-scale soil carbon storage prediction method and apparatus based on multi-source heterogeneous data fusion. The method includes: performing scale adaptation and standardization processing on multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales; training a linear mixed-effects model using the plot-scale standard multi-source heterogeneous data to obtain a micro-anchoring layer; training a random forest model using the small watershed-scale standard multi-source heterogeneous data to obtain an intermediate connecting layer; training the random forest model using the regional-scale standard multi-source heterogeneous data and integrating spatial interpolation to obtain a macro-constraint layer; coupling the intermediate connecting layer, micro-anchoring layer, and macro-constraint layer to obtain a multi-scale coupled model and training it to obtain a multi-scale carbon storage prediction model; inputting the multi-source heterogeneous data of the area to be predicted into the multi-scale carbon storage prediction model to obtain the predicted carbon storage. This method can reduce the uncertainty of carbon storage prediction and improve the accuracy and applicability of cross-scale carbon storage prediction.

Description

Technical Field

[0001] This application relates to the field of carbon storage prediction technology, and in particular to a multi-scale soil carbon storage prediction method and apparatus based on multi-source heterogeneous data fusion.

Background Technology

[0002] Soil carbon, a core component of the global carbon cycle, comprises organic carbon (SOC) and inorganic carbon (SIC). Accurate prediction of soil carbon is crucial for climate change research and sustainable land management. Soil is not only an important carbon sink, but organic carbon is also a core indicator of soil quality, directly affecting soil fertility, structural stability, and water retention capacity, which are vital for maintaining ecosystem functions. Furthermore, increasing soil organic carbon can partially offset rising atmospheric CO2 concentrations, representing an important pathway to mitigating climate change. However, soil carbon distribution exhibits significant spatial scale dependence (from micro-plots to macro-regions), and its dynamic changes are comprehensively regulated by multiple factors, including climate, topography, erosion-deposition processes, land use patterns, and soil properties, leading to complex challenges in the accurate quantification and prediction of carbon storage.

[0003] Currently, various mature technical methods have been developed both domestically and internationally for estimating and predicting soil carbon storage. These methods mainly fall into the following categories: First, the soil type method, which estimates storage based on the carbon density characteristics of different soil types and the soil distribution area; second, the GIS estimation method, which uses geographic information system technology to integrate relevant spatial data of soil and completes spatial distribution prediction of carbon storage through spatial interpolation and other methods; third, the vegetation type method, which estimates carbon storage based on the carbon input characteristics of different vegetation types and the intrinsic relationship between vegetation cover and soil carbon storage; fourth, the model method, which simulates the input, decomposition, and fixation processes of soil carbon by constructing mathematical models to achieve dynamic prediction of storage; and fifth, the biozonation method, which estimates carbon storage based on the environmental characteristics of different biozonations and the correlation with soil carbon accumulation patterns. These methods all revolve around the need for soil carbon storage estimation, combining different data sources and technical means to form their respective application scenarios.

[0004] Existing methods for predicting soil carbon storage have significant limitations, concentrated in two areas: insufficient scale adaptability and inadequate utilization of multi-scale data. Specifically, most existing technologies employ a single-scale prediction approach and fail to fully consider the spatial scale dependence of soil carbon distribution. They focus on one specific scale (such as the plot scale or the regional scale) and ignore both the differences in soil carbon distribution patterns and the changing weights of influencing factors across scales. Furthermore, existing methods generally lack effective integration of multi-source heterogeneous data and in particular fail to fully utilize relevant data at different scales, for example plot-scale soil physicochemical property data, regional-scale climate and land use data, and small watershed-scale erosion-deposition process data. Because the limitations of single-scale data are not compensated by the synergistic use of multi-scale data, the applicability of prediction models is limited, which ultimately results in high uncertainty in soil carbon storage predictions and makes it difficult to meet the precise needs of climate change research and land management at different scales.

Summary of the Invention

[0005] This application provides a multi-scale soil carbon storage prediction method and apparatus by fusing multi-source heterogeneous data, which solves the problems of limited applicability and high uncertainty of existing carbon storage prediction methods.

[0006] In a first aspect, embodiments of this application provide a multi-scale soil carbon storage prediction method based on multi-source heterogeneous data fusion, including:

[0007] Multi-source heterogeneous data is scale-adapted and standardized to obtain standard multi-source heterogeneous data at different scales; wherein, the scales include plot scale, small watershed scale and regional scale.

[0008] A linear mixed-effects model is trained using standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer;

[0009] A random forest model is trained using standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer, and the validation set of the intermediate connecting layer consists of plot-scale data.

[0010] A random forest model is trained using standard multi-source heterogeneous data corresponding to the regional scale, and spatial interpolation is integrated to obtain a macro-constraint layer. The validation set of the macro-constraint layer consists of small watershed scale data.

[0011] Using the intermediate connecting layer as the hub, the micro-anchoring layer and the macro-constraint layer are embedded as sub-modules and coupled to obtain a multi-scale coupled model. The training set and output of the micro-anchoring layer, the intermediate connecting layer and the macro-constraint layer are mixed as training datasets to train the multi-scale coupled model, thus obtaining a multi-scale carbon storage prediction model.

[0012] The multi-source heterogeneous data of the area to be tested are input into the multi-scale carbon storage prediction model to obtain the corresponding predicted carbon storage.

[0013] In conjunction with the first aspect, in one possible implementation, the multi-source heterogeneous data includes physical data, chemical data, biological data, vegetation data, and environmental data;

[0014] The physical data include soil moisture content, soil bulk density, clay and silt content, sand content, gravel content, and soil porosity.

[0015] The chemical data include organic carbon (SOC) concentration, inorganic carbon (SIC) concentration, pH value, conductivity, total nitrogen, total phosphorus, carbon-to-nitrogen ratio, dissolved organic carbon, mineral nitrogen, and available phosphorus.

[0016] The biological data includes microbial biomass carbon, microbial biomass nitrogen, microbial biomass phosphorus, extracellular enzyme activity, biomass-specific enzyme activity, enzyme stoichiometry, and microbial community structure.

[0017] The vegetation data includes vegetation biomass, vegetation coverage, normalized difference vegetation index, vegetation type, root biomass, and litter volume.

[0018] The environmental data includes temperature, precipitation, altitude, slope, slope position, watershed area, gully density, erosion modulus, and disaster intensity.

[0019] In conjunction with the first aspect, in one possible implementation, the scaling and standardization of multi-source heterogeneous data to obtain standardized multi-source heterogeneous data of different scales includes:

[0020] Missing values in the multi-source heterogeneous data are filled in and outliers are corrected;

[0021] The multi-source heterogeneous data are classified according to their heterogeneity and labeled with corresponding scales;

[0022] Multi-source heterogeneous data is resampled hierarchically according to scale, and boundary calibration is performed on the resampled multi-source heterogeneous data;

[0023] Based on the classification of multi-source heterogeneous data, a standardization method that conforms to its data characteristics is used to standardize the data, resulting in standard multi-source heterogeneous data at different scales under multiple classifications.

[0024] In conjunction with the first aspect, one possible implementation also includes: establishing a spatiotemporal grid model to aggregate standard multi-source heterogeneous data at different scales, including:

[0025] The target region is discretized into an adaptive grid based on the scale of standard multi-source heterogeneous data;

[0026] The time step is determined based on the research period of the target area, so as to establish a spatiotemporal grid model based on the adaptive grid;

[0027] The spatial coordinates and time reference of the standard multi-source heterogeneous data are unified and mapped to the spatiotemporal grid model.
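The mapping of records onto a spatiotemporal grid can be illustrated with a minimal sketch. It assumes a regular square grid per scale rather than a truly adaptive one; the function names, the `CELL_SIZE_M` table, and the regional cell size of 1000 m are hypothetical choices for illustration, not values stated for this step.

```python
from math import floor

# Hypothetical cell sizes in metres for each scale (assumed for illustration)
CELL_SIZE_M = {"plot": 1.0, "small_watershed": 200.0, "regional": 1000.0}


def grid_index(x, y, t, origin, cell_size, t0, time_step):
    """Map a point (x, y) observed at time t to a (col, row, step) cell index."""
    col = floor((x - origin[0]) / cell_size)
    row = floor((y - origin[1]) / cell_size)
    step = floor((t - t0) / time_step)
    return col, row, step


def aggregate(records, scale, origin=(0.0, 0.0), t0=0.0, time_step=1.0):
    """Average all values that fall into the same space-time cell.

    records: iterable of (x, y, t, value) in a unified coordinate and time reference.
    """
    sums = {}
    for x, y, t, value in records:
        key = grid_index(x, y, t, origin, CELL_SIZE_M[scale], t0, time_step)
        total, count = sums.get(key, (0.0, 0))
        sums[key] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in sums.items()}


records = [(10, 20, 0.2, 4.0), (150, 190, 0.9, 6.0), (250, 20, 0.5, 1.0)]
cells = aggregate(records, "small_watershed")
```

Averaging is only one possible aggregation rule; a weighted mean or a median per cell would fit the same structure.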

[0028] In conjunction with the first aspect, in one possible implementation, training a linear mixed-effects model using standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer includes:

[0029] Extract the plot scale data corresponding to the plot scale from the standard multi-source heterogeneous data;

[0030] The plot-scale data are grouped according to the land use type of their sampling area, with the sample size in each group reaching at least a first number;

[0031] The subjective weights of the plot-scale data in each group are determined using the analytic hierarchy process;

[0032] The objective weights of the plot-scale data in each group are determined based on the information entropy of the plot-scale data, and the weighting factors are determined in combination with the subjective weights.

[0033] Multiple predictive factors are selected from the plot-scale data of each group using the weighting factors.

[0034] A linear mixed-effects model is constructed based on multiple predictors, and the likelihood value of each linear mixed-effects model is determined by maximum likelihood estimation.

[0035] The optimal linear mixed-effects model is determined based on the likelihood value, and the optimal linear mixed-effects model is refitted by restricted maximum likelihood estimation to obtain the micro-anchoring layer.
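The weighting steps in [0031] to [0033] can be sketched as follows. This is an illustration under stated assumptions: the objective weights use the standard entropy-weight method, the subjective weights are taken as given (for example from an analytic hierarchy process carried out separately), and the two are combined by a normalized product, which is an assumed combination rule since the text does not fix one. All function names are hypothetical.

```python
from math import log


def entropy_weights(matrix):
    """Entropy-weight method: rows are samples, columns are candidate factors."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        col = [row[j] for row in matrix]
        lo, hi = min(col), max(col)
        # Min-max scale to [0, 1]; a constant column is treated as uniform
        scaled = [(v - lo) / (hi - lo) if hi > lo else 1.0 for v in col]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        entropy = -sum(p * log(p) for p in probs if p > 0) / log(n)
        divergences.append(1.0 - entropy)
    total_div = sum(divergences)
    if total_div == 0:
        return [1.0 / m] * m  # no factor is informative: fall back to equal weights
    return [d / total_div for d in divergences]


def combine_weights(subjective, objective):
    """Assumed rule: normalized product of subjective and objective weights."""
    products = [s * o for s, o in zip(subjective, objective)]
    total = sum(products)
    return [p / total for p in products]


def select_predictors(names, weights, top_k):
    """Keep the top_k factors with the largest combined weight."""
    ranked = sorted(zip(names, weights), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]


names = ["clay", "constant", "ph"]
matrix = [[1, 5, 3], [2, 5, 1], [3, 5, 9], [4, 5, 2]]
objective = entropy_weights(matrix)
combined = combine_weights([0.5, 0.3, 0.2], objective)
chosen = select_predictors(names, combined, top_k=2)
```

A factor that never varies within a group receives an objective weight of essentially zero, so it drops out regardless of its subjective weight.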

[0036] In conjunction with the first aspect, in one possible implementation, the optimal linear mixed-effect model includes an optimal organic carbon prediction model and an optimal inorganic carbon prediction model;

[0037] The optimal organic carbon prediction model is as follows:

[0038] $SOC_{ij} = \beta_0 + \sum_{k=1}^{m} \beta_k X_{kij} + u_i + \varepsilon_{ij}$;

[0039] The optimal inorganic carbon prediction model is as follows:

[0040] $SIC_{ij} = \beta_0 + \gamma_1 Clay_{ij} + \gamma_2 pH_{ij} + u_i + \varepsilon_{ij}$;

[0041] In the formulas, $SOC_{ij}$ represents the organic carbon concentration of the j-th sample in the i-th land use type; $\beta_0$ represents the intercept term; $m$ represents the total number of predictors selected from the plot-scale data; $\beta_k$ represents the regression coefficient of the k-th predictor; $X_{kij}$ represents the k-th predictor in the plot-scale data of the j-th sample in the i-th land use type; $u_i$ represents the random effect of the i-th land use type; $\varepsilon_{ij}$ represents the random error of the j-th sample in the i-th land use type; $SIC_{ij}$ represents the inorganic carbon concentration of the j-th sample in the i-th land use type; $\gamma_1$ and $\gamma_2$ represent the regression coefficients of $Clay_{ij}$ and $pH_{ij}$, respectively; $Clay_{ij}$ represents the clay and silt content of the j-th sample in the i-th land use type; and $pH_{ij}$ represents the pH value of the j-th sample in the i-th land use type.

[0042] In conjunction with the first aspect, in one possible implementation, the step of training a random forest model with standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer includes:

[0043] The first training set for constructing the random forest model includes small watershed-scale data extracted from the standard multi-source heterogeneous data, and the plot-scale data is weighted and expanded according to the land use ratio of the small watershed to serve as the first validation set.

[0044] Set the core parameters of the random forest model; among which, the core parameters include the number of decision trees, the number of variables in the decision tree splits, and the minimum number of samples per node;

[0045] The random forest model is trained using the first training set to identify key correction areas based on the deviation of the predicted values;

[0046] The correlation between clay content and organic or inorganic carbon in the key correction area is calculated to determine the carbon stability coefficient;

[0047] The predicted values of the random forest model are corrected based on the carbon stability coefficient, and the random forest model is optimized using the first validation set to determine the intermediate connecting layer;

[0048] Based on the corrected predicted values and the small watershed-scale data, the carbon storage of each soil layer in the key correction area is determined, and the carbon storage of all layers is summed to obtain the total profile storage.
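The layer-by-layer summation in [0048] can be illustrated with a short sketch. The per-layer formula is not given in this passage; the code assumes the commonly used fixed-depth stock equation (concentration × bulk density × thickness × non-gravel fraction, with a factor of 0.1 to convert g/kg · g/cm³ · cm into t/ha). The function and field names are hypothetical.

```python
def layer_stock_t_per_ha(carbon_g_per_kg, bulk_density_g_cm3, thickness_cm,
                         gravel_fraction=0.0):
    """Carbon stock of one soil layer in t/ha (assumed fixed-depth equation)."""
    return (carbon_g_per_kg * bulk_density_g_cm3 * thickness_cm
            * (1.0 - gravel_fraction) * 0.1)


def profile_stock_t_per_ha(layers):
    """Total profile storage: the sum of the stocks of all layers."""
    return sum(layer_stock_t_per_ha(**layer) for layer in layers)


layers = [
    {"carbon_g_per_kg": 10.0, "bulk_density_g_cm3": 1.2, "thickness_cm": 20.0},
    {"carbon_g_per_kg": 8.0, "bulk_density_g_cm3": 1.3, "thickness_cm": 20.0,
     "gravel_fraction": 0.1},
]
total = profile_stock_t_per_ha(layers)
```

The carbon concentration passed in would be the corrected predicted value for each layer; bulk density, thickness, and gravel content would come from the small watershed-scale data.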

[0049] In conjunction with the first aspect, in one possible implementation, the step of training a random forest model with standard multi-source heterogeneous data corresponding to the regional scale and integrating spatial interpolation to obtain a macro-constraint layer includes:

[0050] A second training set for the random forest model is constructed. The second training set includes regional scale data corresponding to the regional scale extracted from the standard multi-source heterogeneous data, and the small watershed scale data is weighted and expanded according to the regional land use ratio to serve as the second validation set.

[0051] Multiple core factors are selected from the regional-scale data based on the SHAP value, and factor weights are assigned to the corresponding core factors based on the SHAP value.

[0052] Set the core parameters of the random forest model and train the random forest model using the second training set;

[0053] Based on the second validation set, the core parameters of the random forest model are optimized using a grid search method;

[0054] The disaster intensity in the standard multi-source heterogeneous data is quantified to determine the disaster intensity index;

[0055] The predicted values output by the macro-constraint layer are corrected using the disaster intensity index;

[0056] The carbon density of all grid cells within the prediction area is determined based on the corrected prediction values;

[0057] Co-kriging interpolation is used with normalized vegetation index as an auxiliary variable to improve the spatial continuity of the prediction area, thus obtaining the macro-constraint layer; wherein the output of the macro-constraint layer is a raster map of the organic / inorganic carbon storage of the prediction area.

[0058] In a second aspect, embodiments of this application provide a multi-scale soil carbon storage prediction device based on multi-source heterogeneous data fusion, comprising:

[0059] The data standardization module is used to perform scale adaptation and standardization processing on multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales; wherein, the scales include plot scale, small watershed scale and regional scale.

[0060] The first training module is used to train a linear mixed-effects model with standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer.

[0061] The second training module is used to train a random forest model with standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer, and the validation set of the intermediate connecting layer is composed of plot-scale data.

[0062] The third training module is used to train a random forest model with standard multi-source heterogeneous data corresponding to the regional scale, and integrate spatial interpolation to obtain a macro-constraint layer, wherein the validation set of the macro-constraint layer is composed of small watershed scale data.

[0063] The coupling module is used to embed the micro-anchoring layer and the macro-constraint layer as sub-modules with the intermediate connecting layer as the hub, and couple them to obtain a multi-scale coupling model. The training set and output of the micro-anchoring layer, the intermediate connecting layer and the macro-constraint layer are mixed as training datasets to train the multi-scale coupling model and obtain a multi-scale carbon storage prediction model.

[0064] The prediction module is used to input multi-source heterogeneous data of the area to be tested into the multi-scale carbon storage prediction model to obtain the corresponding predicted carbon storage.

[0065] In a third aspect, embodiments of this application provide an apparatus comprising: a processor; and a memory for storing processor-executable instructions; wherein, when the processor executes the executable instructions, it implements the method described in the first aspect or in any possible implementation of the first aspect.

[0066] One or more technical solutions provided in the embodiments of this application have at least the following technical effects or advantages:

[0067] The embodiments of this application, through scale adaptation and standardization of multi-source heterogeneous data, unify the format and spatial reference of data from different sources and scales, eliminating interference caused by data heterogeneity and laying a solid data foundation for the subsequent construction of multi-scale models. By constructing a hierarchical model with a micro-anchoring layer, an intermediate connecting layer, and a macro-constraint layer, and using the intermediate connecting layer as a hub to couple the models of each layer, the embodiments fully utilize the accuracy of plot-scale data, the transitional nature of small watershed-scale data, and the macroscopic nature of regional-scale data, achieving the synergistic utilization of multi-scale data. By constructing a multi-scale coupled model, the embodiments compensate for the limitations of single-scale prediction, reduce the uncertainty of prediction results, and improve the accuracy and applicability of cross-scale carbon storage prediction. This not only meets the needs of climate change research for monitoring dynamic changes in carbon storage at different scales, but also provides scientific decision support for land management departments to formulate precise carbon sink enhancement strategies and optimize land use structure.

Attached Figure Description

[0068] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0069] Figure 1 is a flowchart of the multi-scale soil carbon storage prediction method based on multi-source heterogeneous data fusion provided in the embodiments of this application;

[0070] Figure 2 is a schematic diagram of the structure of the multi-scale soil carbon storage prediction device based on multi-source heterogeneous data fusion provided in the embodiments of this application;

[0071] Figure 3 is an example diagram at the plot scale provided in the embodiments of this application;

[0072] Figure 4 is an example diagram at the small watershed scale provided in the embodiments of this application;

[0073] Figure 5 is an example diagram at the regional scale provided in the embodiments of this application.

Detailed Implementation

[0074] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0075] The following description of some technologies involved in the embodiments of this application is provided to aid understanding and should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this application. Similarly, for clarity and brevity, some descriptions of well-known functions and structures are omitted in the following description.

[0076] Figure 1 is a flowchart of a multi-scale soil carbon storage prediction method based on multi-source heterogeneous data fusion provided in the embodiments of this application, including steps 101 to 106. The execution order shown in Figure 1 is merely one example and does not represent the only possible execution order for such a method. The execution order can be adjusted as long as the desired final result is achieved; the steps shown in Figure 1 can be performed in parallel or in reverse order.

[0077] Step 101: Perform scale adaptation and standardization processing on the multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales; among which, the scales include plot scale, small watershed scale and regional scale.

[0078] In this embodiment, the multi-source heterogeneous data includes physical data, chemical data, biological data, vegetation data, and environmental data. Physical data includes soil moisture content, soil bulk density, clay and silt content, sand content, gravel content, and soil porosity. Chemical data includes organic carbon (SOC) concentration, inorganic carbon (SIC) concentration, pH value, electrical conductivity, total nitrogen, total phosphorus, carbon-to-nitrogen ratio, dissolved organic carbon, mineral nitrogen, and available phosphorus. Biological data includes microbial biomass carbon, microbial biomass nitrogen, microbial biomass phosphorus, extracellular enzyme activity, biomass-specific enzyme activity, enzyme stoichiometry, and microbial community structure. Vegetation data includes vegetation biomass, vegetation cover, normalized difference vegetation index, vegetation type, root biomass, and litter volume. Environmental data includes temperature, precipitation, altitude, slope, slope position, watershed area, gully density, erosion modulus, disaster intensity, and land use type.

[0079] Here, disaster intensity refers to the intensity with which mountain disasters disturb the soil carbon cycle. Its core is to quantify the degree of damage that disasters cause to the soil erosion-deposition process and to carbon pool stability, and to output a standardized disaster intensity index that can be used directly in calculation. Disaster intensity can be expressed as the number of disasters within the study period, converted to events per 10 years; as the vector area of the disaster-affected zone obtained by remote sensing interpretation; or as the vegetation destruction rate. The vegetation destruction rate is quantified as the ratio of the difference between the pre-disaster mean and the post-disaster mean of the normalized difference vegetation index to the pre-disaster mean. This application exemplarily uses the vegetation destruction rate to represent disaster intensity.
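The vegetation destruction rate described above reduces to a single ratio of mean NDVI values. A minimal sketch follows; the function name and the guard against a non-positive pre-disaster mean are illustrative assumptions.

```python
def vegetation_destruction_rate(ndvi_pre, ndvi_post):
    """(mean pre-disaster NDVI - mean post-disaster NDVI) / mean pre-disaster NDVI."""
    mean_pre = sum(ndvi_pre) / len(ndvi_pre)
    mean_post = sum(ndvi_post) / len(ndvi_post)
    if mean_pre <= 0:
        # Assumed guard: the ratio is not meaningful without vegetation beforehand
        raise ValueError("pre-disaster mean NDVI must be positive")
    return (mean_pre - mean_post) / mean_pre


rate = vegetation_destruction_rate([0.8, 0.6], [0.35, 0.35])
```

A rate near 0 means vegetation was largely unaffected, and a rate near 1 means it was almost completely removed.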

[0080] Multi-source heterogeneous data can be obtained by hierarchically sampling the target region according to scale during the research period to obtain multi-source heterogeneous data of the target region at different scales. Multi-source heterogeneous data can also come from open source data on official websites.

[0081] In this embodiment, performing scale adaptation and standardization processing on the multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales includes: filling in missing values in the multi-source heterogeneous data and correcting outliers; classifying the multi-source heterogeneous data according to its heterogeneity and labeling it with the corresponding scale; resampling the multi-source heterogeneous data hierarchically according to scale and performing boundary calibration on the resampled data; and, based on the classification of the multi-source heterogeneous data, standardizing it with a standardization method that conforms to its data characteristics, to obtain standard multi-source heterogeneous data at different scales under multiple classifications.

[0082] Specifically, when a small share of values is missing (single-indicator missing rate <5%), the KNN (K-nearest neighbor) imputation method is used, with the same scale and land use type as neighborhood constraints (e.g., a neighborhood radius ≤5 m at the plot scale and ≤5 km at the regional scale). When a larger share of values is missing (single-indicator missing rate of 5%-20%), a random forest imputation model is constructed, and auxiliary factors with a correlation ≥0.6 are used as inputs (e.g., clay content and total nitrogen are used to fill in missing organic carbon values). For missing key factors (e.g., erosion modulus), the scale substitution method is used: missing plot-scale erosion modulus values are replaced with the mean erosion modulus at the small watershed scale, and missing small watershed-scale erosion modulus values are replaced with values from the regional-scale spatial distribution of the erosion modulus.
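The choice among the three imputation routes can be sketched as a simple dispatcher, together with a radius-constrained nearest-neighbor fill. This is an illustration only: the function names are hypothetical, the neighbors are assumed to be pre-filtered to the same scale and land use type, distance is assumed to be plain Euclidean distance in map coordinates, and missing rates above 20% are returned as "unspecified" because the text does not cover them.

```python
from math import hypot


def choose_imputation(missing_rate, is_key_factor=False):
    """Pick an imputation route from the single-indicator missing rate (0 to 1)."""
    if is_key_factor:
        return "scale_substitution"
    if missing_rate < 0.05:
        return "knn"
    if missing_rate <= 0.20:
        return "random_forest"
    return "unspecified"  # above 20 %: not covered by the description


def knn_impute(target_xy, neighbours, k=3, max_radius=5.0):
    """Mean of the k nearest observed values within max_radius of target_xy.

    neighbours: list of (x, y, value), assumed to share scale and land use type.
    Returns None when no neighbour lies within the radius.
    """
    in_range = sorted(
        (hypot(x - target_xy[0], y - target_xy[1]), value)
        for x, y, value in neighbours
        if hypot(x - target_xy[0], y - target_xy[1]) <= max_radius
    )
    nearest = in_range[:k]
    if not nearest:
        return None
    return sum(value for _, value in nearest) / len(nearest)


neighbours = [(0, 1, 10.0), (1, 0, 12.0), (3, 0, 20.0), (10, 10, 99.0)]
filled = knn_impute((0, 0), neighbours, k=2, max_radius=5.0)
```

With `max_radius=5.0` the point at (10, 10) is excluded, which mirrors the plot-scale radius constraint of 5 m.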

[0083] For outlier handling in multi-source heterogeneous data, this application uses a dual-criterion method combining statistical thresholds and ecological rationality. Specifically, outliers are first screened statistically using box plots (the 1.5 × IQR rule), and the flagged values are then checked for ecological rationality against carbon cycle patterns (e.g., abnormally high organic carbon values need to be verified to determine whether they are due to sample contamination, and for abnormally low inorganic carbon values, leaching in high-precipitation areas needs to be ruled out as the cause). Confirmed outliers are then replaced with the 95th percentile of data of the same scale and erosion type to avoid the information loss caused by direct deletion.
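The statistical half of this procedure can be sketched with the standard library. The sketch assumes that the input list already contains only values of the same scale and erosion type, that the 95th percentile is computed from the non-flagged values, and that every flagged value is replaced; the ecological plausibility check is a manual step and is not modeled. The function name is hypothetical.

```python
from statistics import quantiles


def replace_outliers(values, fence=1.5):
    """Flag values outside the box-plot fences and replace them with the
    95th percentile of the remaining values (assumed reference set)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - fence * iqr, q3 + fence * iqr
    inliers = [v for v in values if low <= v <= high]
    # quantiles(..., n=100) returns 99 cut points; index 94 is the 95th percentile
    p95 = quantiles(inliers, n=100, method="inclusive")[94]
    return [v if low <= v <= high else p95 for v in values]


cleaned = replace_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Here the value 100 lies above the upper fence and is replaced, while the other nine values pass through unchanged.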

[0084] Based on its heterogeneity, the multi-source heterogeneous data in this application is exemplarily divided into 5 categories: continuous numerical data (such as organic carbon (SOC) concentration, inorganic carbon (SIC) concentration, and clay and silt content; covering all scales, obtained through actual measurement, with differing dimensions and large differences in magnitude); discrete numerical data (such as erosion modulus; covering the small watershed scale and the regional scale, nonlinearly distributed, with a high proportion of extreme values); categorical data (such as land use type and slope position; covering all scales, with no quantitative value and with semantic differences); process data (such as vegetation biomass and normalized difference vegetation index; covering the plot scale and the regional scale, strongly correlated with carbon cycle processes, and dynamically changing); and spatial data (such as slope and altitude; covering all scales, with large differences in resolution).

[0085] To address the resolution differences in multi-source heterogeneous data at different scales, this application employs hierarchical resampling and boundary calibration to ensure spatial consistency of the data. Specifically, when resampling to a high resolution (e.g., from a 30 m DEM (slope / elevation) to a plot-scale resolution of 1 m), bilinear interpolation is used to preserve micro-topographic details (e.g., slope inflection points). Then, using actual plot-scale slope records as constraints, the interpolated slope classification error is corrected (e.g., reducing the boundary error between gentle and steep slopes to within 1 m). When resampling low-resolution data (e.g., from 1 km mean annual temperature (MAT) data to a small watershed resolution of 200 m), co-kriging interpolation is used, introducing a high-resolution normalized difference vegetation index as an auxiliary variable to improve the spatial continuity of climate data. Scale correction uses measured meteorological station data at the small watershed scale (e.g., MAT from regional automatic weather stations) for bias calibration, using the formula $MAT_{\text{corrected}} = MAT_{\text{interpolated}} \times k + b$, where $k$ is the scale correction coefficient (1.02-1.05 for the small watershed scale) and $b$ is the bias compensation value (obtained by regression on measured values).
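The bias calibration formula can be illustrated with a short sketch. It assumes that k is fixed in advance within the stated range and that b is then the least-squares intercept for that fixed slope, which equals the mean residual between station measurements and scaled interpolated values. The text only says b is obtained by regression, so this fitting rule and the function names are assumptions.

```python
def fit_bias(interpolated, measured, k):
    """Least-squares intercept b for a fixed slope k: mean of (measured - k * interpolated)."""
    residuals = [m - k * x for x, m in zip(interpolated, measured)]
    return sum(residuals) / len(residuals)


def correct_mat(interpolated_value, k, b):
    """Apply MAT_corrected = MAT_interpolated * k + b."""
    return interpolated_value * k + b


k = 1.03  # assumed value inside the stated 1.02-1.05 range
b = fit_bias([10.0, 12.0], [10.8, 12.86], k)
corrected = correct_mat(11.0, k, b)
```

If both k and b were to be estimated from station data, an ordinary linear regression of measured on interpolated values would replace `fit_bias`.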

[0086] Furthermore, based on the administrative boundaries at the regional scale or the watershed boundaries at the watershed scale, the multi-source heterogeneous data at the plot scale and small watershed scale can be cropped and spliced to ensure that the spatial extent has no overlaps or omissions.

[0087] Furthermore, based on the classification of multi-source heterogeneous data, standardization methods that conform to their data characteristics are adopted for standardization processing to preserve their ecological significance.

[0088] Specifically, for continuous numerical multi-source heterogeneous data, this application introduces factor contribution weights (determined based on previous literature and preliminary experiments, reflecting the strength of the factor's influence on carbon storage), as shown in the following formula:

[0089] $$X' = w \cdot \frac{X - \mu}{\sigma}$$

[0090] In the formula, $X'$ represents the standardized multi-source heterogeneous data; $X$ represents the original multi-source heterogeneous data; $w$ represents the factor contribution weight; $\mu$ represents the mean of the multi-source heterogeneous data of the same type at the same scale; and $\sigma$ represents the standard deviation of the multi-source heterogeneous data of the same type at the same scale.
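As an illustration of the weighted standardization for continuous numerical data, the following Python sketch assumes that the factor contribution weight simply multiplies an ordinary Z-score computed from the mean and standard deviation of same-type, same-scale data. The function name and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def weighted_zscore(values, weight):
    """Return weight * (x - mean) / std for each value.

    Assumptions: the population standard deviation is used, and a constant
    series (std == 0) maps to zeros instead of raising an error.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [weight * (v - mu) / sigma for v in values]


# Hypothetical SOC concentrations (g/kg) with a hypothetical contribution weight
soc_standardized = weighted_zscore([10.0, 20.0, 30.0], weight=0.5)
```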

[0091] For discrete numerical multi-source heterogeneous data, which are nonlinearly distributed and contain many extreme values, direct standardization would let the extreme values dominate; this application therefore adopts a standardization method of logarithmic transformation followed by segmentation. Specifically, the multi-source heterogeneous data are first subjected to a $\log_{10}(X+1)$ transformation (the +1 keeps zero values meaningful), which weakens the influence of extreme values. The log-transformed data are then divided into 3 segments according to ecological thresholds (e.g., by erosion modulus: "low" indicates an erosion modulus less than 5000 t km⁻² yr⁻¹, "medium" indicates an erosion modulus between 5000 and 15000 t km⁻² yr⁻¹, and "high" indicates an erosion modulus greater than 15000 t km⁻² yr⁻¹), and each segment is individually Z-score standardized.
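A minimal Python sketch of this log-then-segment standardization is given below. It assumes that segment membership is decided on the raw erosion modulus using the 5000 and 15000 thresholds, that the Z-score within each segment is computed on the log-transformed values, and that the population standard deviation is used; none of these details is fixed by the text, and the names are hypothetical.

```python
import math
from statistics import mean, pstdev

LOW_MAX = 5000.0    # t km^-2 yr^-1, upper bound of the "low" segment
HIGH_MIN = 15000.0  # t km^-2 yr^-1, lower bound of the "high" segment


def segment_of(erosion_modulus):
    """Assign a raw erosion modulus to the low / medium / high segment."""
    if erosion_modulus < LOW_MAX:
        return "low"
    if erosion_modulus <= HIGH_MIN:
        return "medium"
    return "high"


def log_piecewise_zscore(values):
    """log10(x + 1) transform, then Z-score separately inside each segment."""
    logged = [math.log10(v + 1.0) for v in values]
    groups = {}
    for idx, v in enumerate(values):
        groups.setdefault(segment_of(v), []).append(idx)
    result = [0.0] * len(values)
    for indices in groups.values():
        segment = [logged[i] for i in indices]
        mu, sigma = mean(segment), pstdev(segment)
        for i in indices:
            # A segment with no spread is mapped to zero (assumption).
            result[i] = 0.0 if sigma == 0 else (logged[i] - mu) / sigma
    return result


# Hypothetical erosion modulus values, two per segment
z = log_piecewise_zscore([1000.0, 4000.0, 6000.0, 12000.0, 20000.0, 50000.0])
```

Because each segment is standardized on its own, a very large value in the "high" segment cannot compress the spread of the "low" segment.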

[0092] For categorical multi-source heterogeneous data, this application first assigns quantitative values to the classification indicators based on ecological significance (e.g., slope position: erosion zone = 1, sedimentation zone = 2, undisturbed zone = 3; land use: cultivated land = 1, grassland = 2, forest = 3). The quantified categorical data are then converted into binary vectors (e.g., the slope position "sedimentation zone" is encoded as [0,1,0]) so that the magnitude of the class values is not misread as a priority order, and so that classification semantics can be corrected across scales (e.g., at the regional scale, "cultivated land" includes "terraced fields + sloping cultivated land", which needs to be encoded as [1,1,0] to reflect the composite land use).
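The binary encoding can be sketched in Python as follows. The slope position vocabulary is taken from the text; the regional vocabulary used for the composite [1,1,0] case is hypothetical, since the text does not list which three classes the vector positions stand for.

```python
SLOPE_POSITIONS = ["erosion zone", "sedimentation zone", "undisturbed zone"]

# Hypothetical regional vocabulary, used only to illustrate a composite class
REGIONAL_CULTIVATED = ["terraced fields", "sloping cultivated land", "other"]


def encode(categories, present):
    """Return a binary vector with 1 for every category that is present.

    A single label gives a one-hot vector; several labels give a multi-hot
    vector, as needed for composite classes at coarser scales.
    """
    present_set = {present} if isinstance(present, str) else set(present)
    unknown = present_set - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if c in present_set else 0 for c in categories]


sediment_vec = encode(SLOPE_POSITIONS, "sedimentation zone")
composite_vec = encode(REGIONAL_CULTIVATED, ["terraced fields", "sloping cultivated land"])
```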

[0093] For process-oriented, multi-source heterogeneous data, which are strongly correlated with carbon cycle processes, it is necessary to preserve their dynamic characteristics. This application introduces a microbial activity correction coefficient (based on plot-scale measured data, f=EEA / MBC), with the standardized formula as follows:

[0094] ,

[0095] In the formula, the terms denote the standardized multi-source heterogeneous data, the original multi-source heterogeneous data, and the microbial activity correction coefficient f, which reflects microbial metabolic strategies (e.g., f = 0.3 under stress and f = 0.1 under normal conditions).

[0096] The normalized vegetation index is calculated by normalizing vegetation cover and calibrating it using measured vegetation cover, as shown in the following formula:

[0097] ,

[0098] In the formula, the terms denote the standardized normalized vegetation index, the original normalized vegetation index, the maximum and minimum values of the original normalized vegetation index, and the measured value of vegetation cover. Calibrating against the measured value preserves the correlation between the normalized vegetation index and the actual vegetation cover.

[0099] For spatial multi-source heterogeneous data, standardization is not required here as the above-mentioned unified spatial coordinates, hierarchical resampling, and boundary calibration are already in place.

[0100] In addition, a spatiotemporal grid model can be established to aggregate standard multi-source heterogeneous data at different scales, including: discretizing the target region into an adaptive grid according to the scale of the standard multi-source heterogeneous data; determining the time step according to the research period of the target region to establish a spatiotemporal grid model based on the adaptive grid; and unifying the spatial coordinates and time reference of the standard multi-source heterogeneous data and mapping them to the spatiotemporal grid model.

[0101] Specifically, the adaptive grid discretization in this application adopts a multi-scale grid division strategy under terrain constraints. At the plot scale, a 1m×1m square grid is generated centered on the actual sampling point to ensure that each grid covers a single sampling unit. At the small watershed scale, 200m×200m polygonal grids are divided based on the valley lines and ridge lines extracted by the digital elevation model (DEM) as boundaries to avoid grids crossing hydrological units within the watershed. At the regional scale, regular 1km×1km grids are divided based on administrative boundaries or large watershed boundaries to balance spatial coverage and computational efficiency. The determination of the time step needs to match the temporal resolution of the data and the research objectives: if the research period is 5 years and interannual variation is the focus, the time step is set to 1 year; if seasonal dynamics need to be analyzed, the time step is divided into quarters (3 months), and monthly meteorological data (such as temperature and precipitation) are aggregated to the quarterly step through arithmetic mean. During the data mapping process, a layered filling method is adopted for data at different scales: plot-scale data is directly assigned to the corresponding 1m grid cell; small watershed-scale data is filled to 200m grid cells through spatial interpolation (such as the inverse distance weighting method), and bias correction is performed by combining the statistical characteristics of plot data within the watershed (such as the mean SOC); regional-scale data is mapped to 1km grid cells using co-kriging, with small watershed data as an auxiliary variable. Simultaneously, a spatiotemporal index table is established to record the data source and data quality labels (such as measured, interpolated, and aggregated) for each grid cell at different time steps, ensuring that the data source can be traced during subsequent model training.
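The aggregation of monthly meteorological data to a quarterly time step by arithmetic mean can be sketched as follows. The function name is hypothetical, and the sketch assumes the series starts at the first month of a quarter and contains whole quarters only.

```python
from statistics import mean


def monthly_to_quarterly(monthly_values):
    """Aggregate a monthly series to quarterly steps by arithmetic mean.

    Assumption: the series begins at the start of a quarter and its length
    is a multiple of 3.
    """
    if len(monthly_values) % 3 != 0:
        raise ValueError("series length must be a multiple of 3")
    return [mean(monthly_values[i:i + 3]) for i in range(0, len(monthly_values), 3)]


# Hypothetical monthly mean temperatures for one year (degrees Celsius)
quarterly = monthly_to_quarterly(
    [-2.0, 1.0, 7.0, 13.0, 18.0, 23.0, 25.0, 24.0, 17.0, 11.0, 4.0, 0.0]
)
```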

[0102] Those skilled in the art should realize that the specific scales of plot scale, small watershed scale and regional scale can be set according to actual needs. The data of different scales given in this application are only an embodiment of this application and are not intended to limit the scope of protection of this application.

[0103] Step 102: Train a linear mixed-effects model using the standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer. In this embodiment, plot-scale data are extracted from the standard multi-source heterogeneous data; the plot-scale data are grouped according to the land use type of their sampling area, with the sample size in each group reaching a first quantity; the subjective weights of the plot-scale data in each group are determined by the analytic hierarchy process (AHP); the objective weights of the plot-scale data in each group are determined based on the information entropy of the plot-scale data and combined with the subjective weights to determine weight factors; multiple predictive factors are screened from the plot-scale data of each group using the weight factors; linear mixed-effects models are constructed based on the multiple predictive factors, and the likelihood value of each linear mixed-effects model is determined using maximum likelihood estimation; the optimal linear mixed-effects model is determined based on the likelihood values and is refitted using restricted maximum likelihood estimation to obtain the micro-anchoring layer.

[0104] Specifically, the first quantity is set to ≥8 to meet the basic sample size requirement of the linear mixed-effects model. If the sample size of a land use type group is insufficient, it is supplemented using similar adjacent plots in the same area (e.g., sample expansion based on a soil texture and topographic similarity ≥0.8). When determining subjective weights using the analytic hierarchy process (AHP), a three-layer structure is constructed: "target layer (carbon storage prediction accuracy) - criterion layer (physical / chemical / biological / vegetation / environmental data categories) - indicator layer (specific factors)". Five to seven experts in the field of soil carbon cycling are invited to make pairwise comparisons and score the indicators at each layer; the resulting judgment matrix must pass the consistency test (CR<0.1), after which the subjective weight of each factor is obtained. When screening predictive factors based on the weight factors, the top 20% of predictive factors by weight are retained, and multicollinearity is excluded using the variance inflation factor (VIF<5), ultimately yielding 10-15 predictive factors (such as organic carbon concentration, clay content, microbial biomass carbon, and normalized vegetation index). In constructing the linear mixed-effects model, plot carbon storage is the dependent variable, the key factors are fixed effects, and the spatial clustering of the sampling area (such as topographic units) is a random effect. Maximum likelihood estimation (ML) is used to calculate the model likelihood values for different random effect structures, and the candidate model with the highest likelihood value is selected. Then, restricted maximum likelihood estimation (REML) is used to refit the model to eliminate the influence of the fixed effects on the estimation of the random effects. The resulting micro-anchoring layer outputs the predicted plot-scale carbon storage together with its 95% confidence interval.
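The AHP consistency test (CR<0.1) mentioned above can be illustrated with a short Python sketch. It approximates the principal eigenvector by the row geometric mean method, which is a common simplification and not necessarily the procedure used in this application; the random index table is Saaty's standard one, and the judgment matrix is hypothetical.

```python
import math

# Saaty's random consistency index (RI) for judgment matrices of order 1 to 9
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Return (weights, consistency_ratio) for a pairwise comparison matrix.

    Weights use the row geometric mean approximation of the principal
    eigenvector (an assumption made for this sketch).
    """
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    weights = [g / total for g in geo_means]
    # Estimate the largest eigenvalue from (A w)_i / w_i
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1) if n > 1 else 0.0
    ri = RANDOM_INDEX[n]
    cr = 0.0 if ri == 0 else ci / ri
    return weights, cr


# Hypothetical judgment matrix for three criteria (perfectly consistent)
judgment = [[1.0, 2.0, 6.0],
            [1 / 2, 1.0, 3.0],
            [1 / 6, 1 / 3, 1.0]]
weights, cr = ahp_weights(judgment)
passes_consistency = cr < 0.1
```

A matrix that fails the CR<0.1 test would normally be returned to the experts for revision before its weights are used.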

[0105] The fixed effects component of the optimal linear mixed-effects model characterizes the common drivers of carbon storage across land use types, while the random effects component captures the spatial heterogeneity among samples within the same type (such as differences in plot micro-topography and tillage practices). After the model is refitted, plot-scale predictors (clay and silt content, pH, etc.) are input into the model to obtain point estimates of organic carbon (SOC) and inorganic carbon (SIC) storage for each plot sample, together with 95% confidence intervals. For samples whose confidence interval has a relative error of more than 15%, secondary verification is performed using the soil profile records from field sampling to eliminate abnormal results caused by sample contamination or sampling errors. The micro-anchoring layer also generates a contribution matrix for the predictors (for example, clay and silt content contributes 32% on average), which provides a priority basis for selecting core factors in the subsequent small watershed-scale model. Furthermore, this layer can output a spatial distribution heatmap of carbon storage at the plot scale, visually displaying local differences in carbon storage.

[0106] For example, the micro-anchoring layer is fitted with a maximum of 1000 iterations and a convergence threshold of 1e-6.

[0107] In this embodiment of the application, the optimal linear mixed effect model includes an optimal organic carbon prediction model and an optimal inorganic carbon prediction model;

[0108] The optimal organic carbon prediction model is as follows:

[0109] $$SOC_{ij} = \beta_0 + \sum_{k=1}^{m} \beta_k X_{ijk} + u_i + \varepsilon_{ij}$$

[0110] The optimal inorganic carbon prediction model is as follows:

[0111] $$SIC_{ij} = \beta_0 + \gamma_1\, CS_{ij} + \gamma_2\, pH_{ij} + u_i + \varepsilon_{ij}$$

[0112] In the formula, $SOC_{ij}$ represents the organic carbon concentration of the j-th sample in the i-th land use type; $\beta_0$ represents the intercept term; m represents the total number of predictors selected from the plot-scale data; $\beta_k$ represents the regression coefficient of the k-th predictor; $X_{ijk}$ represents the k-th predictor in the plot-scale data of the j-th sample in the i-th land use type; $u_i$ represents the land use random effect of the i-th land use type; $\varepsilon_{ij}$ represents the random error of the j-th sample in the i-th land use type; $SIC_{ij}$ represents the inorganic carbon concentration of the j-th sample in the i-th land use type; $\gamma_1$ and $\gamma_2$ represent the regression coefficients of $CS_{ij}$ and $pH_{ij}$, respectively; $CS_{ij}$ represents the clay and silt content of the j-th sample in the i-th land use type; and $pH_{ij}$ represents the pH value of the j-th sample in the i-th land use type.
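To show how a fitted model of this kind would be evaluated for one sample, the following Python sketch assumes the usual random-intercept form suggested by the symbol definitions in [0112]: an intercept, a weighted sum of predictors, and a land use random effect. The function name and all coefficient values are hypothetical and are not fitted results from this application.

```python
def predict_carbon(intercept, coefficients, predictors, land_use_effect=0.0):
    """Point prediction: intercept + sum(coef_k * x_k) + random effect of the land use type.

    The residual error term has zero mean and therefore does not appear in
    the point prediction.
    """
    if len(coefficients) != len(predictors):
        raise ValueError("coefficients and predictors must have the same length")
    fixed_part = sum(b * x for b, x in zip(coefficients, predictors))
    return intercept + fixed_part + land_use_effect


# Hypothetical inorganic carbon model: predictors are clay-and-silt content and pH
sic_estimate = predict_carbon(
    intercept=1.0,
    coefficients=[0.5, 2.0],
    predictors=[2.0, 0.25],
    land_use_effect=0.1,
)
```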

[0113] The micro-anchoring layer can output factor regression coefficients, i.e. the influence coefficients of each fixed effect variable on organic or inorganic carbon density; the erosion-deposition carbon difference, i.e. the difference in carbon density between the erosion area and the deposition area, can reflect the magnitude of the influence of the erosion-deposition process on the carbon pool; and the land use random effect variance can reflect the random influence of different land use types on carbon density. The larger the variance, the more significant the influence.

[0114] Step 103: Train a random forest model using the standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer. The validation set of the intermediate connecting layer consists of plot-scale data. In this embodiment, a first training set for the random forest model is constructed. The first training set includes small watershed-scale data extracted from the standard multi-source heterogeneous data, and the plot-scale data are weighted and expanded according to the land use ratio of the small watershed to serve as the first validation set. The core parameters of the random forest model are set; they include the number of decision trees, the number of variables considered at each decision tree split, and the minimum number of samples per node. The random forest model is trained using the first training set, and key correction areas are identified based on the deviation of the predicted values. The correlation between clay content and organic / inorganic carbon in the key correction areas is calculated to determine the carbon stability coefficient. The predicted values of the random forest model are corrected based on the carbon stability coefficient, and the random forest model is optimized using the first validation set to determine the intermediate connecting layer. The carbon storage of each layer in the key correction areas is determined based on the corrected predicted values and the small watershed-scale data, and the carbon storage of all layers is summed to obtain the total profile storage.

[0115] Specifically, the first training set uses small watershed-scale data (exemplarily set to a 30m resolution, as shown in Figure 4), which encompass standardized topographic factors (slope, aspect, elevation), vegetation factors (quarterly mean normalized vegetation index), meteorological factors (annual mean temperature, annual precipitation), soil property factors (spatial interpolation results of surface organic carbon density), and land use type coding vectors. The first validation set is constructed as follows: based on the area proportion of each land use type within the small watershed, weights are assigned to the plot-scale carbon storage samples of the corresponding type (e.g., if cultivated land accounts for 40%, the weight of the plot samples of that type is 0.4). Weighted aggregation is then used to generate a validation sample set matching the spatial units of the small watershed data. Regarding the core parameter settings, the number of decision trees is 1000, the maximum depth is 15, the minimum number of samples required to split a node is 5, and the minimum number of samples per leaf node is 2, to avoid overfitting. During training, out-of-bag (OOB) estimation is used to assess generalization ability, and the OOB error needs to be kept within 12%. When initializing the intermediate connecting layer, the factor regression coefficients of the micro-anchoring layer are used as initial weights to ensure adherence to the micro-scale carbon cycle pattern. Five-fold cross-validation is employed during training, and the parameters are optimized iteratively.
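The area-weighted aggregation used to build the first validation set can be sketched as follows. The sketch assumes one representative plot-scale carbon storage value per land use type and land use shares that sum to 1; the names and numbers are hypothetical.

```python
def weighted_validation_value(plot_means, land_use_share):
    """Aggregate plot-scale carbon storage to one small-watershed validation value.

    plot_means: representative plot-scale carbon storage per land use type
    land_use_share: area proportion of each land use type in the watershed
    """
    total_share = sum(land_use_share.values())
    if abs(total_share - 1.0) > 1e-6:
        raise ValueError("land use shares must sum to 1")
    return sum(share * plot_means[land_use]
               for land_use, share in land_use_share.items())


# Hypothetical watershed: 40% cultivated land, 35% grassland, 25% forest
shares = {"cultivated land": 0.40, "grassland": 0.35, "forest": 0.25}
plot_carbon = {"cultivated land": 5.0, "grassland": 8.0, "forest": 12.0}  # kg / m²
validation_value = weighted_validation_value(plot_carbon, shares)
```

The same weighting pattern applies one level up, where small watershed samples are weighted by their share of the regional area.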

[0116] After model training, key correction areas are identified from the prediction residuals (the difference between actual and predicted values): grid cells whose absolute residuals exceed the mean by more than twice the standard deviation are marked as key areas. These are mostly concentrated in areas with high data heterogeneity, such as steep-slope farmland and forest-grassland ecotones. For the key areas, spatial data of clay content at the watershed scale are extracted, and Pearson correlation analysis is performed against the measured organic / inorganic carbon storage at the plot scale: if clay content is significantly positively correlated with organic carbon (r>0.6, p<0.05), the carbon stability coefficient is k=1+0.2×clay proportion (clay proportion = clay / (clay + silt + sand)); if it is significantly negatively correlated with inorganic carbon (r<-0.5, p<0.05), k=1-0.15×clay proportion. After the predicted values are corrected using k, the model is optimized using the first validation set: the coefficient of determination R² and the root mean square error RMSE are calculated, and if R² ≥ 0.85 and RMSE ≤ 0.6 kg / m², the model is adopted as the intermediate connecting layer.
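The residual screening and the carbon stability coefficient can be sketched in Python as follows. Two points are assumptions made for this example: the threshold is read as the mean plus twice the standard deviation of the absolute residuals, and a coefficient of 1.0 (no correction) is returned when the correlation condition is not met. The p<0.05 significance test is omitted from the sketch, and all names are hypothetical.

```python
from statistics import mean, pstdev


def flag_key_cells(actual, predicted):
    """Mark cells whose absolute residual exceeds mean + 2 * std of the absolute residuals."""
    abs_res = [abs(a - p) for a, p in zip(actual, predicted)]
    threshold = mean(abs_res) + 2.0 * pstdev(abs_res)
    return [r > threshold for r in abs_res]


def clay_proportion(clay, silt, sand):
    """Clay proportion = clay / (clay + silt + sand)."""
    return clay / (clay + silt + sand)


def soc_stability_coefficient(clay_prop, r):
    """k = 1 + 0.2 * clay proportion when clay and organic carbon correlate with r > 0.6."""
    return 1.0 + 0.2 * clay_prop if r > 0.6 else 1.0


def sic_stability_coefficient(clay_prop, r):
    """k = 1 - 0.15 * clay proportion when clay and inorganic carbon correlate with r < -0.5."""
    return 1.0 - 0.15 * clay_prop if r < -0.5 else 1.0


# Hypothetical grid cells: nine well-predicted cells and one large miss
actual = [1.0] * 9 + [6.0]
predicted = [0.9] * 9 + [1.0]
key_flags = flag_key_cells(actual, predicted)
k_soc = soc_stability_coefficient(clay_proportion(30.0, 40.0, 30.0), r=0.7)
```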

[0117] Finally, the intermediate connecting layer outputs 30m resolution raster (TIFF format) of organic carbon density and inorganic carbon density at the watershed scale, as well as the total carbon storage and average carbon density of each watershed.

[0118] Step 104: Train a random forest model using the standard multi-source heterogeneous data corresponding to the regional scale, and integrate spatial interpolation to obtain a macro-constraint layer. The validation set of the macro-constraint layer consists of small watershed-scale data. In this embodiment, a second training set for the random forest model is constructed. The second training set includes regional-scale data extracted from the standard multi-source heterogeneous data, and the small watershed-scale data are weighted and expanded according to the regional land use ratio to serve as a second validation set. Multiple core factors are selected from the regional-scale data based on SHAP values, and factor weights are assigned to the corresponding core factors based on the SHAP values. The core parameters of the random forest model are set, and the random forest model is trained using the second training set. Based on the second validation set, the core parameters of the random forest model are optimized using a grid search method. The disaster intensity in the standard multi-source heterogeneous data is quantified to determine the disaster intensity index, and the predicted values output by the macro-constraint layer are corrected using the disaster intensity index. The carbon density of all grid cells within the prediction area is determined based on the corrected predicted values. Co-kriging interpolation, with the normalized vegetation index as an auxiliary variable, is then used to improve the spatial continuity over the prediction area, resulting in the macro-constraint layer. The output of the macro-constraint layer is an organic / inorganic carbon storage raster map of the prediction area.

[0119] Specifically, the second training set uses regional-scale data (exemplarily set to a 1km resolution, as shown in Figure 5), which include standardized macro-topographic factors (topographic relief, average slope), vegetation factors (annual average normalized vegetation index, seasonal normalized vegetation index change rate), meteorological factors (annual mean temperature, annual precipitation, seasonal mean humidity), soil property factors (regional soil type code, surface organic carbon density survey value), and land use / cover change (LUCC) data. The second validation set is constructed as follows: based on the area proportion of each small watershed within the region, the carbon storage samples of the corresponding small watersheds are assigned weights (e.g., if a small watershed accounts for 15% of the regional area, its carbon storage sample weight is 0.15), and after weighted aggregation, a validation sample set matching the 1km grid cells of the region is generated.

[0120] When selecting core factors based on SHAP values, the global importance score of each feature is first calculated. Factors with the top 70% cumulative scores are retained, ultimately determining 8-10 core factors (such as annual normalized vegetation index, annual precipitation, soil type dummy variable, topographic relief, and cultivated land percentage). SHAP factor weights are directly calculated using the percentage of global importance of each factor's SHAP score; for example, annual normalized vegetation index accounts for 22%, annual precipitation for 18%, and soil type for 15%.
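The SHAP-based weighting and selection can be sketched as follows. The sketch takes per-feature global importance scores (for example, mean absolute SHAP values) as given and does not compute SHAP values itself. It reads the 70% rule as "rank the factors and keep adding them until their cumulative share of importance reaches 70%", which is one possible interpretation; the scores and names are hypothetical.

```python
def shap_weights(importance):
    """Convert global importance scores into percentage-style weights that sum to 1."""
    total = sum(importance.values())
    return {name: score / total for name, score in importance.items()}


def select_core_factors(importance, cumulative_share=0.70):
    """Keep the highest-ranked factors until their cumulative weight reaches the share."""
    ranked = sorted(shap_weights(importance).items(),
                    key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, weight in ranked:
        if running >= cumulative_share:
            break
        selected.append(name)
        running += weight
    return selected


# Hypothetical global importance scores
scores = {"annual NDVI": 22, "annual precipitation": 18, "soil type": 15,
          "cultivated land share": 13, "topographic relief": 12,
          "annual mean temperature": 11, "seasonal humidity": 9}
weights = shap_weights(scores)
core = select_core_factors(scores)
```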

[0121] The core parameters of the random forest model are set as follows: the number of decision trees is set to 1000 to suit the large amount of data at the regional scale; the maximum depth is 12; the minimum number of samples required to split a node is 10; and the minimum number of samples per leaf node is 5. The carbon density of the 1km regional grid cells is used as the dependent variable and the macro-scale factors as independent variables, in order to extract macro-scale trends such as climate-carbon relationships and erosion intensity thresholds. The training process uses 5-fold cross-validation, and the parameters are optimized through grid search. The search range includes the number of decision trees (500-1500) and the number of split variables (1 / 4 to 1 / 2 of the total number of features). The goal is to maximize the coefficient of determination R² on the validation set (required to be ≥0.8) and to minimize the root mean square error RMSE (required to be ≤0.8 kg / m²).

[0122] In terms of disaster intensity quantification, a disaster intensity index is constructed for extreme climate events (drought, flood) and human activities (large-scale conversion of farmland to forest, industrial and mining development) within the region. Drought events are weighted by the duration for which the Standardized Precipitation Evapotranspiration Index (SPEI) stays ≤ -1.5 for three consecutive months; flood events are weighted by the number of days with cumulative precipitation exceeding 1.2 times the annual precipitation; and human activities are weighted by the proportion of farmland converted to industrial and mining land in the land use / cover change (LUCC) data. The index ranges from 0 to 1, where 0 indicates no impact and 1 indicates severe impact. The predicted value is corrected using the formula: Corrected value = Original predicted value × (1 - Disaster Intensity Index × 0.15). For example, if the disaster intensity index of a grid cell is 0.6, the corrected value is 1 - 0.6 × 0.15 = 0.91, i.e., 91% of the original predicted value.
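The correction formula and the worked example above translate directly into a few lines of Python. The function and parameter names are hypothetical; the 0.15 factor and the 0 to 1 index range come from the text.

```python
def correct_for_disaster(predicted, disaster_index, factor=0.15):
    """Corrected value = original predicted value * (1 - disaster intensity index * 0.15)."""
    if not 0.0 <= disaster_index <= 1.0:
        raise ValueError("disaster intensity index must lie in [0, 1]")
    return predicted * (1.0 - disaster_index * factor)


# Worked example from the text: an index of 0.6 keeps 91% of the prediction
corrected = correct_for_disaster(100.0, 0.6)
```

Since the index is capped at 1, the correction can reduce a prediction by at most 15%.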

[0123] The co-kriging interpolation step uses the carbon density output by the random forest model as the primary variable and the normalized vegetation index (NDVI) as an auxiliary variable (because NDVI is significantly positively correlated with carbon storage, r>0.7, p<0.01). A Gaussian semivariogram model is used with a search radius of 5 km, producing a continuous carbon density raster with a resolution of 1 km after interpolation. Edge effect correction is then applied to the raster cells along the region boundary, using a weighted average adjustment of the three nearest neighboring grid cells to ensure spatial continuity.

[0124] The final output of the macro-constraint layer includes: a 1km resolution raster map of organic / inorganic carbon storage (unit: kg / m²), statistics on total regional carbon storage (e.g., the total organic carbon storage in a county within the 0-200cm soil layer is 0.22 billion tons, and the inorganic carbon storage is 1.03 billion tons), SHAP importance heatmaps for each core factor, and a comparison table of average carbon storage for different land use types. In addition, a data quality assessment report is output, marking grid cells with a data missing rate exceeding 10%, providing direction for subsequent data supplementation.

[0125] Step 105: Using the intermediate connecting layer as the hub, embed the micro-anchoring layer and the macro-constraint layer as sub-modules to couple them into a multi-scale coupled model. Use the training set and output of the micro-anchoring layer, the intermediate connecting layer and the macro-constraint layer as the training dataset to train the multi-scale coupled model and obtain a multi-scale carbon storage prediction model.

[0126] In this embodiment, the core driving mechanism of carbon storage in the micro-anchoring layer, intermediate connecting layer, and macro-constraint layer is consistent, all regulated by carbon input, erosion and transport, sedimentary stabilization, and mineralization decomposition. The differences lie only in factor weights, process intensity, and spatial heterogeneity at different scales. High-resolution measured data at the plot scale serve as micro-anchors at the small watershed or regional scales, while macro-pattern data at the regional scale represent trend constraints at the small watershed / plot scales. In other words, the core differences among these three different scales are scale resolution, process characterization granularity, and data type, rather than the core mechanism. Therefore, data from different layers can be fused at multiple resolutions through downscaling / upscaling. Thus, the micro-anchoring layer, intermediate connecting layer, and macro-constraint layer can be integrated into a unified model—a multi-scale coupled model—through multi-model coupling.

[0127] The multi-scale coupled model adopts a hierarchical modular architecture. Specifically, this application uses an intermediate connecting layer as the hub, embedding the micro-anchoring layer and macro-constraint layer as sub-modules. Integrated modeling across the three layers is achieved through a multi-resolution data fusion module, a scale transformation parameter module, and a process coupling core module, preserving the modeling advantages of each scale while realizing unified prediction across all scales. During the prediction process, the micro-anchoring layer anchors process parameters, the intermediate connecting layer integrates the entire link mechanism, and the macro-constraint layer achieves full-domain expansion. Simultaneously, the multi-resolution data fusion module, scale transformation parameter module, and process coupling core module achieve data unification, scale transformation, and trend constraints, ultimately directly outputting the predicted carbon storage at the plot, small watershed, and regional scales without requiring subsequent separate cross-scale connections.

[0128] Furthermore, the micro-anchoring layer uses the linear mixed-effects model (LMM) to fit the high-resolution measured data at the plot scale and outputs core process parameters (such as factor regression coefficients, the erosion-deposition carbon difference, and microbial metabolic parameters), providing micro-scale accuracy benchmarks and parameter initialization values for the process coupling core module. The macro-constraint layer uses macro-scale data such as environmental data (climate data, topographic data) and land use types to output macro-trend constraint rules (such as the carbon change rate along climate gradients, spatial thresholds of erosion intensity, and macro-scale land use proportions), providing global trend boundaries and global spatial constraints for the process coupling core module. The intermediate connecting layer integrates the core process parameters of the micro-anchoring layer with the global trend boundaries and global spatial constraints of the macro-constraint layer. It is the core computational unit of the multi-scale coupled model: it outputs carbon storage at the small watershed scale and provides the basis for scale conversion between the micro-anchoring layer and the macro-constraint layer. The multi-resolution data fusion module is used to achieve integrated preprocessing (standardization, scale adaptation, missing value imputation, etc.) of high-resolution measured data at the plot scale, fused data at the small watershed scale, and low-resolution open-source data at the regional scale, providing a unified data input for all sub-modules / layers. 
Specifically, the multi-resolution data fusion module uses the small watershed scale as the data baseline scale, integrates multi-source heterogeneous data at the plot scale, small watershed scale, and regional scale, and completes data preprocessing through scale adaptation, factor classification, and standardization to obtain standard multi-source heterogeneous data, forming a multi-scale nested dataset (containing plot-scale data at the point scale, small watershed-scale data at the area scale, and regional-scale data at the raster scale, with completely unified data coordinate systems, indicator definitions, and calculation units), providing a unified input for other modules or layers. The scale transformation parameter module is used to construct a transformation parameter library (such as area weighting coefficient, process intensity correction coefficient, and spatial heterogeneity interpolation coefficient) for upscaling (plot to small watershed to region) / downscaling (region to small watershed to plot) based on the nested scale relationship, realizing scale transformation within the multi-scale coupled model. The process coupling core module is used to verify the numerical consistency, spatial coherence, and ecological logic consistency of the prediction results at three scales in real time within the multi-scale coupling model. If the deviation exceeds the threshold (such as total deviation > 5% or boundary difference > 8%), parameter correction is triggered.
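The threshold check performed by the process coupling core module can be sketched as follows. The sketch covers only the numerical consistency of totals (the 5% example threshold) and compares a total obtained by upscaling a finer scale with the total predicted directly at the coarser scale; using the direct prediction as the reference is an assumption, and all names and values are hypothetical.

```python
def relative_deviation(reference, candidate):
    """Absolute deviation of candidate from reference, relative to the reference."""
    return abs(candidate - reference) / abs(reference)


def needs_parameter_correction(upscaled_total, direct_total, threshold=0.05):
    """Return True when the upscaled total deviates from the direct prediction by more than the threshold."""
    return relative_deviation(direct_total, upscaled_total) > threshold


# Hypothetical totals (same carbon storage unit on both sides)
trigger = needs_parameter_correction(upscaled_total=106.0, direct_total=100.0)
no_trigger = needs_parameter_correction(upscaled_total=104.0, direct_total=100.0)
```

The boundary difference check (8% in the example) could reuse the same function with a different threshold applied to values on either side of a scale boundary.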

[0129] Furthermore, to address the potential issue of large fusion errors in multi-resolution data within the coupled multi-scale model, this application uses a small watershed scale as the data baseline, performs upscaling aggregation on plot-scale data, downscaling refinement on regional-scale data, and introduces co-kriging interpolation (using high-resolution data as an auxiliary variable) to improve the spatial accuracy of low-resolution data. Regarding the issue of cross-scale process parameter conversion bias, this application constructs a scale conversion parameter library and sets process correction coefficients for multi-source heterogeneous data at different scales based on the process intensity observed in the field (such as erosion rate and deposition rate), avoiding parameter distortion caused by simple area weighting. To address the potential decrease in generalization ability of the coupled multi-scale model, this application adds a regularization term (such as L1 / L2 regularization) to the loss function of the core process coupling module, and employs cross-scale cross-validation (using plot-scale data to validate the output data of the intermediate connecting layer and small watershed-scale data to validate the output data of the macro-constraint layer), which improves the generalization ability of the multi-scale coupled model.

[0130] During the model training phase, a transfer learning strategy was adopted, using the pre-trained parameters of the micro-anchoring layer, intermediate connecting layer, and macro-constraint layer as initial weights. The objective function was the weighted sum of cross-scale prediction errors (weights of 0.4 for the plot scale, 0.3 for the small watershed scale, and 0.3 for the regional scale). The AdamW algorithm was selected for optimization, with a learning rate of 1e-4 and 500 iterations, until the cross-scale average R² on the validation set reached ≥ 0.88 and the average RMSE was ≤ 0.55 kg/m².
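The weighted objective and the stopping rule described above can be sketched as follows. The weights, the 500-iteration budget, and the R² and RMSE targets are taken from the text; treating 500 as an upper bound with early stopping once both targets are met is one reading of the passage, and the function names are hypothetical.

```python
# Scale weights quoted in the text.
SCALE_WEIGHTS = {"plot": 0.4, "small_watershed": 0.3, "region": 0.3}


def cross_scale_objective(errors: dict) -> float:
    """Weighted sum of the per-scale prediction errors."""
    missing = set(SCALE_WEIGHTS) - set(errors)
    if missing:
        raise KeyError(f"missing error for scales: {sorted(missing)}")
    return sum(SCALE_WEIGHTS[scale] * errors[scale] for scale in SCALE_WEIGHTS)


def should_stop(iteration: int, mean_r2: float, mean_rmse: float,
                max_iterations: int = 500,
                r2_target: float = 0.88, rmse_target: float = 0.55) -> bool:
    """Stop when both validation targets are met or the budget is used up.

    Assumption: the iteration count acts as a cap rather than a fixed length.
    """
    converged = mean_r2 >= r2_target and mean_rmse <= rmse_target
    return converged or iteration >= max_iterations
```

For example, per-scale errors of 0.5, 0.6, and 0.7 combine to 0.4 × 0.5 + 0.3 × 0.6 + 0.3 × 0.7 = 0.59.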

[0131] In this embodiment, the loss function of the multi-scale carbon storage prediction model is as follows:

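[0132] to [0135] [Formulas not reproduced in this text: the overall loss function of the multi-scale carbon storage prediction model and its component loss functions for the intermediate connecting layer, the micro-anchoring layer, and the macro-constraint layer.]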

[0136] In the formulas, the terms denote, in order: the loss function of the multi-scale carbon storage prediction model; the loss function of the intermediate connecting layer; the loss function of the micro-anchoring layer; and the loss function of the macro-constraint layer. The loss weighting coefficient of the micro-anchoring layer, with a value range of [0.4, 0.6], balances the influence of micro-parameters on the multi-scale carbon storage prediction model and ensures that the model conforms to the micro-scale laws of the carbon cycle. The loss weighting coefficient of the macro-constraint layer, with a value range of [0.2, 0.4], balances the constraint that macro-trends place on the model and prevents meso- and micro-level predictions from deviating from the overall regional pattern. The remaining terms denote: the sample size of a single input batch; the i-th sample in a single input batch; the measured value of carbon density (specifically, measured carbon density data at the small watershed or regional scale); the predicted factor regression coefficients, that is, the influence coefficients of each fixed-effect variable on carbon density as predicted by the multi-scale carbon storage prediction model, which are used to ensure consistency of micro-parameters; the factor regression coefficients output by the micro-anchoring layer; the predicted value of the erosion-deposition carbon difference, that is, the difference in organic carbon density between the deposition and erosion zones as predicted by the multi-scale carbon storage prediction model; the measured value of the erosion-deposition carbon difference, calculated from measured data at the plot scale; the upper limit of carbon density, output by the macro-constraint layer, which is the maximum carbon density at the regional scale; the lower limit of carbon density, output by the macro-constraint layer, which is the minimum carbon density at the regional scale; and the L2 norm, used to calculate the deviation.
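Because the formulas themselves are not reproduced in this text, the following is only a sketch of one loss that is consistent with the term definitions above. The functional form of every component (mean squared error, L2 distance between coefficient vectors, a penalty for leaving the regional bounds) and the additive combination are assumptions; only the weight ranges [0.4, 0.6] and [0.2, 0.4] come from the text.

```python
import math


def meso_loss(measured, predicted):
    """Assumed form: mean squared error on carbon density over one batch."""
    return sum((y - p) ** 2 for y, p in zip(measured, predicted)) / len(measured)


def micro_loss(beta_pred, beta_anchor, diff_pred, diff_measured):
    """Assumed form: L2 distance between the predicted coefficients and those of
    the micro-anchoring layer, plus the squared error on the erosion-deposition
    carbon difference."""
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_pred, beta_anchor)))
    return l2 + (diff_pred - diff_measured) ** 2


def macro_loss(predicted, lower, upper):
    """Assumed form: mean distance by which predictions leave [lower, upper]."""
    penalty = sum(max(0.0, p - upper) + max(0.0, lower - p) for p in predicted)
    return penalty / len(predicted)


def total_loss(l_meso, l_micro, l_macro, w_micro=0.5, w_macro=0.3):
    """Assumed combination: intermediate-layer loss plus weighted constraints."""
    if not 0.4 <= w_micro <= 0.6:
        raise ValueError("micro-anchoring weight must lie in [0.4, 0.6]")
    if not 0.2 <= w_macro <= 0.4:
        raise ValueError("macro-constraint weight must lie in [0.2, 0.4]")
    return l_meso + w_micro * l_micro + w_macro * l_macro
```

Keeping the intermediate connecting layer loss unweighted reflects its role as the hub of the coupled model; a different normalization would be equally compatible with the definitions.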

[0137] The multi-scale carbon storage prediction model was validated using an independent cross-scale sample set, including 1000 measured samples from plot profiles, 50 small watershed grid validation samples, and 10 regional survey unit samples. The validation results must meet the following requirements: plot scale, R² ≥ 0.90 and RMSE ≤ 0.4 kg/m²; small watershed scale, R² ≥ 0.85 and RMSE ≤ 0.5 kg/m²; regional scale, R² ≥ 0.80 and RMSE ≤ 0.6 kg/m². If a target is not met, the parameters of each submodule or layer are adjusted retrospectively (such as the calculation method of the carbon stability coefficient of the intermediate connecting layer and the disaster intensity correction formula of the macro-constraint layer), and the coupled training is carried out again.
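The per-scale acceptance test can be sketched with the standard definitions of R² and RMSE. The thresholds are those stated above; the function and dictionary names are hypothetical, and the sample values in the usage lines are illustrative rather than results from the application.

```python
import math

# (minimum R^2, maximum RMSE in kg/m^2) per scale, as stated in the text.
TARGETS = {
    "plot": (0.90, 0.4),
    "small_watershed": (0.85, 0.5),
    "region": (0.80, 0.6),
}


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean) ** 2 for y in measured)
    if ss_tot == 0:
        raise ValueError("measured values have zero variance")
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean squared error."""
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    return math.sqrt(ss_res / len(measured))


def meets_target(scale, measured, predicted):
    """True when both the R^2 and the RMSE requirement of the scale hold."""
    r2_min, rmse_max = TARGETS[scale]
    return (r_squared(measured, predicted) >= r2_min
            and rmse(measured, predicted) <= rmse_max)


measured = [40.0, 42.0, 44.0, 46.0]
good = [40.2, 41.9, 44.1, 45.8]
poor = [42.0, 42.0, 44.0, 44.0]
```

With these illustrative values, `meets_target("plot", measured, good)` holds, while `meets_target("region", measured, poor)` fails on both criteria, which in the described workflow would send the model back for parameter adjustment and retraining.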

[0138] The resulting multi-scale carbon storage prediction model supports seamless switching between plot profile carbon storage, small watershed grid distribution, regional raster map and total statistics, and can output factor contribution heatmaps and ranking tables at each scale, comparing the differences in the influence of factors such as clay content and vegetation cover at different scales.

[0139] For example, for a specific county-level region, the multi-scale carbon storage prediction model can output: at the plot scale, a total carbon storage of 43.9 kg/m² for a certain cultivated land profile (0-200 cm); at the small watershed scale, an average carbon storage of 41.9 kg/m² in the northern forest area of the county; and at the regional scale, a total carbon storage of 125 million tons in the county.

[0140] Step 106: Input the multi-source heterogeneous data of the area to be tested into the multi-scale carbon storage prediction model to obtain the corresponding predicted carbon storage. In this embodiment, the multi-source heterogeneous data of the area to be tested first undergoes unified preprocessing and is processed into standard multi-source heterogeneous data according to step 101. Furthermore, topographic factors (slope, altitude, etc.) can be normalized; vegetation factors (normalized vegetation index sequences) can be smoothed and denoised using time series methods; gaps in meteorological factors (annual mean temperature, precipitation) can be filled by spatial interpolation; soil attribute factors (clay content, organic carbon density) can be converted to standard units; and land use types can be encoded and mapped, so that the data fully matches the feature space of the training dataset. After the data is input, the multi-scale carbon storage prediction model automatically identifies the scale attributes of the multi-source heterogeneous data of the area to be tested: if the input includes measured profile data at the plot scale, the model first calls the accurate parameters of the micro-anchoring layer and, in combination with the scale transformation logic, outputs the stratified carbon storage at the plot scale; if the input is grid data at the small watershed scale, the model uses the intermediate connecting layer as the core and applies the regional trend of the macro-constraint layer for correction; if the input is raster data at the regional scale, the model directly triggers the algorithm process of the macro-constraint layer and uses the small watershed samples of the intermediate connecting layer for error calibration.
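The scale-dependent routing in step 106 can be sketched as a small dispatch table. The pairing of a primary layer with a supporting component follows the text; the way the scale is detected (a `geometry` tag on the input record) and all identifiers are assumptions made for illustration.

```python
from enum import Enum


class Scale(Enum):
    PLOT = "plot"
    SMALL_WATERSHED = "small_watershed"
    REGION = "region"


# (primary layer, supporting component used for correction or calibration)
ROUTES = {
    Scale.PLOT: ("micro_anchoring", "scale_transformation"),
    Scale.SMALL_WATERSHED: ("intermediate_connecting", "macro_constraint"),
    Scale.REGION: ("macro_constraint", "intermediate_connecting"),
}

# Hypothetical rule: the input record carries a tag describing its geometry.
GEOMETRY_TO_SCALE = {
    "profile": Scale.PLOT,           # measured profile data
    "grid": Scale.SMALL_WATERSHED,   # small watershed grid data
    "raster": Scale.REGION,          # regional raster data
}


def detect_scale(record: dict) -> Scale:
    """Infer the scale attribute of an input record from its geometry tag."""
    try:
        return GEOMETRY_TO_SCALE[record["geometry"]]
    except KeyError:
        raise ValueError("cannot determine scale of input record") from None


def route(record: dict):
    """Return the (primary, supporting) components for the record's scale."""
    return ROUTES[detect_scale(record)]
```

A table of this kind keeps the routing logic in one place, so a change in how the layers cooperate only touches `ROUTES`.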

[0141] For example, this application uses the Loess Plateau as the area to be tested. The input data includes: measured organic carbon / inorganic carbon values of 80 plot profiles (plot scale), covering four land use types: sloping farmland, grassland, forest, and dammed land. Each plot profile includes indicators such as organic carbon concentration, inorganic carbon concentration, clay and silt content, and pH value at five depths: 0-10 cm, 10-20 cm, 20-50 cm, 50-100 cm, and 100-200 cm. The input data also includes 30 m resolution small watershed topographic and vegetation data (small watershed scale), including slope, slope position, gully density, normalized difference vegetation index, land use type code, etc., covering three typical small watersheds (total area approximately 36 km²); and regional meteorological and soil survey data with a resolution of 1 km (regional scale) (average annual temperature 10.9 ℃, average annual precipitation 5024 mm, average annual erosion modulus approximately 8300 t km⁻² yr⁻¹, soil type mainly loess, average calcium carbonate content of 120 g kg⁻¹, etc.), covering a total area of approximately 2950 km².

[0142] Taking a profile of sloping farmland around a silt-retaining dam area as an example, the results of inputting standard multi-source heterogeneous data at the plot scale into a multi-scale carbon storage prediction model are shown in Table 1 below.

[0143] Table 1 Output data corresponding to the plot scale

[0144]

[0145] As shown in Table 1 above, the carbon pool structure is dominated by inorganic carbon, with a total carbon storage of 43.8 kg/m². Inorganic carbon accounts for 84.5% of the total carbon storage, far exceeding organic carbon storage. This aligns with the arid and semi-arid climate of the Loess Plateau, where scarce rainfall leads to limited vegetation input and insufficient organic carbon accumulation. Meanwhile, inorganic carbon compounds such as calcium carbonate in the soil remain stable over long periods and form the main body of the carbon pool.
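A profile total such as the one discussed above is a sum over depth layers. The sketch below uses the five depth intervals named in the example and the common stock formula (concentration × bulk density × thickness / 100, giving kg/m² when the inputs are in g/kg, g/cm³, and cm, with coarse fragments ignored). Whether the application uses exactly this formula is an assumption, and the function names are hypothetical.

```python
# Depth intervals (cm) of the plot profiles described in the example.
LAYERS_CM = [(0, 10), (10, 20), (20, 50), (50, 100), (100, 200)]


def layer_stock(conc_g_per_kg, bulk_density_g_cm3, thickness_cm):
    """Carbon stock of one layer in kg/m^2 (coarse fragments ignored)."""
    return conc_g_per_kg * bulk_density_g_cm3 * thickness_cm / 100.0


def profile_stock(concentrations, bulk_densities):
    """Sum the layer stocks over the five depth intervals."""
    if not (len(concentrations) == len(bulk_densities) == len(LAYERS_CM)):
        raise ValueError("one concentration and one bulk density per layer")
    return sum(
        layer_stock(conc, density, bottom - top)
        for (top, bottom), conc, density in zip(LAYERS_CM, concentrations, bulk_densities)
    )


def inorganic_share(organic_stock, inorganic_stock):
    """Fraction of the total carbon stock held as inorganic carbon."""
    return inorganic_stock / (organic_stock + inorganic_stock)
```

As a consistency check on the quoted figures, a total of 43.8 kg/m² with an 84.5% inorganic share implies about 37.0 kg/m² of inorganic carbon and 6.8 kg/m² of organic carbon, and `inorganic_share(6.8, 37.0)` returns roughly 0.845.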

[0146] On the other hand, this plot is subject to severe erosion (annual soil erosion modulus of approximately 8300 t km⁻² yr⁻¹). The clay content in the 0-20 cm layer is only 14% (lower than in the silt-dam deposition area). Soil particle coarsening weakens the physical protection of organic carbon, resulting in lower surface organic carbon reserves than in similar land use types in non-eroded areas. Although the 14% clay content in the 0-20 cm layer is the highest value in this profile, it is low overall, which limits the soil's capacity to adsorb and fix organic carbon and makes organic carbon easy to lose through erosion. Inorganic carbon, existing in the form of calcium carbonate, is less affected by particle composition and is more stable. The soil pH is 8.3 (strongly alkaline), which inhibits the mineralization and decomposition of organic carbon while promoting carbonate precipitation, further consolidating the dominant position of inorganic carbon.

[0147] Taking as an example the small watershed-scale standard multi-source heterogeneous data of a small watershed (area 8.3 km²) containing a silt-retaining dam, the output results of the multi-scale carbon storage prediction model, divided according to land use type, are shown in Table 2 below.

[0148] Table 2 Output data for small watershed scale

[0149]

[0150] As shown in Table 2 above, the carbon reserves of this profile are 184.9 kg/m². The carbon storage capacity of different land use types differed by up to 26.1%, reflecting the strong regulatory effect of land use on the carbon pool. The carbon storage capacity of the dam area was significantly higher than that of other areas, with organic carbon storage being 1.56 times that of sloping farmland and inorganic carbon storage being 1.26 times that of sloping farmland. This verifies that the dam area experienced fine particle enrichment (clay content reaching 21%) during sediment deposition, which provided physical protection for organic carbon and created conditions for carbonate (inorganic carbon) precipitation.

[0151] On the other hand, the above data conforms to the complete chain of erosion and loss of sloping farmland, gully transport, and siltation by dams within the small watershed. This results in sloping farmland experiencing significant loss of organic carbon and fine particles due to severe erosion, leading to the lowest organic carbon storage. Dams, on the other hand, intercept eroded organic-rich sediment, resulting in the highest organic carbon storage. Furthermore, the vegetation cover of forest land and grassland (with mean normalized vegetation index of 0.45 and 0.38, respectively) is higher than that of sloping farmland (with a mean normalized vegetation index of 0.22). This is because forest land and grassland increase organic carbon input through fallen leaves and root exudation, while reducing soil erosion and carbon loss, thus resulting in higher carbon storage.

[0152] At the regional scale of the Loess Plateau, meteorological data and soil survey data were processed into standard multi-source heterogeneous data, from which 12 key factors (clay and silt content, erosion modulus, normalized vegetation index, annual precipitation, pH value, land use type, etc.) were input into the multi-scale carbon storage prediction model. Based on regional constraints, the model outputs the carbon storage of each grid cell at 1 km resolution. The regional-scale carbon storage is obtained by summing the carbon storage of the 2950 effective grid cells, giving a total carbon storage of 125 million tons in the 0-200 cm layer, of which 22 million tons are organic carbon and 102 million tons are inorganic carbon. This is consistent with the distinctive carbon pool of the arid and semi-arid Loess Plateau, in which carbon storage depends mainly on inorganic carbon.
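The grid summation described above amounts to multiplying each cell's carbon density by the cell area and adding the results. The sketch below assumes square 1 km cells and densities in kg/m²; the uniform density used in the usage line is a hypothetical value chosen only to show the order of magnitude, not a model output.

```python
CELL_AREA_M2 = 1_000 * 1_000  # one 1 km x 1 km grid cell, in square metres


def regional_total_tonnes(cell_densities_kg_m2):
    """Sum per-cell carbon density (kg/m^2) into a regional total in tonnes."""
    total_kg = sum(density * CELL_AREA_M2 for density in cell_densities_kg_m2)
    return total_kg / 1_000.0


# Hypothetical uniform field over the 2950 effective cells named in the text.
uniform_total = regional_total_tonnes([42.4] * 2950)
```

A uniform 42.4 kg/m² over 2950 cells gives about 125.08 million tonnes. Read the other way, the reported 125 million tons over roughly 2950 km² corresponds to a mean density of about 42.4 kg/m², which is of the same order as the plot-scale total of 43.8 kg/m² reported for the example profile.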

[0153] Based on the carbon storage data at the three scales mentioned above, it can be concluded that the proportion of inorganic carbon is much higher than that of organic carbon, the contribution of silt-retaining dams / forest land is the largest, and clay and erosion modulus are the core drivers. These characteristics are consistent with the carbon pool characteristics of the Loess Plateau in arid and semi-arid regions under strong erosion background, which is sufficient to demonstrate the effectiveness of the method in this application.

[0154] In addition, users can build a visualization interface for the multi-scale carbon storage prediction model, switch between carbon storage distribution maps at different scales, and view the contribution ranking of each factor, providing a basis for decision-making in subsequent carbon sink management.

[0155] While this application provides the method operation steps as described in the embodiments or flowcharts, more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps listed in this embodiment is merely one possible execution order among many and does not represent the only execution order. In actual device or client product execution, the methods shown in this embodiment or the accompanying drawings can be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment).

[0156] As shown in Figure 2, this embodiment of the application also provides a multi-scale soil carbon storage prediction device 200 based on multi-source heterogeneous data fusion. The device includes: a data standardization module 201, a first training module 202, a second training module 203, a third training module 204, a coupling module 205, and a prediction module 206, as detailed below.

[0157] The data standardization module 201 is used to perform scale adaptation and standardization processing on multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales; among which, the scales include plot scale, small watershed scale and regional scale.

[0158] The first training module 202 is used to train a linear mixed-effects model with standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer.
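A linear mixed-effects model of the kind this module trains can be written, as a sketch, in the generic form below. The symbols are assumptions chosen for illustration, since the formulas themselves are not reproduced in this text; the structure follows the terms named in claim 6 (an intercept, m predictors with regression coefficients, a random effect of land use type, and a random error), and the normal distributions are the usual modelling assumption rather than something stated in the application.

```latex
\mathrm{SOC}_{ij} = \beta_0 + \sum_{k=1}^{m} \beta_k X_{ijk} + u_i + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)
```

Here i indexes the soil use type and j the sample within it. The inorganic carbon model, which claim 6 describes in terms of clay and silt content and pH value, would be written analogously; its exact terms cannot be recovered from this text.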

[0159] The second training module 203 is used to train a random forest model with standard multi-source heterogeneous data corresponding to a small watershed scale to obtain an intermediate connecting layer, and the validation set of the intermediate connecting layer consists of plot-scale data.

[0160] The third training module 204 is used to train a random forest model with standard multi-source heterogeneous data corresponding to the regional scale, and integrate spatial interpolation to obtain a macro-constraint layer. The validation set of the macro-constraint layer consists of small watershed scale data.

[0161] The coupling module 205 is used to embed the micro-anchoring layer and the macro-constraint layer as sub-modules with the intermediate connecting layer as the hub, and couple them to obtain a multi-scale coupling model. The training set and output of the micro-anchoring layer, the intermediate connecting layer and the macro-constraint layer are mixed as training datasets to train the multi-scale coupling model and obtain a multi-scale carbon storage prediction model.

[0162] The prediction module 206 is used to input multi-source heterogeneous data of the area to be tested into the multi-scale carbon storage prediction model to obtain the corresponding predicted carbon storage.

[0163] Some modules in the apparatus described in this application can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, etc., that perform a specific task or implement a specific abstract data type. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0164] The apparatus or module described in the above embodiments can be implemented by a computer chip or physical entity, or by a product with a certain function. For ease of description, the above apparatus is described by dividing it into various modules according to their functions. When implementing the embodiments of this application, the functions of each module can be implemented in one or more software and / or hardware. Of course, a module that implements a certain function can also be implemented by combining multiple sub-modules or sub-units.

[0165] The methods, apparatus, or modules described in this application can be implemented in a computer-readable program code manner. The controller can be implemented in any suitable manner, such as a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of a memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code manner, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included within it for implementing various functions can also be considered as structures within the hardware component. Alternatively, the device used to implement various functions can be viewed as either a software module that implements the method or a structure within a hardware component.

[0166] This application also provides an apparatus, the apparatus comprising: a processor; a memory for storing processor-executable instructions; wherein, when the processor executes the executable instructions, it implements the method described in this application.

[0167] This application also provides a non-volatile computer-readable storage medium storing a computer program or instructions thereon, which, when executed, enables the method described in this application embodiment to be implemented.

[0168] Furthermore, in the various embodiments of the present invention, each functional module can be integrated into a processing module, or each module can exist independently, or two or more modules can be integrated into a single module.

[0169] The aforementioned storage media include, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Cache, Hard Disk Drive (HDD), or Memory Card. The memory can be used to store computer program instructions.

[0170] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary hardware. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product, or it can be embodied in the process of data migration. The computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, mobile terminal, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.

[0171] The various embodiments described in this specification are presented in a progressive manner. Similar or identical parts between embodiments can be referred to interchangeably. Each embodiment focuses on its differences from other embodiments. All or part of this application can be used in numerous general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices, etc.

[0172] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit this application. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of this application.

Claims

1. A multi-scale soil carbon storage prediction method based on multi-source heterogeneous data fusion, characterized in that it comprises: performing scale adaptation and standardization processing on multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales, wherein the scales include the plot scale, the small watershed scale, and the regional scale; training a linear mixed-effects model using the standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer; training a random forest model using the standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer, wherein the validation set of the intermediate connecting layer consists of plot-scale data; training a random forest model using the standard multi-source heterogeneous data corresponding to the regional scale and integrating spatial interpolation to obtain a macro-constraint layer, wherein the validation set of the macro-constraint layer consists of small watershed-scale data; with the intermediate connecting layer as the hub, embedding the micro-anchoring layer and the macro-constraint layer as sub-modules and coupling them to obtain a multi-scale coupled model, and mixing the training sets and outputs of the micro-anchoring layer, the intermediate connecting layer, and the macro-constraint layer as a training dataset to train the multi-scale coupled model, thereby obtaining a multi-scale carbon storage prediction model; and inputting the multi-source heterogeneous data of the area to be tested into the multi-scale carbon storage prediction model to obtain the corresponding predicted carbon storage.

2. The method according to claim 1, characterized in that the multi-source heterogeneous data includes physical data, chemical data, biological data, vegetation data, and environmental data; the physical data include soil moisture content, soil bulk density, clay and silt content, sand content, gravel content, and soil porosity; the chemical data include organic carbon (SOC) concentration, inorganic carbon (SIC) concentration, pH value, conductivity, total nitrogen, total phosphorus, carbon-to-nitrogen ratio, dissolved organic carbon, mineral nitrogen, and available phosphorus; the biological data include microbial biomass carbon, microbial biomass nitrogen, microbial biomass phosphorus, extracellular enzyme activity, biomass-specific enzyme activity, enzyme stoichiometry, and microbial community structure; the vegetation data include vegetation biomass, vegetation coverage, normalized difference vegetation index, vegetation type, root biomass, and litter volume; and the environmental data include temperature, precipitation, altitude, slope, slope position, watershed area, gully density, erosion modulus, and disaster intensity.

3. The method according to claim 2, characterized in that performing scale adaptation and standardization processing on the multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales includes: completing missing values in the multi-source heterogeneous data and correcting outliers; classifying the multi-source heterogeneous data according to their heterogeneity and labeling them with the corresponding scales; resampling the multi-source heterogeneous data hierarchically according to scale, and performing boundary calibration on the resampled multi-source heterogeneous data; and, based on the classification of the multi-source heterogeneous data, standardizing the data with a standardization method that suits its data characteristics, thereby obtaining standard multi-source heterogeneous data at different scales under multiple classifications.

4. The method according to claim 3, characterized in that it further comprises: establishing a spatiotemporal grid model to aggregate the standard multi-source heterogeneous data at different scales, including: discretizing the target region into an adaptive grid based on the scale of the standard multi-source heterogeneous data; determining the time step based on the research period of the target region, so as to establish the spatiotemporal grid model on the adaptive grid; and unifying the spatial coordinates and time reference of the standard multi-source heterogeneous data and mapping the data to the spatiotemporal grid model.

5. The method according to claim 1, characterized in that training a linear mixed-effects model using the standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer includes: extracting the plot-scale data corresponding to the plot scale from the standard multi-source heterogeneous data; grouping the plot-scale data according to the soil use type of the sampling area, wherein the sample size in each group reaches a first number; determining the subjective weights of the plot-scale data in each group using the analytic hierarchy process; determining the objective weights of the plot-scale data in each group based on the information entropy of the plot-scale data, and determining weighting factors in combination with the subjective weights; selecting multiple predictive factors from the plot-scale data of each group using the weighting factors; constructing linear mixed-effects models based on the multiple predictive factors, and determining the likelihood value of each linear mixed-effects model by maximum likelihood estimation; and determining the optimal linear mixed-effects model based on the likelihood values, and refitting the optimal linear mixed-effects model by restricted maximum likelihood estimation to obtain the micro-anchoring layer.

6. The method according to claim 5, characterized in that the optimal linear mixed-effects model includes an optimal organic carbon prediction model and an optimal inorganic carbon prediction model; the optimal organic carbon prediction model is as follows: [formula not reproduced in this text]; the optimal inorganic carbon prediction model is as follows: [formula not reproduced in this text]; in the formulas, the terms denote, in order: the organic carbon concentration of the j-th sample in the i-th soil use type; the intercept term; m, the total number of predictive factors selected from the plot-scale data; the regression coefficient of the k-th predictive factor; the k-th predictive factor in the plot-scale data of the j-th sample in the i-th soil use type; the random effect of land use type for the i-th soil use type; the random error of the j-th sample in the i-th soil use type; the inorganic carbon concentration of the j-th sample in the i-th soil use type; the regression coefficients of the clay and silt content and of the pH value, respectively; the clay and silt content of the j-th sample in the i-th soil use type; and the pH value of the j-th sample in the i-th soil use type.

7. The method according to claim 1, characterized in that training a random forest model using the standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer includes: constructing a first training set for the random forest model, wherein the first training set includes the small watershed-scale data extracted from the standard multi-source heterogeneous data, and the plot-scale data is weighted and expanded according to the land use ratio of the small watershed to serve as a first validation set; setting the core parameters of the random forest model, wherein the core parameters include the number of decision trees, the number of variables considered at each decision tree split, and the minimum number of samples per node; training the random forest model using the first training set, and identifying key correction areas based on the deviation of the predicted values; calculating the correlation between clay content and organic / inorganic carbon in the key correction areas to determine the carbon stability coefficient; correcting the predicted values of the random forest model based on the carbon stability coefficient, and optimizing the random forest model using the first validation set to determine the intermediate connecting layer; and, based on the corrected predicted values and the small watershed-scale data, determining the carbon storage of each layer in the key correction areas, and summing the carbon storage of each layer to obtain the total profile storage.

8. The method according to claim 1, characterized in that training a random forest model using the standard multi-source heterogeneous data corresponding to the regional scale and integrating spatial interpolation to obtain a macro-constraint layer includes: constructing a second training set for the random forest model, wherein the second training set includes the regional-scale data extracted from the standard multi-source heterogeneous data, and the small watershed-scale data is weighted and expanded according to the regional land use ratio to serve as a second validation set; selecting multiple core factors from the regional-scale data based on SHAP values, and assigning factor weights to the corresponding core factors based on the SHAP values; setting the core parameters of the random forest model and training the random forest model using the second training set; optimizing the core parameters of the random forest model using a grid search method based on the second validation set; quantifying the disaster intensity in the standard multi-source heterogeneous data to determine a disaster intensity index; correcting the predicted values output by the macro-constraint layer using the disaster intensity index; determining the carbon density of all grid cells within the prediction area based on the corrected predicted values; and applying co-kriging interpolation with the normalized vegetation index as an auxiliary variable to improve the spatial continuity of the prediction area, thereby obtaining the macro-constraint layer, wherein the output of the macro-constraint layer is a raster map of the organic / inorganic carbon storage of the prediction area.

9. A multi-scale soil carbon storage prediction device for implementing the method according to any one of claims 1 to 8, characterized in that it comprises: a data standardization module, configured to perform scale adaptation and standardization processing on multi-source heterogeneous data to obtain standard multi-source heterogeneous data at different scales, wherein the scales include the plot scale, the small watershed scale and the regional scale; a first training module, configured to train a linear mixed-effects model with the standard multi-source heterogeneous data corresponding to the plot scale to obtain a micro-anchoring layer; a second training module, configured to train a random forest model with the standard multi-source heterogeneous data corresponding to the small watershed scale to obtain an intermediate connecting layer, wherein the validation set of the intermediate connecting layer is composed of plot-scale data; a third training module, configured to train a random forest model with the standard multi-source heterogeneous data corresponding to the regional scale and to integrate spatial interpolation to obtain a macro-constraint layer, wherein the validation set of the macro-constraint layer is composed of small watershed-scale data; a coupling module, configured to embed the micro-anchoring layer and the macro-constraint layer as sub-modules with the intermediate connecting layer as the hub and couple them to obtain a multi-scale coupled model, wherein the training sets and outputs of the micro-anchoring layer, the intermediate connecting layer and the macro-constraint layer are mixed into a training dataset used to train the multi-scale coupled model and obtain a multi-scale carbon storage prediction model; and a prediction module, configured to input multi-source heterogeneous data of the area to be predicted into the multi-scale carbon storage prediction model to obtain the corresponding predicted carbon storage.
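
The data flow between the modules of claim 9 can be sketched as a small Python skeleton. It shows only the wiring: which scale each training module receives and which scale supplies its validation set. Every class, field and dictionary key below is hypothetical, and the modules themselves are placeholders supplied by the caller; the actual models and the coupling procedure are not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class CarbonStoragePredictionDevice:
    """Hypothetical wiring of the modules named in claim 9."""

    # Data standardization module: raw data -> data keyed by scale.
    standardize: Callable[[Any], Dict[str, Any]]
    # First training module: plot-scale data -> micro-anchoring layer.
    train_micro: Callable[[Any], Any]
    # Second training module: (watershed training data, plot validation data).
    train_intermediate: Callable[[Any, Any], Any]
    # Third training module: (regional training data, watershed validation data).
    train_macro: Callable[[Any, Any], Any]
    # Coupling module: (intermediate, micro, macro) -> trained prediction model.
    couple: Callable[[Any, Any, Any], Callable[[Any], float]]

    def build(self, raw_data: Any) -> Callable[[Any], float]:
        """Run standardization, the three training modules, then coupling."""
        scaled = self.standardize(raw_data)
        micro = self.train_micro(scaled["plot"])
        # The intermediate layer is validated against plot-scale data.
        intermediate = self.train_intermediate(
            scaled["small_watershed"], scaled["plot"]
        )
        # The macro layer is validated against small watershed-scale data.
        macro = self.train_macro(scaled["regional"], scaled["small_watershed"])
        # The intermediate layer acts as the hub of the coupled model.
        return self.couple(intermediate, micro, macro)

    def predict(self, model: Callable[[Any], float], area_data: Any) -> float:
        """Prediction module: apply the coupled model to the target area."""
        return model(area_data)
```

Passing the modules in as callables keeps the skeleton independent of any particular modelling library, which matches the claim's purely functional description of each module.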

10. An apparatus for performing a multi-scale soil carbon storage prediction method based on multi-source heterogeneous data fusion, characterized in that it comprises: a processor; and a memory for storing processor-executable instructions; wherein the processor, when executing the executable instructions, implements the method according to any one of claims 1 to 8.