A multi-level prediction method and system for soil erosion characteristics based on Bayesian-optimized CatBoost

By optimizing the CatBoost model using Bayesian optimization and feature contribution analysis, the problems of data consistency and model accuracy in soil erosion characteristic prediction were solved, achieving efficient and accurate multi-level prediction of soil erosion characteristics and meeting the needs of engineering emergency response.

CN122241038A · Pending · Publication Date: 2026-06-19 · NANJING HYDRAULIC RES INST +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANJING HYDRAULIC RES INST
Filing Date
2026-01-29
Publication Date
2026-06-19


Abstract

This invention discloses a multi-level prediction method and system for soil erosion characteristics based on Bayesian-optimized CatBoost. The method includes: acquiring soil erosion data and preprocessing the data; using SHAP to calculate the contribution of features to the prediction target from the preprocessed data, and dividing the features according to their contribution; selecting the corresponding feature set according to the current prediction mode, constructing a CatBoost model, and using the weighted sum of prediction error and training time as the objective function; optimizing the hyperparameters of the CatBoost model using a Bayesian optimization method to obtain the optimal hyperparameters; using the preprocessed soil erosion data as input, training the CatBoost model using the optimal hyperparameters, and evaluating the model performance through K-fold cross-validation. This invention improves the efficiency and accuracy of soil erosion characteristic prediction, providing technical support for geotechnical engineering safety assessment and disaster prevention and mitigation.

Description

Technical Field

[0001] This invention relates to a method and system for predicting soil erosion characteristics, and more particularly to a multi-level prediction method and system for soil erosion characteristics based on Bayesian-optimized CatBoost.

Background Technology

[0002] In the field of geotechnical engineering, the study of soil erosion dynamics is a core topic in the safety assessment of water conservancy infrastructure such as dikes and dams. According to statistics from the International Commission on Large Dams, 21% of earth-rock dam failures worldwide are directly related to progressive erosion damage, making accurate prediction of soil erosion resistance crucial for ensuring project safety.

[0003] Traditional soil erosion characteristic testing procedures are complex, time-consuming, and highly equipment-dependent. For example, conventional erosion tests require precise control of hydrodynamic conditions and soil sample preparation, and data acquired by different devices vary. Different testing principles (velocity conversion and torque conversion) lead to significant distributional drift in the feature space of the acquired erosion data. This nonlinear difference in physical distribution cannot be eliminated by simple linear normalization, which limits the comparability and reliability of the results. Furthermore, existing numerical models capture complex parameter correlations poorly and rely primarily on a single parameter. For instance, while the classic Wilson model and the excess shear stress model provide a general framework for studying the erosion resistance of different soils at different flow velocities, they consider only the shear force of the water flow, neglect the coupling effect of soil porosity and clay content, and cannot predict how erosion develops when the soil layer thickness is finite. In practical engineering this may lead to overly conservative predictions with large errors that fail to meet engineering requirements.

[0004] With the increasing demands for prediction efficiency and accuracy in geotechnical engineering, there is an urgent need to establish intelligent evaluation methods that balance both. In July 2024, extreme heavy rainfall caused multiple breaches in the Dongting Lake dikes, resulting in severe flooding. The event highlighted the lack of rapid prediction models: the breaches could not be predicted and controlled in a short period, which led to significant economic losses and social impact. Machine learning techniques, such as the Bayesian-optimized CatBoost model, have demonstrated excellent predictive performance in many fields, but their application to predicting soil erosion characteristics is still at the exploratory stage. Therefore, providing a multi-level prediction method and system for soil erosion characteristics based on Bayesian-optimized CatBoost is of great significance.

Summary of the Invention

[0005] Purpose of the invention: The purpose of this invention is to provide a multi-level prediction method and system for soil erosion characteristics based on Bayesian optimization CatBoost, which achieves high accuracy, fast response and efficient calculation in predicting soil erosion characteristics, and provides reliable technical support for engineering design and emergency assessment.

[0006] Technical solution: The present invention provides a multi-level prediction method for soil erosion characteristics based on Bayesian optimization CatBoost, comprising:

[0007] acquiring soil erosion data and preprocessing the data;

[0008] using SHAP to calculate, from the preprocessed data, the contribution of each feature to the prediction target, and dividing the features according to their contribution;

[0009] selecting the corresponding feature set based on the current prediction mode, constructing the CatBoost model, and using the weighted sum of prediction error and training time as the objective function;

[0010] optimizing the hyperparameters of the CatBoost model using the Bayesian optimization method to obtain the optimal hyperparameters;

[0011] using the preprocessed soil erosion data as input, training the CatBoost model with the optimal hyperparameters, and evaluating the model performance using K-fold cross-validation.

[0012] Furthermore, the soil erosion data includes soil erosion performance indicators and soil characteristic parameters. The preprocessing includes normalizing the soil erosion performance indicators and soil characteristic parameters, using the optimal transport mapping method to map the source domain data to the target domain distribution for the soil category feature data, using the Kolmogorov-Smirnov (KS) test to check the consistency of data distributions from different devices, and selecting representative samples.

[0013] Furthermore, the optimal transport mapping method includes constructing a cost matrix, wherein the elements of the cost matrix represent the distance between the source domain sample and the target domain sample in the feature space, and obtaining the mapping weights by solving the optimal transport problem with entropy regularization.

[0014] Furthermore, the segmentation of features according to contribution includes:

[0015] Obtain the preprocessed soil erosion data and calculate the mean and standard deviation of the contribution of all features;

[0016] Features whose contribution is higher than the mean plus a set multiple of the standard deviation are classified as key parameters;

[0017] Features whose contribution lies between the mean minus and the mean plus that multiple of the standard deviation are classified as secondary parameters;

[0018] Features whose contribution is lower than the mean minus that multiple of the standard deviation are classified as low-correlation parameters.

[0019] Furthermore, the prediction modes include emergency prediction mode, regular prediction mode, and precision prediction mode.

[0020] Furthermore, the Bayesian optimization uses Gaussian process regression as a surrogate model to establish a mapping between hyperparameters and the objective function, and selects the optimal hyperparameters using the expected improvement (EI) criterion.

[0021] Furthermore, training the CatBoost model using the optimal hyperparameters and evaluating the model performance through K-fold cross-validation includes:

[0022] The preprocessed soil erosion data is divided into soil characteristic parameters and soil erosion performance indicators, which are used as input feature matrix and target vector, respectively. The data is then divided into validation set and training set through nested cross-validation.

[0023] Initialize the CatBoost model using the optimal hyperparameters;

[0024] The training set is input into the CatBoost model to train the model, and an early stopping mechanism is used to prevent the model from overfitting.

[0025] The model performance is validated on the validation set, and the root mean square error and coefficient of determination are used to evaluate the model.

[0026] The present invention discloses a multi-level prediction system for soil erosion characteristics based on Bayesian optimization CatBoost, comprising:

[0027] Data acquisition module: Acquires soil erosion data and preprocesses the data;

[0028] Feature classification module: SHAP is used to calculate the contribution of features to the prediction target from the preprocessed data, and the features are classified according to their contribution.

[0029] Model building module: Select the corresponding feature set according to the current prediction mode, build the CatBoost model, and use the weighted sum of prediction error and training time as the objective function;

[0030] Parameter optimization module: The hyperparameters of the CatBoost model are optimized using the Bayesian optimization method to obtain the optimal hyperparameters;

[0031] Model training module: Using preprocessed soil erosion data as input, the CatBoost model is trained using the optimal hyperparameters, and the model performance is evaluated through K-fold cross-validation.

[0032] The computer device of the present invention includes one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the programs, when executed by the processors, implement the steps of the multi-level prediction method for soil erosion characteristics based on Bayesian optimization CatBoost.

[0033] The present invention discloses a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, it implements the steps of a multi-level prediction method for soil erosion characteristics based on Bayesian optimization CatBoost.

[0034] Beneficial effects: Compared with the prior art, the present invention has the following significant advantages:

[0035] (1) By introducing optimal transport mapping, this invention effectively solves the comparability problem between soil erosion data collected by different devices, ensuring the consistency and fusion of multi-source data. This enables data from different devices to be effectively compared and analyzed in a unified feature space, significantly improving the training data quality of the soil erosion prediction model, thereby enhancing the prediction accuracy and stability of the model.

[0036] (2) This invention identifies and classifies key features affecting soil erosion prediction through SHAP feature contribution analysis. By combining multi-level prediction modes, the computational accuracy and efficiency of the model can be flexibly adjusted according to different application scenarios, so that the model can meet the real-time requirements of emergency response.

[0037] (3) This invention uses Bayesian optimization for hyperparameter tuning and combines an early stopping mechanism with CatBoost model training to ensure the efficiency, accuracy and stability of the model. Finally, cross-validation and performance evaluation are used to ensure the application value of the model in soil erosion prediction tasks and to improve prediction accuracy and generalization ability.

Attached Figure Description

[0038] Figure 1 is a flowchart of the method described in this invention.

[0039] Figure 2 is a comparison chart showing the fit between the predicted and actual values of soil erosion characteristics under the three prediction modes.

Detailed Implementation

[0040] The technical solution of the present invention will be further described below with reference to the accompanying drawings.

[0041] Example 1

[0042] This embodiment provides a multi-level prediction method for soil erosion characteristics based on Bayesian optimization CatBoost, including:

[0043] Step S1: Obtain soil erosion data and preprocess the data;

[0044] Specifically, it includes the following steps:

[0045] Step S11: Obtain the input soil erosion test data. The data includes soil erosion performance indicators such as erosion rate, erosion resistance coefficient, and critical shear stress, as well as soil characteristic parameters such as liquid limit, plastic limit, plasticity index, median particle size, water content, and soil type. The soil erosion performance indicators and soil characteristic parameters measured by different equipment (such as EFA, ESTD, SET, etc.) are normalized to eliminate the influence of experimental conditions on the data distribution. The calculation is as follows:

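[0046] $$x'_{k,i} = \frac{x_{k,i} - \mu_k}{\sigma_k}$$

[0047] where $x_{k,i}$ is the $i$-th group of data collected by device $k$, $x'_{k,i}$ is its normalized value, and $\mu_k$ and $\sigma_k$ are respectively the mean and standard deviation of the data collected by device $k$.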
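The per-device normalization of step S11 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name, the sample readings, and the use of the population standard deviation are assumptions, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev

def normalize_per_device(readings):
    """Z-score each device's measurements with that device's own mean and std."""
    normalized = {}
    for device, values in readings.items():
        mu = mean(values)
        sigma = pstdev(values)  # population std; an assumption of this sketch
        if sigma == 0:
            # Constant series: nothing to scale by, so map every value to 0.
            normalized[device] = [0.0 for _ in values]
        else:
            normalized[device] = [(v - mu) / sigma for v in values]
    return normalized

# Hypothetical erosion-rate readings from two devices (illustrative values only).
readings = {"EFA": [2.0, 4.0, 6.0], "SET": [10.0, 20.0, 30.0]}
z = normalize_per_device(readings)
```

After this step both devices are expressed on the same dimensionless scale, although, as the description notes, linear scaling alone does not remove nonlinear distribution differences between devices.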

[0048] Step S12: Obtain category feature data (such as soil type, test environment, etc.) and use the optimal transport mapping method based on the cost matrix to ensure the physical comparability of data from different sources.

[0049] Specifically, the data distributions of different devices exhibit a "domain drift" phenomenon. To map the source domain data $X_s$ (non-standard equipment data) accurately onto the target domain $X_t$ (standard equipment data), a cost matrix $C$ is constructed. Each element $C_{i,j}$ of the cost matrix is defined as the distance between the $i$-th sample in the source domain and the $j$-th sample in the target domain in the feature space, calculated as follows:

[0050] $$C_{i,j} = \lVert x_i^{s} - x_j^{t} \rVert_2$$

[0051] where $\lVert \cdot \rVert_2$ represents the Euclidean distance between feature vectors. The cost matrix quantifies the similarity in physical properties (such as particle size, plasticity index, etc.) between the $i$-th soil sample in the source domain and the $j$-th soil sample in the target domain.

[0052] Based on the constructed cost matrix $C$, the optimal transport plan is solved using the Sinkhorn algorithm. That is, a joint probability distribution matrix $\gamma$ is sought such that the total transport cost of transforming $X_s$ into $X_t$ is minimized. The optimization objective function is:

[0053] $$\gamma^{*} = \arg\min_{\gamma \in \Pi(\mu_s, \mu_t)} \sum_{i,j} \gamma_{i,j}\, C_{i,j} + \lambda\, \Omega(\gamma)$$

[0054] where $\Pi(\mu_s, \mu_t)$ denotes the set of all joint distributions whose marginal distributions are $\mu_s$ and $\mu_t$ respectively; $\Omega(\gamma)$ is an entropy regularization term, used to improve computation speed and smooth the mapping results; $\gamma_{i,j}$ denotes the mapping weight from category $i$ to category $j$; and $\lambda$ is the regularization parameter.

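[0055] Using the solved optimal mapping weights $\gamma^{*}$, the source domain data are transformed into the target domain space through barycentric mapping:

[0056] $$\hat{x}_i^{s} = \frac{\sum_{j} \gamma^{*}_{i,j}\, x_j^{t}}{\sum_{j} \gamma^{*}_{i,j}}$$

[0057] where $\hat{x}_i^{s}$ represents the transformed category data.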
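Step S12 (cost matrix, entropy-regularized transport, barycentric mapping) can be sketched in plain Python as follows. This is a hedged illustration: the uniform marginals, the regularization value `reg`, the iteration count, and the toy two-dimensional samples are assumptions not stated in the description, and a production system would more likely use a dedicated optimal transport library.

```python
import math

def cost_matrix(source, target):
    """C[i][j] = Euclidean distance between source sample i and target sample j."""
    return [[math.dist(s, t) for t in target] for s in source]

def sinkhorn(cost, reg=0.5, n_iter=500):
    """Entropy-regularized transport plan between uniform marginals."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform source marginal (assumption)
    b = [1.0 / m] * m  # uniform target marginal (assumption)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_map(plan, target):
    """Move each source sample to the plan-weighted average of the target samples."""
    dim = len(target[0])
    mapped = []
    for row in plan:
        mass = sum(row)
        mapped.append(
            [sum(w * t[d] for w, t in zip(row, target)) / mass for d in range(dim)]
        )
    return mapped

# Toy feature vectors (illustrative only): 2 source samples, 3 target samples.
source = [[0.0, 0.0], [1.0, 1.0]]
target = [[0.2, 0.1], [1.3, 0.9], [0.8, 1.2]]
plan = sinkhorn(cost_matrix(source, target))
mapped = barycentric_map(plan, target)
```

Each mapped sample is a convex combination of target samples, so the transformed source data necessarily lies inside the region occupied by the standard-equipment data.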

[0058] Step S13: Obtain the normalized data obtained in step S11 and the data obtained in step S12, and use the KS test to check the consistency of the data distribution, and select representative samples, specifically:

[0059] $$D = \sup_{x} \left| F_1(x) - F_2(x) \right|$$

[0060] where $D$ is the KS statistic, and $F_1(x)$ and $F_2(x)$ are the empirical distribution functions of the data from different devices.
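The two-sample KS statistic used in step S13 can be computed directly from the empirical distribution functions. The sketch below is illustrative; the function name and the sample values are assumptions, and the description does not state the acceptance threshold or the rule for selecting representative samples, so neither is shown.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The supremum is attained at one of the observed data points.
    for x in sorted(set(a) | set(b)):
        fa = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        fb = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(fa - fb))
    return d

# Illustrative normalized readings from two hypothetical devices.
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
d_same = ks_statistic([1, 2, 3], [1, 2, 3])
d_disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
```

A value near 0 indicates that the two devices produce similarly distributed data, while a value near 1 indicates almost no overlap.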

[0061] Step S2: Use SHAP to calculate, from the preprocessed data, the contribution of each feature to the prediction target, and divide the features according to their contribution.

[0062] Specifically, it includes the following steps:

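[0063] Step S21: Obtain the preprocessed soil erosion data $X$ and calculate the contribution of each feature $x_j$ to the prediction target $y$ as follows:

$$S_j = \frac{1}{N} \sum_{i=1}^{N} \left| \phi_j^{(i)} \right|$$

[0064] where $S_j$ is the global importance of feature $x_j$, $N$ is the number of samples in the dataset, and $\phi_j^{(i)}$ is the SHAP value of feature $x_j$ in the $i$-th sample.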

[0065] Step S22: Obtain the global importance $S_j$ of each feature, and calculate the mean $\mu_S$ and standard deviation $\sigma_S$ of the importance over all features. Features are categorized into key parameters, secondary parameters, and low-correlation parameters. Specifically:

[0066] Features with $S_j > \mu_S + \eta\,\sigma_S$ are classified as key parameters, features with $\mu_S - \eta\,\sigma_S \le S_j \le \mu_S + \eta\,\sigma_S$ are classified as secondary parameters, and features with $S_j < \mu_S - \eta\,\sigma_S$ are classified as low-correlation parameters, where $\eta$ is a regulating factor.
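Steps S21 and S22 can be sketched as follows, assuming the SHAP values have already been computed by an explainer and are available as a samples-by-features table. The SHAP numbers, the feature names, the mean-absolute-value aggregation, the default `factor` of 1.0, and the population standard deviation are all assumptions of this sketch rather than details given in the description.

```python
from statistics import mean, pstdev

def global_importance(shap_values):
    """Mean absolute SHAP value per feature; shap_values[i][j] = sample i, feature j."""
    n = len(shap_values)
    n_features = len(shap_values[0])
    return [sum(abs(row[j]) for row in shap_values) / n for j in range(n_features)]

def partition_features(names, importance, factor=1.0):
    """Split features into key / secondary / low groups around mean +/- factor * std."""
    mu, sigma = mean(importance), pstdev(importance)
    groups = {"key": [], "secondary": [], "low": []}
    for name, score in zip(names, importance):
        if score > mu + factor * sigma:
            groups["key"].append(name)
        elif score < mu - factor * sigma:
            groups["low"].append(name)
        else:
            groups["secondary"].append(name)
    return groups

# Hypothetical SHAP values for two samples and five features (illustrative only).
names = ["median_particle_size", "plasticity_index", "water_content", "liquid_limit", "pH"]
shap_values = [
    [0.9, -0.5, 0.4, -0.5, 0.05],
    [-1.1, 0.5, -0.6, 0.5, -0.05],
]
importance = global_importance(shap_values)
groups = partition_features(names, importance)
```

Raising `factor` widens the secondary band, so fewer features are promoted to key parameters or demoted to low-correlation parameters.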

[0067] Step S23: Based on the obtained feature partitioning, select the appropriate feature set to input into the subsequent prediction model. Figure 2 compares the fit between the predicted and actual values of soil erosion characteristics under the three prediction modes. Specifically: as shown in Figure 2(a), the emergency prediction mode uses only the key parameters, with a feature set that includes parameters such as median particle size and plasticity index, ensuring a response time of less than 30 seconds while providing reliable prediction accuracy; as shown in Figure 2(b), the regular prediction mode uses the key parameters and secondary parameters in combination, with a feature set that includes parameters such as liquid limit and water content, ensuring that the model returns prediction results within 2 minutes with high prediction accuracy; as shown in Figure 2(c), the precision prediction mode uses the key parameters, secondary parameters, and low-correlation parameters, with a feature set that includes parameters such as pH, conductivity, and tensile strength, ensuring that the model completes prediction within 5 minutes and provides the highest-accuracy results.
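The mode-dependent feature selection of step S23 amounts to a lookup from prediction mode to feature groups. The sketch below is illustrative: the dictionary names and the assignment of individual features to groups are assumptions based on the examples in the text, and the time budgets are the response times quoted above (30 seconds, 2 minutes, 5 minutes).

```python
# Which feature groups each prediction mode uses, per the description.
MODE_GROUPS = {
    "emergency": ("key",),
    "regular": ("key", "secondary"),
    "precision": ("key", "secondary", "low"),
}

# Response-time targets quoted in the description, in seconds.
MODE_TIME_BUDGET_S = {"emergency": 30, "regular": 120, "precision": 300}

def select_features(mode, groups):
    """Return the ordered feature list for a prediction mode."""
    if mode not in MODE_GROUPS:
        raise ValueError(f"unknown prediction mode: {mode!r}")
    return [name for group in MODE_GROUPS[mode] for name in groups[group]]

# Illustrative grouping; in practice this comes from the SHAP-based partition.
groups = {
    "key": ["median_particle_size", "plasticity_index"],
    "secondary": ["liquid_limit", "water_content"],
    "low": ["pH", "conductivity", "tensile_strength"],
}
emergency_features = select_features("emergency", groups)
precision_features = select_features("precision", groups)
```

Because each mode is a superset of the previous one, moving from emergency to precision mode only adds columns to the model input.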

[0068] Step S3: Select the corresponding feature set according to the current prediction mode, construct the CatBoost model, and use the weighted sum of prediction error and training time as the objective function.

[0069] Specifically, based on the current prediction mode, the partitioned feature set is obtained, a CatBoost model is constructed, and the objective function is defined as follows:

[0070] $$J(\theta) = \alpha \cdot \mathrm{RMSE}(\theta) + \beta \cdot T(\theta)$$

[0071] where $\theta$ is the set of hyperparameters of the CatBoost model, including learning rate, tree depth, number of iterations, regularization coefficient, etc.; $J(\theta)$ is the objective function of the CatBoost model; $\mathrm{RMSE}(\theta)$ is the root mean square error; $T(\theta)$ is the time required for training; and $\alpha$ and $\beta$ are weighting coefficients used to control the balance between accuracy and computational efficiency.
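The weighted objective of step S3 can be sketched as a wrapper that times training and scores the validation error. This is a hedged illustration: `train_mean_model` is a hypothetical stand-in for CatBoost training, and the weights `alpha` and `beta`, the `shrink` parameter, and the data are placeholders chosen only to make the example runnable.

```python
import math
import time

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def objective(train_fn, params, x_train, y_train, x_val, y_val, alpha=1.0, beta=0.01):
    """Return (alpha * RMSE + beta * training seconds, RMSE, training seconds)."""
    start = time.perf_counter()
    predict = train_fn(params, x_train, y_train)  # returns a prediction function
    elapsed = time.perf_counter() - start
    error = rmse(y_val, [predict(x) for x in x_val])
    return alpha * error + beta * elapsed, error, elapsed

def train_mean_model(params, x_train, y_train):
    """Hypothetical stand-in for model training: predicts a scaled mean of y."""
    mean_y = sum(y_train) / len(y_train)
    return lambda x: params["shrink"] * mean_y

score, error, elapsed = objective(
    train_mean_model,
    {"shrink": 1.0},
    x_train=[[0.1], [0.2], [0.3]],
    y_train=[1.0, 2.0, 3.0],
    x_val=[[0.15], [0.25]],
    y_val=[1.0, 3.0],
)
```

A larger `beta` penalizes slow configurations more strongly, which is how an emergency mode could be steered toward faster models than a precision mode.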

[0072] Step S4: Use the Bayesian optimization method to optimize the hyperparameters of the CatBoost model to obtain the optimal hyperparameters;

[0073] Specifically, it includes the following steps:

[0074] Step S41: Based on the model hyperparameters $\theta$ and the objective function $J(\theta)$, Gaussian process regression is used as a surrogate model to establish a mapping between hyperparameters and the objective function, specifically as follows:

[0075] $$J(\theta) \sim \mathcal{GP}\big(m(\theta),\, k(\theta, \theta')\big)$$

[0076] where $\mathcal{GP}$ denotes a Gaussian process, $m(\theta)$ is the mean function, and $k(\theta, \theta')$ is the covariance function.

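[0077] Step S42: Based on the mapping between hyperparameters and the objective function, the optimal hyperparameters are selected using the expected improvement (EI) criterion, specifically as follows:

[0078] $$\theta_{\text{next}} = \arg\max_{\theta} \mathrm{EI}(\theta), \qquad \mathrm{EI}(\theta) = \mathbb{E}\big[\max\big(J_{\text{best}} - J(\theta),\, 0\big)\big]$$

[0079] where $J_{\text{best}}$ represents the current optimal objective function value.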
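For a Gaussian surrogate, the expected improvement has a closed form in the posterior mean and standard deviation at a candidate point. The sketch below assumes the objective is minimized (it is an error-plus-time cost) and that the surrogate's posterior values are already available; the candidate labels and numbers are hypothetical.

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization under a Gaussian posterior N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mu) * cdf + sigma * pdf

# Hypothetical surrogate posteriors (mean, std) for three candidate settings.
candidates = {
    "depth=4": (0.9, 0.05),
    "depth=6": (1.2, 0.4),
    "depth=8": (1.0, 0.1),
}
best_so_far = 1.0
scores = {
    name: expected_improvement(mu, sigma, best_so_far)
    for name, (mu, sigma) in candidates.items()
}
next_candidate = max(scores, key=scores.get)
```

The criterion rewards both a low predicted cost and high uncertainty, so a candidate with a slightly worse mean can still be chosen if the surrogate is unsure about it.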

[0080] Step S5: Using the preprocessed soil erosion data as input, train the CatBoost model using the optimal hyperparameters, and evaluate the model performance through cross-validation.

[0081] Step S51: Obtain the preprocessed soil erosion data, including soil characteristic parameters and soil erosion performance indices, and divide them into an input feature matrix $X$ (soil characteristic parameters) and a target vector $y$ (soil erosion performance indices). The dataset is divided into K subsets using nested cross-validation, with one subset serving as the validation set and the remaining subsets as the training set, thus obtaining the training-set features $X_{\text{train}}$ with their target values $y_{\text{train}}$ and the validation-set features $X_{\text{val}}$ with their target values $y_{\text{val}}$.

[0082] Step S52: Initialize the CatBoost model with the optimal hyperparameters $\theta^{*}$.

[0083] Step S53: Train the CatBoost model from step S52 on the training-set features $X_{\text{train}}$ and target values $y_{\text{train}}$ from step S51, adopting an early stopping mechanism to prevent overfitting. Specifically, if the validation-set error does not improve significantly within a specified number of rounds, training is stopped early.
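The early stopping rule of step S53 can be illustrated on a recorded sequence of per-round validation errors. CatBoost ships its own overfitting detector, so this sketch only demonstrates the rule itself; the `patience` and `min_delta` values and the error sequence are assumptions, since the description does not quantify "a specified number of rounds" or "significantly".

```python
def early_stop(val_errors, patience=3, min_delta=1e-4):
    """Return (stop_round, best_round) for a list of per-round validation errors."""
    best_err = float("inf")
    best_round = -1
    for i, err in enumerate(val_errors):
        if err < best_err - min_delta:
            # Significant improvement: remember this round as the best so far.
            best_err, best_round = err, i
        elif i - best_round >= patience:
            # No significant improvement for `patience` rounds: stop here.
            return i, best_round
    return len(val_errors) - 1, best_round

# Illustrative validation RMSE per boosting round.
errors = [1.0, 0.8, 0.7, 0.71, 0.705, 0.72, 0.69]
stop_round, best_round = early_stop(errors)
```

With these values training halts at round 5 and the model from round 2 is kept; the later improvement at round 6 is never reached, which is the usual trade-off of a small patience.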

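[0084] Step S54: Using the validation-set features $X_{\text{val}}$ and target values $y_{\text{val}}$ obtained in step S51, monitor the training process to ensure the stability of model performance. Specifically, the evaluation metrics are the root mean square error and the coefficient of determination, which are calculated as follows:

[0085] $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$

[0086] where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, $\bar{y}$ is the average of the actual values, and $n$ is the number of validation samples.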
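The two evaluation metrics of step S54 can be computed as follows. The function names and the sample values are illustrative only.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Illustrative true and predicted erosion indices.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]
fold_rmse = rmse(y_true, y_pred)
fold_r2 = r2_score(y_true, y_pred)
```

Here the residual sum of squares is 0.5 and the total sum of squares is 5.0, giving an RMSE of about 0.354 and an R² of 0.9.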

[0087] Step S55: Repeat steps S53 to S54 for every training-set and validation-set split obtained in step S51 until all K folds have been processed, thereby obtaining the final CatBoost model.
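The fold loop of step S55 can be sketched with a simple index splitter. This is a plain, unshuffled K-fold split written for illustration; it does not implement the outer/inner structure of nested cross-validation mentioned in step S51, and the function name and sizes are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs so that each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    # Here one would train on train_idx (step S53) and evaluate on val_idx (step S54).
    pass
```

In practice the sample order would usually be shuffled before splitting so that each fold mixes data from all devices.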

[0088] Example 2

[0089] This embodiment also provides a multi-level prediction system for soil erosion characteristics based on Bayesian optimized CatBoost, including:

[0090] Data acquisition module: Acquires soil erosion data and preprocesses the data;

[0091] Feature classification module: SHAP is used to calculate the contribution of features to the prediction target from the preprocessed data, and the features are classified according to their contribution.

[0092] Model building module: Select the corresponding feature subset based on the current prediction mode, build the CatBoost model, and use the weighted sum of prediction error and training time as the objective function;

[0093] Parameter optimization module: The hyperparameters of the CatBoost model are optimized using the Bayesian optimization method to obtain the optimal hyperparameters;

[0094] Model training module: Using preprocessed soil erosion data as input, the CatBoost model is trained using the optimal hyperparameters, and the model performance is evaluated through cross-validation.

[0095] This embodiment is based on the same inventive concept as Embodiment 1, and will not be repeated here.

[0096] Example 3

[0097] This embodiment also provides a computer device, including one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the programs, when executed by the processors, implement the steps of the multi-level prediction method for soil erosion characteristics based on Bayesian optimization CatBoost.

[0098] Example 4

[0099] This embodiment also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the multi-level prediction method for soil erosion characteristics based on Bayesian optimization CatBoost.

Claims

1. A multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost, characterized in that it comprises: acquiring soil erosion data and preprocessing the data; using SHAP to calculate, from the preprocessed data, the contribution of each feature to the prediction target, and dividing the features according to their contribution; selecting the corresponding feature set based on the current prediction mode, constructing the CatBoost model, and using the weighted sum of prediction error and training time as the objective function; optimizing the hyperparameters of the CatBoost model using the Bayesian optimization method to obtain the optimal hyperparameters; and using the preprocessed soil erosion data as input, training the CatBoost model with the optimal hyperparameters, and evaluating the model performance using K-fold cross-validation.

2. The multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in claim 1, characterized in that the soil erosion data includes soil erosion performance indicators and soil characteristic parameters, and the preprocessing includes normalizing the soil erosion performance indicators and soil characteristic parameters, using the optimal transport mapping method to map the source domain data to the target domain distribution for soil category feature data, using the KS test to check the consistency of data distributions from different devices, and selecting representative samples.

3. The multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in claim 2, characterized in that the optimal transport mapping method includes constructing a cost matrix, wherein the elements of the cost matrix represent the distance between a source domain sample and a target domain sample in the feature space, and obtaining the mapping weights by solving the optimal transport problem with entropy regularization.

4. The multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in claim 1, characterized in that dividing the features according to contribution includes: obtaining the preprocessed soil erosion data and calculating the mean and standard deviation of the contribution of all features; classifying features whose contribution is higher than the mean plus a set multiple of the standard deviation as key parameters; classifying features whose contribution lies between the mean minus and the mean plus that multiple of the standard deviation as secondary parameters; and classifying features whose contribution is lower than the mean minus that multiple of the standard deviation as low-correlation parameters.

5. The multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in claim 1, characterized in that the prediction modes include an emergency prediction mode, a regular prediction mode, and a precision prediction mode.

6. The multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in claim 1, characterized in that the Bayesian optimization uses Gaussian process regression as a surrogate model to establish a mapping between hyperparameters and the objective function, and selects the optimal hyperparameters using the expected improvement criterion.

7. The multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in claim 1, characterized in that training the CatBoost model with the optimal hyperparameters and evaluating its performance through K-fold cross-validation includes: dividing the preprocessed soil erosion data into soil characteristic parameters and soil erosion performance indicators, which are used as the input feature matrix and the target vector, respectively, and dividing the data into a validation set and a training set through nested cross-validation; initializing the CatBoost model using the optimal hyperparameters; inputting the training set into the CatBoost model to train the model, with an early stopping mechanism used to prevent overfitting; and validating the model performance on the validation set, using the root mean square error and coefficient of determination to evaluate the model.

8. A multi-level prediction system for soil erosion characteristics based on Bayesian-optimized CatBoost, characterized in that it comprises: a data acquisition module, which acquires soil erosion data and preprocesses the data; a feature classification module, which uses SHAP to calculate, from the preprocessed data, the contribution of each feature to the prediction target and classifies the features according to their contribution; a model building module, which selects the corresponding feature set according to the current prediction mode, builds the CatBoost model, and uses the weighted sum of prediction error and training time as the objective function; a parameter optimization module, which optimizes the hyperparameters of the CatBoost model using the Bayesian optimization method to obtain the optimal hyperparameters; and a model training module, which uses the preprocessed soil erosion data as input, trains the CatBoost model with the optimal hyperparameters, and evaluates the model performance through K-fold cross-validation.

9. A computer device, characterized in that it comprises one or more processors, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the programs, when executed by the processors, implement the steps of the multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in any one of claims 1-7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the multi-level prediction method for soil erosion characteristics based on Bayesian-optimized CatBoost as described in any one of claims 1-7.