Factor dimension reduction and error correction method and system for runoff prediction
By screening and evaluating the correlation and contribution of runoff forecasting factors, and combining empirical mode decomposition and autoregressive models, the problems of factor redundancy and insufficient error correction in medium- and long-term runoff forecasting were solved, achieving higher forecast accuracy and model simplicity.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUAZHONG UNIV OF SCI & TECH
- Filing Date
- 2023-11-22
- Publication Date
- 2026-06-26
Abstract
Description
Technical Field
[0001] This invention relates to the field of runoff forecasting technology, and in particular to a method and system for factor dimensionality reduction and error correction for runoff forecasting.
Background Technology
[0002] Medium- and long-term runoff forecasting methods often use historical runoff or meteorological data as model drivers, neglecting the interactive effects of global teleconnection factors. This approach fails to deeply analyze the physical mechanisms of runoff circulation and offers limited improvement in forecast accuracy. However, watershed runoff processes are the result of multiple factors, including runoff, temperature, precipitation, and atmospheric circulation. Including all of these in the modeling process leads to data redundancy, model complexity, and limited effectiveness in revealing the objective laws and mechanisms of runoff formation and change. As the number of forecasting factors increases, the complexity of model inputs and the risk of overfitting rise sharply. Therefore, dimensionality reduction of characteristic variables is fundamental to the construction of medium- and long-term runoff forecasting models. Selecting factor sequences with a high degree of correlation to runoff processes is crucial for improving forecast accuracy. Common methods for selecting forecasting factors include correlation coefficients, grey relational analysis, and mutual information methods. These methods often analyze the linear and nonlinear relationships between two variables, neglecting the contribution of characteristic variables to the prediction, thus offering limited reference value for practical forecasting model construction.
[0003] Error correction is a supplementary step in hydrological forecasting that improves forecast accuracy and reliability and enhances model applicability and generalization ability. Common error correction methods include autoregression, least squares, series-parallel coupling, and Kalman filtering. These methods reduce errors and improve accuracy by processing the forecast residual sequence before prediction. However, they do not sufficiently analyze the regularity of the forecast residual sequence, which makes it difficult to uncover its patterns and limits the improvement in forecast accuracy. Mode decomposition can handle non-stationary and nonlinear signals and identify time-varying features in the error sequence, which gives it clear advantages in preprocessing forecast residuals. To a certain extent, it can better uncover the latent information in the forecast residual sequence and weaken redundant information in the original sequence, thereby improving forecast accuracy.
[0004] To address this, this invention proposes a method for dimensionality reduction and error correction of feature variables for runoff forecasting. By introducing a set of factors comprising historical runoff, meteorological factors, and teleconnection factors such as circulation indices, a medium- to long-term runoff forecasting model based on an intelligent learning algorithm is constructed. Furthermore, an error correction framework of "decomposition-prediction-reconstruction" is proposed for the runoff forecast residuals, which improves the accuracy of watershed runoff forecasting while reducing the dimensionality of the model input and the risk of overfitting.
Summary of the Invention
[0005] The purpose of this invention is to disclose a method and system for factor dimensionality reduction and error correction for runoff forecasting, so as to significantly improve the accuracy of runoff forecasting.
[0006] To achieve the above objectives, the method disclosed in this invention includes:
[0007] Step S1: Obtain the forecast factor data set for the target study area;
[0008] Step S2: Calculate the correlation coefficient between each forecast factor and runoff data in the dataset using the Pearson correlation coefficient, and select a set of strongly correlated factors;
[0009] Step S3: Use the extreme gradient boosting tree (XGBoost) to evaluate the predictive importance of each factor in the strongly correlated factor set, and sort the factors in descending order of importance;
[0010] Step S4: Construct a runoff forecast model. Use the strongly correlated factor set sorted in Step S3 as the model input. Calculate the Nash coefficient and the root mean square error under 5 to 20 input dimensions. Determine the final input dimension and the runoff forecast values on the principle that a larger Nash coefficient and a smaller root mean square error indicate a better forecast;
[0011] Step S5: Decompose the forecast residuals using ensemble empirical mode decomposition to obtain modal components and residual components;
[0012] Step S6: Use an autoregressive model to predict each modal component and the residual component separately;
[0013] Step S7: Superimpose and restore each modal component and the remaining component of the forecast residual, and add them to the original predicted value to finally obtain the sample predicted value after error correction.
[0014] To achieve the above objectives, the present invention also discloses a factor dimensionality reduction and error correction system for runoff forecasting, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above-described method when executing the computer program.
[0015] The present invention has the following beneficial effects:
[0016] Historical runoff data, meteorological data, and teleconnection factors such as circulation indices can be incorporated into runoff forecasting, enriching the input features of medium- and long-term runoff forecasting models. This helps reveal the objective laws and mechanisms of runoff formation and change, and provides a deeper analysis of runoff circulation mechanisms. A dimensionality reduction method for multi-dimensional feature variables in runoff forecasting is proposed: Pearson correlation coefficients and the extreme gradient boosting tree (XGBoost) are combined to evaluate the contribution of each feature variable to the forecast. This reduces the dimensionality of the model input while improving runoff forecast accuracy. Finally, an error correction framework that couples ensemble empirical mode decomposition with autoregression is proposed for the forecast residuals. By performing "mode decomposition-AR prediction-mode reconstruction" on the runoff forecast residuals, the resulting modal components reflect the signal trends and characteristics at different frequencies, making it easier to uncover the intrinsic laws of the runoff forecast residuals. Compared to commonly used error correction methods, this approach significantly improves runoff forecast accuracy.
[0017] Therefore, the present invention has a clear concept, is easy to operate, and is highly practical.
[0018] The present invention will now be described in further detail with reference to the accompanying drawings.
Description of the Drawings
[0019] The accompanying drawings, which form part of this application, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
[0020] Figure 1 is a flowchart of the method for factor dimensionality reduction and error correction for runoff forecasting disclosed in an embodiment of the present invention.
[0021] Figure 2 is an overview flowchart of the method shown in Figure 1.
Detailed Implementation
[0022] The embodiments of the present invention will be described in detail below with reference to the accompanying drawings, but the present invention can be implemented in many different ways as defined and covered by the claims.
[0023] Embodiment 1
[0024] This embodiment discloses a method for factor dimensionality reduction and error correction for runoff forecasting. Referring to Figure 1, the method includes the following steps:
[0025] (1) Obtain the forecast factor data set for the target study area.
[0026] In this step, the dataset specifically refers to an air-sea-land factor dataset. It can be obtained by collecting monthly historical runoff data from the target hydrological station, a dataset of 130 climate indices, and meteorological station data within its control section. The meteorological station data mainly includes average dew point temperature, average precipitation, average station pressure, average temperature, average visibility, and average wind speed. The 130 climate indices set mainly includes 88 monthly atmospheric circulation indices, 26 monthly sea surface temperature indices, and 16 other monthly indices.
[0027] (2) Calculate the correlation coefficient between each forecast factor in the dataset and the runoff data using the Pearson correlation coefficient, and select the set of strongly correlated factors. The correlation coefficient is calculated as follows:
[0028] $$r = \frac{\sum_{i=1}^{z}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{z}(X_i-\bar{X})^{2}}\,\sqrt{\sum_{i=1}^{z}(Y_i-\bar{Y})^{2}}} \tag{1}$$
[0029] In the formula, $\bar{X}$ is the mean of the data sequence X, $\bar{Y}$ is the mean of the data sequence Y, and z is the length of the data sequence. The Pearson correlation coefficient r lies in the range [-1, 1]. The closer |r| is to 1, the stronger the correlation between the variables X and Y. In this embodiment, a factor with |r| ≥ 0.6 is considered a strongly correlated factor.
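The screening in step (2) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout of the factor set, and the toy numbers are all assumptions made for the example. Only the |r| ≥ 0.6 threshold comes from the text.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
    if denom == 0.0:
        return 0.0  # assumption: treat a constant sequence as uncorrelated
    return float((dx * dy).sum() / denom)

def select_strong_factors(factors, runoff, threshold=0.6):
    """Return the names of factors whose |r| with runoff is >= threshold.

    `factors` maps a factor name to its sequence (hypothetical layout).
    """
    return [name for name, series in factors.items()
            if abs(pearson_r(series, runoff)) >= threshold]

# Toy data, invented for illustration only.
runoff = np.array([10.0, 12.0, 15.0, 11.0, 18.0, 20.0])
factors = {
    "precip": 2.0 * runoff + 1.0,     # perfectly correlated with runoff
    "pressure": -runoff + 50.0,       # perfectly anti-correlated
    "noise": np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]),
}
strong = select_strong_factors(factors, runoff)
```

Because the threshold is applied to |r|, strongly negative factors such as the toy `pressure` series are retained alongside positive ones.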
[0030] (3) Use the extreme gradient boosting tree (XGBoost) to evaluate the importance of each factor in the set of strongly correlated factors.
[0031] Extreme Gradient Boosting (XGBoost) is an improvement over the Gradient Boosting Decision Tree (GBDT). It uses multi-threading to parallelize the construction of regression trees, which improves computational speed and efficiency. XGBoost performs a second-order Taylor expansion of the loss function, resulting in higher computational accuracy. Furthermore, XGBoost incorporates a regularization term into the objective function, which helps prevent overfitting. The XGBoost objective function $L^{(K)}$ is as follows:
[0032] $$L^{(K)} = \sum_{i=1}^{v} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K}\Omega(f_k), \qquad \hat{y}_i = \sum_{k=1}^{K} f_k(x_i) \tag{2}$$
[0033] In the formula, l is the loss function, representing the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$; $\Omega(f_k)$ is a regularization function used to control model complexity; $x_i$ is the feature variable; $f_K(x_i)$ represents the prediction of the K-th tree for $x_i$; v is the number of training samples; and K is the number of decision trees.
[0034] The regularization calculation formula is as follows:
[0035] $$\Omega(f_k) = \gamma T + \frac{1}{2}\lambda\sum_{j=1}^{T}\omega_j^{2} \tag{3}$$
[0036] In the formula, T is the number of leaf nodes of the k-th tree, λ and γ are the hyperparameters of the regularization term, and $\omega_j$ is the weight of the j-th leaf node of the k-th tree.
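[0037] A second-order Taylor expansion of formula (2) yields an approximate loss function. In XGBoost, this approximate loss is used to score a generated tree; a smaller score indicates a better tree structure. Gain is the key metric for selecting the optimal split point and is calculated as follows:
[0038] $$Gain = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^{2}}{\sum_{i\in I_L} h_i+\lambda}+\frac{\left(\sum_{i\in I_R} g_i\right)^{2}}{\sum_{i\in I_R} h_i+\lambda}-\frac{\left(\sum_{i\in I} g_i\right)^{2}}{\sum_{i\in I} h_i+\lambda}\right]-\gamma \tag{4}$$
[0039] In the formula, $I_L$ and $I_R$ represent the sample spaces of the left and right nodes after the split, respectively, and $I = I_L \cup I_R$ is the sample space of the node before the split; $g_i$ and $h_i$ are the first- and second-order derivatives of the loss function with respect to the prediction for sample i, as obtained from the Taylor expansion.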
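To make the split gain concrete, the sketch below evaluates the standard XGBoost gain, one half of [G_L²/(H_L+λ) + G_R²/(H_R+λ) − G²/(H+λ)] minus γ, where G and H are sums of first- and second-order loss derivatives over a node's samples. It assumes a squared-error loss l = ½(y − ŷ)², for which g_i = ŷ_i − y_i and h_i = 1. The function name and toy data are illustrative assumptions and are not taken from the patent.

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of splitting one node into left/right children (XGBoost-style).

    g, h      : first- and second-order loss derivatives per sample
    left_mask : True for samples sent to the left child
    lam, gamma: regularization hyperparameters (lambda and gamma)
    """
    g = np.asarray(g, dtype=float)
    h = np.asarray(h, dtype=float)
    left_mask = np.asarray(left_mask, dtype=bool)

    def score(gs, hs):
        # Structure score of one node: (sum g)^2 / (sum h + lambda)
        return gs.sum() ** 2 / (hs.sum() + lam)

    left = score(g[left_mask], h[left_mask])
    right = score(g[~left_mask], h[~left_mask])
    parent = score(g, h)
    return 0.5 * (left + right - parent) - gamma

# Toy node: two low targets and two high targets, current prediction 3.0.
y = np.array([1.0, 1.0, 5.0, 5.0])
pred = np.full(4, 3.0)
g = pred - y            # squared-error gradient (assumed loss)
h = np.ones(4)          # squared-error Hessian (assumed loss)

good = split_gain(g, h, [True, True, False, False])  # separates low from high
bad = split_gain(g, h, [True, False, True, False])   # mixes them
```

A split that separates the low targets from the high ones earns a positive gain, while a split that mixes them earns none, which is why gain is a sensible proxy for how useful a feature's split is.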
[0040] After a node split, the XGBoost algorithm divides the node's sample space into two disjoint spaces. Among all current leaf nodes, the node with the largest gain is selected as the split point. For each feature variable, XGBoost calculates the average gain over the tree nodes that split on that feature and then normalizes these average gain values to obtain the relative importance score of the feature. The importance score is calculated as follows:
[0041] $$VIM_i = \frac{\overline{Gain}_i}{\sum_{j=1}^{N}\overline{Gain}_j} \tag{5}$$
[0042] In the formula, $VIM_i$ is the importance assessment value of the i-th feature variable, $\overline{Gain}_i$ is the average gain of the i-th feature variable over the tree nodes, and N is the number of feature variables.
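The normalization described here (average the gain per feature, then divide by the sum of the averages) can be sketched as follows. The input layout, a dictionary from feature name to the list of gains recorded at the nodes that split on that feature, is a hypothetical convenience for the example, as are the feature names and numbers. A real XGBoost model would expose these gains through its own interface.

```python
import numpy as np

def relative_importance(split_gains):
    """Normalize per-feature average gains so that they sum to 1.

    split_gains: dict mapping feature name -> list of gains at the nodes
                 that split on that feature (hypothetical layout).
    Returns a dict ordered by descending importance.
    """
    avg = {name: (float(np.mean(g)) if len(g) else 0.0)
           for name, g in split_gains.items()}
    total = sum(avg.values())
    if total == 0.0:
        return {name: 0.0 for name in avg}  # no splits recorded at all
    vim = {name: a / total for name, a in avg.items()}
    # Sort in descending order of importance, as step S3 requires.
    return dict(sorted(vim.items(), key=lambda kv: kv[1], reverse=True))

# Made-up gains for three candidate factors.
ranked = relative_importance({
    "sst_index": [1.0],
    "precip": [4.0, 2.0],
    "pressure": [],          # never used in a split
})
```

The ordered keys of `ranked` give the factor ordering that the next step consumes when it tries the top 5 to 20 inputs.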
[0043] (4) Construct a runoff forecast model, take the strongly correlated factor set sorted in the above step as the model input, and calculate the Nash coefficient and the root mean square error under 5 to 20 input dimensions; determine the final input dimension and the runoff forecast values on the principle that a larger Nash coefficient and a smaller root mean square error indicate a better forecast.
[0044] In evaluating forecast performance using the Nash coefficient (NSE) and root mean square error (RMSE), a larger NSE and a smaller RMSE indicate better model forecasting, thus determining the optimal input dimension. The evaluation metrics are calculated as follows:
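[0045] $$NSE = 1-\frac{\sum_{i=1}^{n}\left(Q_i-\hat{Q}_i\right)^{2}}{\sum_{i=1}^{n}\left(Q_i-\bar{Q}\right)^{2}} \tag{6}$$
[0046] $$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Q_i-\hat{Q}_i\right)^{2}} \tag{7}$$
[0047] In the formula, $Q_i$ is the measured value of the hydrological element, $\hat{Q}_i$ is the predicted value of the hydrological element, n is the length of the forecast sequence, and $\bar{Q}$ is the mean of the measured values over the forecast sequence.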
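The two metrics, and one possible way of using them to choose the input dimension, can be sketched as below. NSE is one minus the ratio of the squared forecast error to the squared deviation of the measurements from their mean, and RMSE is the square root of the mean squared error. The selection rule in `pick_dimension` (largest NSE, ties broken by smallest RMSE) is an assumption, since the text states only the general principle. The scores in the example are invented, not results from the patent.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 is no better than the mean."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return float(1.0 - ((obs - sim) ** 2).sum() / ((obs - obs.mean()) ** 2).sum())

def rmse(obs, sim):
    """Root mean square error, in the units of the data."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return float(np.sqrt(np.mean((obs - sim) ** 2)))

def pick_dimension(scores):
    """Choose the input dimension with the largest NSE, then the smallest RMSE.

    scores: dict mapping input dimension -> (nse, rmse). Assumed tie-break rule.
    """
    return max(scores, key=lambda d: (scores[d][0], -scores[d][1]))

obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.0, 2.0, 3.0, 5.0]
example_nse = nse(obs, sim)     # 1 - 1/5
example_rmse = rmse(obs, sim)   # sqrt(1/4)

# Invented scores for three candidate input dimensions.
best = pick_dimension({5: (0.80, 0.50), 6: (0.85, 0.40), 7: (0.85, 0.45)})
```

In practice the `scores` dictionary would be filled by training the chosen forecast model once per candidate dimension from 5 to 20 and evaluating it on held-out data.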
[0048] In the selection of specific runoff forecasting models, this embodiment may use any one of the following models, or at least two of them combined, such as Gaussian process regression (GPR), long short-term memory neural network (LSTM), and support vector machine (SVM), to construct the runoff forecasting model.
[0049] (5) The forecast residuals are decomposed by ensemble empirical mode decomposition (EEMD) to obtain modal components.
[0050] EEMD adds Gaussian white noise to the signal before applying Empirical Mode Decomposition (EMD) and repeats the decomposition multiple times. The ensemble average of the Intrinsic Mode Functions (IMFs) from the multiple decompositions is defined as the final IMF, which effectively suppresses mode mixing. The decomposition result is as follows:
[0051] $$x(t) = \sum_{j=1}^{s} imf_j(t) + c(t) \tag{8}$$
[0052] In the formula, t is time, s is the number of modes, x(t) is the original signal, c(t) is the residual component, and $imf_j(t)$ is the j-th modal component (IMF) obtained from the EMD decomposition.
[0053] EEMD performs M EMD trials on the noise-added signal and takes the mean of the IMF components and of the residual components over all trials. The calculation formulas are as follows:
[0054] $$\overline{imf}_j(t) = \frac{1}{M}\sum_{m=1}^{M} imf_{j,m}(t) \tag{9}$$
[0055] $$\bar{c}(t) = \frac{1}{M}\sum_{m=1}^{M} c_m(t) \tag{10}$$
[0056] In the formula, M is the total number of EMD trials; $imf_{j,m}(t)$ and $c_m(t)$ are the j-th IMF component and the residual component obtained in the m-th trial; $\overline{imf}_j(t)$ and $\bar{c}(t)$ are the j-th IMF component and the residual component obtained from the EEMD decomposition, respectively. Generally, the IMF components obtained from the EEMD decomposition range from high-frequency to low-frequency components of the data. Observing these IMF components therefore reveals the characteristics and trends of the data at different time and frequency scales.
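The ensemble-averaging step of EEMD (add white noise, decompose, repeat M times, average the components) can be sketched as follows. Note that `toy_decompose` is NOT empirical mode decomposition. It is a deliberately simple moving-average stand-in, used so that the example runs without an EMD library. A real implementation would pass an actual EMD sifting routine in its place. The noise scale, trial count, and the assumption that every trial returns the same number of components are also illustrative choices.

```python
import numpy as np

def toy_decompose(signal, n_modes=2):
    """Stand-in for EMD (not real sifting): peel off detail with moving averages."""
    comps = []
    current = signal.copy()
    for j in range(n_modes):
        width = 2 ** (j + 1) + 1                      # 3-point, then 5-point window
        kernel = np.ones(width) / width
        padded = np.pad(current, width // 2, mode="edge")
        smooth = np.convolve(padded, kernel, mode="valid")
        comps.append(current - smooth)                # higher-frequency detail
        current = smooth
    return np.array(comps), current                   # (modes, residual)

def eemd_average(signal, decompose, trials=50, noise_std=0.2, seed=0):
    """Average the components of `trials` decompositions of signal + white noise."""
    rng = np.random.default_rng(seed)
    signal = np.asarray(signal, dtype=float)
    scale = noise_std * signal.std()                  # noise relative to signal spread
    imf_sum, res_sum = None, None
    for _ in range(trials):
        noisy = signal + rng.normal(0.0, scale, size=signal.shape)
        imfs, res = decompose(noisy)
        imf_sum = imfs if imf_sum is None else imf_sum + imfs
        res_sum = res if res_sum is None else res_sum + res
    return imf_sum / trials, res_sum / trials

t = np.linspace(0.0, 1.0, 200)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * t               # toy "residual" series
imfs, res = eemd_average(x, toy_decompose)
recon = imfs.sum(axis=0) + res                        # approximately x
```

The reconstruction differs from the input only by the averaged noise, which shrinks as the number of trials grows. This is the property that lets the averaged components stand in for the original residual series.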
[0057] (6) The autoregressive (AR) model is used to predict each modal component and the residual component separately. The AR calculation formula is as follows:
[0058] $$X_t = \varphi_0 + \sum_{i=1}^{p}\varphi_i X_{t-i} + \varepsilon_t \tag{11}$$
[0059] In the formula, $X_t$ is the time series data, $\varphi_0$ is a constant term, $\varphi_i$ is the autoregressive coefficient, p is the autoregressive order, and $\varepsilon_t$ is a white noise signal that follows a normal distribution with mean 0 and variance $\sigma^{2}$.
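An AR(p) model of this form, a constant plus a weighted sum of the previous p values plus white noise, can be fitted to one component by ordinary least squares, as sketched below. The patent does not state how the coefficients are estimated, so least squares is an assumption here, and the function names are hypothetical.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares estimate of [constant, coef_1, ..., coef_p] for AR(p)."""
    x = np.asarray(series, dtype=float)
    # Each row: [1, x_{t-1}, x_{t-2}, ..., x_{t-p}] for the target x_t.
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))]
    design = np.array(rows)
    target = x[p:]
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict_next(series, coef):
    """One-step-ahead prediction from the last p values of the series."""
    p = len(coef) - 1
    x = np.asarray(series, dtype=float)
    lags = x[-p:][::-1]                 # x_{t-1}, ..., x_{t-p}
    return float(coef[0] + coef[1:] @ lags)

# Noise-free toy series generated by x_t = 1 + 0.5 * x_{t-1}.
x = [0.0]
for _ in range(9):
    x.append(1.0 + 0.5 * x[-1])

coef = fit_ar(x, p=1)                   # recovers roughly [1.0, 0.5]
nxt = predict_next(x, coef)
```

In the workflow described here, one such model would be fitted per modal component and one for the residual component, each with its own order p.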
[0060] (7) The modal components and residual components of the forecast residual are superimposed and restored, and added to the original forecast value to finally obtain the sample forecast value after error correction.
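The reconstruction in step (7) amounts to summing the predicted components and adding the result to the model's forecast. The sketch below assumes the residual is defined as measured value minus forecast value, which is what makes addition the right correction. The function name and numbers are illustrative only.

```python
import numpy as np

def corrected_forecast(original_forecast, component_predictions):
    """Add the reconstructed residual prediction to the original forecast.

    component_predictions: one row per predicted IMF or residual component,
    one column per forecast period.
    """
    predicted_residual = np.sum(
        np.asarray(component_predictions, dtype=float), axis=0)
    return np.asarray(original_forecast, dtype=float) + predicted_residual

forecast = [100.0, 120.0]               # made-up model forecasts
components = [[2.0, -1.0],              # predicted IMF 1
              [0.5, 0.5],               # predicted IMF 2
              [-1.0, 3.0]]              # predicted residual component
corrected = corrected_forecast(forecast, components)
```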
[0061] Embodiment 2
[0062] This embodiment discloses a factor dimensionality reduction and error correction system for runoff forecasting, including a memory, a processor, and a computer program stored in the memory and executable on the processor. The system is characterized in that the processor, when executing the computer program, implements a series of steps corresponding to the method in Embodiment 1 above. As shown in Figure 2, the series of steps can be summarized as follows:
[0063] Step S1: Obtain the forecast factor data set for the target study area;
[0064] Step S2: Calculate the correlation coefficient between each forecast factor and runoff data in the dataset using the Pearson correlation coefficient, and select a set of strongly correlated factors;
[0065] Step S3: Use the extreme gradient boosting tree (XGBoost) to evaluate the predictive importance of each factor in the strongly correlated factor set, and sort the factors in descending order of importance;
[0066] Step S4: Construct a runoff forecast model. Use the strongly correlated factor set sorted in Step S3 as the model input. Calculate the Nash coefficient and the root mean square error under 5 to 20 input dimensions. Determine the final input dimension and the runoff forecast values on the principle that a larger Nash coefficient and a smaller root mean square error indicate a better forecast;
[0067] Step S5: Decompose the forecast residuals using ensemble empirical mode decomposition to obtain modal components and residual components;
[0068] Step S6: Use an autoregressive model to predict each modal component and the residual component separately;
[0069] Step S7: Superimpose and restore each modal component and the remaining component of the forecast residual, and add them to the original predicted value to finally obtain the sample predicted value after error correction.
[0070] The methods and systems disclosed in the above embodiments of the present invention have achieved high forecasting accuracy in multiple tests at different hydrological stations in multiple regions. Specifically, when hindcasting on historical data, the predicted values are essentially simulations of historical runoff: the measured values are known, so no future information is involved. To predict future values, for which the measured values and residuals are unknown, the present invention uses rolling forecasting. Future forecast values are predicted by the forecasting model, and the residuals are predicted by the AR model, which together yield the corrected future forecast values. A single forecast typically yields the value for only one future period. When multiple future periods must be predicted, rolling forecasting is applied: forecast values for earlier periods are used as forecasting factors for subsequent periods. This rolling process can in principle continue indefinitely, but the prediction error grows significantly as the forecast horizon lengthens.
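Rolling forecasting with an AR model can be sketched as a loop in which each predicted value is appended to the lag window and reused as an input for the next period. The coefficient layout (constant first, then the lag coefficients) and the function name are assumptions made for the example, and the order p is assumed to be at least 1.

```python
import numpy as np

def rolling_ar_forecast(history, coef, steps):
    """Multi-step forecast: feed each prediction back in as the newest lag.

    history: observed series, at least p values long
    coef   : [constant, coef_1, ..., coef_p] (assumed layout, p >= 1)
    steps  : number of future periods to predict
    """
    p = len(coef) - 1
    window = list(np.asarray(history, dtype=float)[-p:])   # oldest ... newest
    out = []
    for _ in range(steps):
        lags = window[::-1]                                 # newest first
        nxt = float(coef[0] + np.dot(coef[1:], lags))
        out.append(nxt)
        window = window[1:] + [nxt]                         # slide the window
    return out

# AR(1) with constant 1.0 and coefficient 0.5, starting from a last value of 4.0.
future = rolling_ar_forecast([2.0, 4.0], [1.0, 0.5], steps=3)
```

Because later steps depend on earlier predictions rather than on measurements, any error in an early step propagates forward, which is consistent with the remark that error grows with the forecast horizon.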
[0071] In summary, the methods and systems disclosed in this invention can incorporate historical runoff data, meteorological data, and teleconnection factors such as circulation indices into runoff forecasting, enriching the input features of medium- and long-term runoff forecasting models. This helps reveal the objective laws and mechanisms of runoff formation and change, and provides a deeper analysis of runoff circulation mechanisms. In addition, a dimensionality reduction method for multi-dimensional feature variables in runoff forecasting is proposed: Pearson correlation coefficients and the extreme gradient boosting tree (XGBoost) are combined to evaluate the contribution of each feature variable to the forecast. This reduces the dimensionality of the model input while improving runoff forecast accuracy. Finally, an error correction framework that couples ensemble empirical mode decomposition with autoregression is proposed for the forecast residuals. By performing "mode decomposition-AR prediction-mode reconstruction" on the runoff forecast residuals, the resulting modal components reflect the signal trends and characteristics at different frequencies, making it easier to uncover the intrinsic laws of the runoff forecast residuals. Compared to commonly used error correction methods, this significantly improves runoff forecast accuracy. Therefore, this invention is clear in its approach, convenient to operate, and highly practical.
[0072] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for factor dimensionality reduction and error correction for runoff forecasting, characterized in that it includes: Step S1: Obtain the forecast factor data set for the target study area; Step S2: Calculate the correlation coefficient between each forecast factor and runoff data in the dataset using the Pearson correlation coefficient, and select a set of strongly correlated factors; Step S3: Use the extreme gradient boosting tree to evaluate the predictive importance contribution of each factor in the strongly correlated factor set, and sort the factors in descending order of importance; Step S4: Construct a runoff forecast model, use the strongly correlated factor set sorted in Step S3 as the model input, calculate the Nash coefficient and the root mean square error under 5 to 20 input dimensions, and determine the final input dimension and the runoff forecast values on the principle that a larger Nash coefficient and a smaller root mean square error indicate a better forecast; Step S5: Decompose the forecast residuals using ensemble empirical mode decomposition to obtain modal components and residual components; Step S6: Use an autoregressive model to predict each modal component and the residual component separately; Step S7: Superimpose and restore each modal component and the residual component of the forecast residual, and add them to the original predicted value to finally obtain the sample predicted value after error correction.
2. The method for factor dimensionality reduction and error correction for runoff forecasting according to claim 1, characterized in that the correlation coefficients between each of the strongly correlated factors in the set of factors and the runoff data are greater than or equal to 0.6.
3. The method for factor dimensionality reduction and error correction for runoff forecasting according to claim 1, characterized in that using the extreme gradient boosting tree to evaluate the predictive importance contribution of each factor in the set of strongly correlated factors includes: for each feature variable, the extreme gradient boosting tree calculates the average gain over the tree nodes and then normalizes these average gain values to obtain the relative importance score of the feature; the importance score is calculated as follows:
$$VIM_i = \frac{\overline{Gain}_i}{\sum_{j=1}^{N}\overline{Gain}_j}$$
In the formula, $VIM_i$ is the importance assessment value of the i-th feature variable, $\overline{Gain}_i$ is the average gain of the i-th feature variable, and N is the number of feature variables; the formula for calculating Gain is:
$$Gain = \frac{1}{2}\left[\frac{\left(\sum_{i\in I_L} g_i\right)^{2}}{\sum_{i\in I_L} h_i+\lambda}+\frac{\left(\sum_{i\in I_R} g_i\right)^{2}}{\sum_{i\in I_R} h_i+\lambda}-\frac{\left(\sum_{i\in I} g_i\right)^{2}}{\sum_{i\in I} h_i+\lambda}\right]-\gamma$$
In the formula, $I_L$ and $I_R$ represent the sample spaces of the left and right nodes after the split, respectively, $I = I_L \cup I_R$, and $g_i$ and $h_i$ are the first- and second-order derivatives of the loss function for sample i; the objective function $L^{(K)}$ of the extreme gradient boosting tree is as follows:
$$L^{(K)} = \sum_{i=1}^{v} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K}\Omega(f_k), \qquad \hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$
In the formula, l is the loss function, representing the difference between the predicted and actual values; $\Omega(f_k)$ is the regularization function; $x_i$ is the feature variable; $f_K(x_i)$ represents the prediction of the K-th tree for $x_i$; v is the number of training samples and K is the number of decision trees; λ and γ are hyperparameters of the regularization term.
4. The method for factor dimensionality reduction and error correction for runoff forecasting according to claim 1, characterized in that the process of acquiring the modal components and residual components includes: adding Gaussian white noise to the signal before applying the empirical mode decomposition (EMD) algorithm, and defining the ensemble average of the modal component IMFs from multiple decompositions as the final IMF, in order to effectively suppress mode mixing; the decomposition result is as follows:
$$x(t) = \sum_{j=1}^{s} imf_j(t) + c(t)$$
In the formula, t is time, s is the number of modes, x(t) is the original signal, c(t) is the residual component, and $imf_j(t)$ is the j-th modal component IMF obtained from the EMD decomposition; M EMD trials are performed, and the mean of the IMF components and of the residual components over all trials is the EEMD result; the calculation formulas are as follows:
$$\overline{imf}_j(t) = \frac{1}{M}\sum_{m=1}^{M} imf_{j,m}(t), \qquad \bar{c}(t) = \frac{1}{M}\sum_{m=1}^{M} c_m(t)$$
In the formula, M represents the total number of trials, $imf_{j,m}(t)$ and $c_m(t)$ are the j-th IMF component and the residual component obtained in the m-th trial, and $\overline{imf}_j(t)$ and $\bar{c}(t)$ are the j-th IMF component and the residual component obtained from the EEMD decomposition, respectively.
5. The method for factor dimensionality reduction and error correction for runoff forecasting according to claim 4, characterized in that the calculation formula of the autoregressive model is:
$$X_t = \varphi_0 + \sum_{i=1}^{p}\varphi_i X_{t-i} + \varepsilon_t$$
In the formula, $X_t$ is the time series data, $\varphi_0$ is a constant term, $\varphi_i$ is the autoregressive coefficient, p is the autoregressive order, and $\varepsilon_t$ is a white noise signal.
6. The method for factor dimensionality reduction and error correction for runoff forecasting according to any one of claims 1 to 5, characterized in that it also includes: when predicting future runoff data, using the forecast model to make rolling predictions of the future forecast values, and using an autoregressive model to make rolling predictions of the residuals, so as to obtain the corrected future forecast values.
7. A factor dimensionality reduction and error correction system for runoff forecasting, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the method described in any one of claims 1 to 6.