A water environment time series prediction model construction method based on data dynamic feature mining

By combining grey relational analysis and sparrow optimization algorithm with STL decomposition method and LSTM model, the problem of high complexity and low efficiency in water environment data prediction is solved, realizing efficient and accurate water environment time series prediction and supporting environmental management decision-making.

CN118349839B · Active · Publication Date: 2026-06-30 · HARBIN INST OF TECH +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HARBIN INST OF TECH
Filing Date: 2024-05-14
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing water environment data prediction models suffer from high complexity and low efficiency, especially in large watersheds, large sample sizes, and long time scales where accurate predictions are difficult to achieve.

Method used

A water environment time series prediction model based on dynamic feature mining of data is adopted. Relevant variables are screened through grey relational analysis. The parameters of the seasonal trend time series decomposition method STL with local weighted regression are optimized by combining the Coupled Sparrow Optimization Algorithm (SSA). An LSTM model is established for training and prediction is performed using the SSA-STL-LSTM model.

Benefits of technology

It improves the accuracy and robustness of water environment data prediction, adapts to seasonal fluctuations, enhances data mining efficiency, maintains a high level of prediction accuracy and stability, and supports the decision-making of environmental management departments.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for constructing a time-series prediction model for the water environment based on dynamic feature mining of data, belonging to the field of environmental engineering technology. It solves the problems of high complexity and low efficiency in existing methods for dynamic feature mining of water environment data. The invention identifies target variables for a target watershed, collects historical data of the target variables and related variables, extracts features from the historical data, and establishes an initial input dataset. It uses grey relational analysis to analyze the correlation between the features of the target variables and the features of related variables, filtering variables whose correlation with the target variables exceeds a threshold. It uses SSA to optimize the parameters of STL, and uses the optimized STL to decompose the historical data of the target variables. Finally, it establishes an LSTM model, trains the LSTM model using the decomposed data and variables whose correlation exceeds the threshold, and obtains the SSA-LSTM model. This invention is applicable to water environment prediction and dynamic feature mining.

Description

Technical Field

[0001] This invention belongs to the interdisciplinary field of environmental engineering, water environment big data, integrated watershed management, environmental system simulation and prediction technology and computer technology.

Background Technology

[0002] With economic development, human activities such as the irrational use of water resources have led to increasingly serious water pollution problems in river basins and a growing shortage of water resources. Water environment data is affected by seasonality and rainwater runoff, resulting in large fluctuations. Furthermore, due to external conditions and instrument limitations, data gaps and anomalies are common. Environmental quality prediction is the foundation of regional environmental management. It involves real-time monitoring of environmental parameter changes within a certain range, combined with the current ecological environment, pollutant migration characteristics, and pollution processes caused by socio-economic development, to analyze their impact on future environmental conditions and predict trends in environmental parameter changes. In recent decades, various regions have gradually emphasized the collection and organization of local environmental data. Many environmental quality monitoring stations have established comprehensive hydrological and meteorological databases, preserving monitoring data for decades. This provides a foundation for using data-driven methods to address the dynamic characteristics of water environment data. However, environmental quality prediction is still in its developmental stage. How to effectively utilize existing environmental monitoring data to achieve accurate environmental prediction is of great guiding significance for responding to environmental crises and realizing precise prevention and control of watershed pollution and scientific decision-making.

[0003] Currently, water environment data prediction models mainly fall into three categories: physical models, machine learning models, and deep learning models. Mechanism-based models, composed of a series of continuous and dynamic equations, can dynamically simulate the generation, migration, and transformation of pollutants in water bodies, offering high simulation accuracy. However, parameter calibration is difficult, and they cannot address spatial heterogeneity issues in large-scale watershed applications, limiting their application scale. Machine learning models, based on probability and statistics, do not consider the complex migration and transformation processes of pollutants within water bodies. Instead, they utilize big data and computational thinking to solve environmental problems, achieving high accuracy in small sample sizes. However, they can only reflect the mapping relationship between multivariate monitoring data and the water quality parameters to be measured. Deep learning models excel at using historical information to aid current decision-making, especially Long Short-Term Memory (LSTM) neural networks. Their unique neuronal structure gives them selective memory, fully considering long-term dependencies in time-series data, resulting in high prediction accuracy. They are suitable for water environment data prediction at large watershed scales, with large sample sizes and long time scales, and possess high nonlinear data processing capabilities. However, water environment data prediction often faces bottlenecks such as large monitoring data requirements and time-consuming and computationally intensive parameter optimization.

Summary of the Invention

[0004] This invention aims to address the problems of high complexity and low efficiency in existing methods for mining dynamic features of water environment data. It provides a method for constructing a water environment time-series prediction model based on dynamic feature mining of data.

[0005] The present invention addresses the problems of high complexity and low efficiency in existing water environment data prediction methods by providing a method for constructing a water environment time-series prediction model based on dynamic feature mining of data.

[0006] The present invention discloses a method for constructing a water environment time-series prediction model based on dynamic feature mining of data, comprising:

[0007] Step 1: Determine the target variables for the target watershed, collect historical data of the target variables and related variables from the water environment data of the target watershed, extract features from the historical data, clean and normalize the extracted feature data, and obtain the initial input dataset.

[0008] Step 2: Using the initial input dataset, perform grey relational analysis to analyze the correlation between the target variable and related variables, and select related variables whose correlation with the target variable exceeds a threshold as the final related variables;

[0009] Step 3: Use the sparrow search algorithm (SSA) to optimize the parameters of STL, the seasonal-trend decomposition method based on locally weighted regression (LOESS). Then use the optimized STL to decompose the historical data of the target variable into trend, seasonal, and remainder terms.

[0010] Step 4: Build an LSTM model. Use the historical data of the trend term, seasonal term, remainder term and the final related variable as the final input dataset. Use the SSA algorithm to train the LSTM model with the final input dataset to obtain the SSA-STL-LSTM model.

[0011] When predicting actual variables, the final input dataset corresponding to the historical data at the current moment is used as the input to the obtained SSA-STL-LSTM model to predict the target variable.

[0012] Furthermore, in this invention, the method for cleaning and normalizing the extracted feature data to obtain the initial input dataset in step one is as follows:

[0013] A random forest regression model is used to fill in missing values in the data, and the Winsorizing method is used to remove outliers and delete erroneous values from the data.

[0014] Furthermore, in this invention, in step two, the formula for using grey relational analysis to analyze the correlation between the features of the target variable and the features of related variables in the input dataset is as follows:

[0015]

[0016] In the formula, $X_i$ is the i-th related variable; $X_{ik}$ is the value of the i-th related variable $X_i$ at time k; n is the number of time points, each corresponding to one historical data point of the related variable, so n is also the number of historical data points of the related variable; $A$ is the target variable; $\bar{A}$ is the mean of the target variable A; $A_k$ is the value of the target variable at time k; $r_i$ is the correlation coefficient between parameter $X_i$ and parameter A; and $\bar{X}_i$ is the mean of the i-th related variable;

[0017]

[0018] In the formula, GRD is the grey relational degree value, $w_i$ is the weight of the i-th related parameter, and m is the number of related parameters.

[0019] Furthermore, in this invention, in step three, the formula for parameter optimization of the locally weighted regression seasonal trend time series decomposition method STL using the Coupled Sparrow Optimization Algorithm (SSA) is as follows:

[0020]

[0021] In the formula, t is the number of iterations; $x_{i,j}^{t}$ is the value of the i-th sparrow in the j-th dimension at the t-th iteration; $t_m$ is the preset maximum number of iterations; Q is a standard normally distributed random number; L is a 1×d matrix of all 1s; R2 is the safety threshold; and T0 is the alarm threshold.

[0022] Furthermore, in this invention, in step three, the formula for decomposing the historical data of the target variable into trend terms, seasonal terms, and residual terms using the parameter-optimized decomposition method STL is as follows:

[0023] $A_t = T_t + S_t + R_t$

[0024] In the formula, $A_t$ is the data value of the target parameter A at time t, $T_t$ is the trend value at time t, $S_t$ is the seasonal term value at time t, and $R_t$ is the remainder value at time t.

[0025] This invention discloses a method for constructing a water environment time-series prediction model based on dynamic feature mining of data, and a method for water environment prediction. Through data preprocessing, data decomposition, feature selection, model tuning, and model validation, a machine learning water quality prediction model is constructed. It uses STL to interpret the trend and seasonality of environmental dynamic characteristics and uses the sparrow search algorithm (SSA) to optimize the parameters of the LSTM, achieving efficient simulation of the dynamic features of water environment data; it is suitable for simulating water environment data with seasonal fluctuations. It maintains a high level of prediction accuracy and robustness while improving the efficiency of dynamic feature mining for water quality time series.

Attached Figure Description

[0026] Figure 1 is a flowchart of the method described in this invention;

[0027] Figure 2a is a trend term plot of total nitrogen after SSA-STL decomposition;

[0028] Figure 2b is a seasonal term plot of total nitrogen after SSA-STL decomposition;

[0029] Figure 2c is a remainder term plot of total nitrogen after SSA-STL decomposition;

[0030] Figure 2d is a time series plot of total nitrogen before decomposition;

[0031] Figure 3 shows the MSE loss curve of the SSA-LSTM model;

[0032] Figure 4 is a comparison chart of the actual total nitrogen value, the total nitrogen value predicted by LSTM, and the total nitrogen value predicted by STL-SSA-LSTM.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of the present invention can be combined with each other.

[0034] Specific Implementation Method 1: Referring to Figure 1, this embodiment describes a method for constructing a water environment time-series prediction model based on dynamic feature mining of data, including:

[0035] Step 1: Determine the target variables for the target watershed, collect historical data of the target variables and related variables from the water environment data of the target watershed, extract features from the historical data, clean and normalize the extracted feature data, and obtain the initial input dataset.

[0036] Step 2: Using the initial input dataset, perform grey relational analysis to analyze the correlation between the target variable and related variables, and select related variables whose correlation with the target variable exceeds a threshold as the final related variables;

[0037] Step 3: Use the sparrow search algorithm (SSA) to optimize the parameters of STL, the seasonal-trend decomposition method based on locally weighted regression (LOESS). Then use the optimized STL to decompose the historical data of the target variable into trend, seasonal, and remainder terms.

[0038] Step 4: Build an LSTM model. Use the historical data of the trend term, seasonal term, remainder term and the final related variable as the final input dataset. Use the SSA algorithm to train the LSTM model with the final input dataset to obtain the SSA-STL-LSTM model.

[0039] When predicting actual variables, the final input dataset corresponding to the historical data at the current moment is used as the input to the obtained SSA-STL-LSTM model to predict the target variable.

[0040] Furthermore, in this invention, the method for cleaning and normalizing the extracted feature data to obtain the initial input dataset in step one is as follows:

[0041] A random forest regression model is used to fill in missing values in the data, and the Winsorizing method is used to remove outliers and delete erroneous values from the data.

[0042] Furthermore, in this invention, in step two, the formula for using grey relational analysis to analyze the correlation between the features of the target variable and the features of related variables in the input dataset is as follows:

[0043]

[0044] In the formula, $X_i$ is the i-th related variable; $X_{ik}$ is the value of the i-th related variable $X_i$ at time k; n is the number of time points, each corresponding to one historical data point of the related variable, so n is also the number of historical data points of the related variable; $A$ is the target variable; $\bar{A}$ is the mean of the target variable A; $A_k$ is the value of the target variable at time k; $r_i$ is the correlation coefficient between parameter $X_i$ and parameter A; and $\bar{X}_i$ is the mean of the i-th related variable;

[0045]

[0046] In the formula, GRD is the grey relational degree value, $w_i$ is the weight of the i-th related parameter, and m is the number of related parameters.
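The screening idea in step two can be illustrated with a minimal Python sketch. The formulas of this step are not reproduced in the text above, so the sketch assumes the textbook (Deng) grey relational grade with a distinguishing coefficient `rho = 0.5` and mean normalization; both are assumptions and may differ from the formula actually claimed. The variable names TN, PRE and WT follow the example later in this description, the series values are hypothetical, and the 0.6 threshold is the one quoted in that example.

```python
def mean_normalize(series):
    """Divide each value by the series mean (assumes a non-zero mean)."""
    mean = sum(series) / len(series)
    return [v / mean for v in series]


def grey_relational_grades(target, candidates, rho=0.5):
    """Textbook grey relational grade of each candidate against the target.

    rho is the distinguishing coefficient (an assumed, conventional value).
    """
    ref = mean_normalize(target)
    deltas = {
        name: [abs(r - x) for r, x in zip(ref, mean_normalize(series))]
        for name, series in candidates.items()
    }
    flat = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0.0:  # every candidate matches the target exactly
        return {name: 1.0 for name in deltas}
    return {
        name: sum((d_min + rho * d_max) / (d + rho * d_max) for d in ds) / len(ds)
        for name, ds in deltas.items()
    }


# Hypothetical series: total nitrogen (target), precipitation, water temperature.
tn = [1.0, 2.0, 3.0, 4.0]
candidates = {"PRE": [2.0, 4.0, 6.0, 8.0], "WT": [4.0, 1.0, 3.0, 2.0]}

grades = grey_relational_grades(tn, candidates)
selected = [name for name, grade in grades.items() if grade > 0.6]
```

A candidate that is proportional to the target receives the maximum grade, while a candidate with a different shape falls below the threshold and is dropped from `selected`.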

[0047] Furthermore, in this invention, in step three, the formula for parameter optimization of the locally weighted regression seasonal trend time series decomposition method STL using the Coupled Sparrow Optimization Algorithm (SSA) is as follows:

[0048]

[0049] In the formula, t is the number of iterations; $x_{i,j}^{t}$ is the value of the i-th sparrow in the j-th dimension at the t-th iteration; $t_m$ is the preset maximum number of iterations; Q is a standard normally distributed random number; L is a 1×d matrix of all 1s; R2 is the safety threshold; and T0 is the alarm threshold.

[0050] In this implementation, based on the target variable, the initial SSA algorithm is configured with parameters such as sparrow population size, number of iterations, warning value, observer probability, warning probability, and random learning rate. The fitness function of the STL algorithm is determined, and the trend window width and seasonal component window width ranges for SSA optimization of STL are preset. Through multiple iterations, the trend window width and seasonal component window corresponding to the optimal fitness are obtained, and finally, the original target variable is decomposed.
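The search over the two STL window widths described above can be sketched in Python as follows. This is a simplified illustration, not the patented procedure: the producer update follows the commonly published sparrow search rule (shrink the position when a random alarm value is below a fixed threshold, otherwise take a normally distributed step), the follower step is a simplified move around the current leader, and the danger-awareness step is omitted. The function names, the bounds, the population settings and the fitness function are all hypothetical; the text does not state which STL fitness measure is used, so a toy stand-in is used here.

```python
import math
import random


def to_odd(value):
    """Round a real value to the nearest odd integer (window widths are odd)."""
    return 2 * round((value - 1) / 2) + 1


def sparrow_search(fitness, bounds, n_sparrows=20, t_max=50,
                   producer_frac=0.2, threshold=0.8, seed=0):
    """Simplified sparrow-search-style minimizer with elitism."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_sparrows)]
    best = min(pop, key=fitness)[:]
    history = [fitness(best)]
    n_prod = max(1, int(producer_frac * n_sparrows))
    for _ in range(t_max):
        pop.sort(key=fitness)          # best sparrows act as producers
        leader = pop[0]
        alarm_value = rng.random()     # drawn once per iteration
        for i, x in enumerate(pop):
            for j, (lo, hi) in enumerate(bounds):
                if i < n_prod:
                    if alarm_value < threshold:
                        alpha = rng.random() or 1e-12
                        x[j] *= math.exp(-(i + 1) / (alpha * t_max))
                    else:
                        x[j] += rng.gauss(0.0, 1.0)  # Q * L with L all ones
                else:
                    # Simplified follower move around the leader (assumption).
                    x[j] = leader[j] + abs(x[j] - leader[j]) * rng.uniform(-1.0, 1.0)
                x[j] = min(max(x[j], lo), hi)        # keep inside the preset range
        candidate = min(pop, key=fitness)
        if fitness(candidate) < history[-1]:
            best = candidate[:]
        history.append(fitness(best))
    return best, history


def toy_fitness(position):
    """Stand-in for a real STL fitness; replace with a measure of decomposition quality."""
    trend, seasonal = (to_odd(v) for v in position)
    return (trend - 13) ** 2 + (seasonal - 7) ** 2


bounds = [(3, 25), (3, 25)]  # hypothetical ranges for trend and seasonal windows
best, history = sparrow_search(toy_fitness, bounds)
trend_window, seasonal_window = (to_odd(v) for v in best)
```

In practice `toy_fitness` would be replaced by a function that runs the STL decomposition with the decoded window widths and scores the result; because the best position is retained across iterations, `history` never increases.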

[0051] Furthermore, in this invention, in step three, the formula for decomposing the historical data of the target variable into trend terms, seasonal terms, and residual terms using the parameter-optimized decomposition method STL is as follows:

[0052] $A_t = T_t + S_t + R_t$

[0053] In the formula, $A_t$ is the data value of the target parameter A at time t, $T_t$ is the trend value at time t, $S_t$ is the seasonal term value at time t, and $R_t$ is the remainder value at time t.

[0054] This invention also selects the Sparrow Optimization Algorithm (SSA) to achieve rapid optimization of LSTM model parameters, thereby improving both performance and efficiency while preserving the accuracy and stability of the LSTM model.

[0055] In this invention, the SSA-STL-LSTM model includes multiple units, which sequentially perform signal processing and transmission. Each unit includes a forget gate, an input gate, an output gate, and a state update module.

[0056] The forget gate determines the amount of the cell state from the previous time step that is retained in the cell state information at the current time step through the activation function;

[0057] The activation function:

[0058]

[0059] $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

[0060] In the formula, $W_f$ is the weight of the forget gate; $b_f$ is the bias of the forget gate; $[h_{t-1}, x_t]$ represents the concatenation of the hidden state from the previous time step and the input from the current time step; $\sigma(x)$ is the sigmoid function; and $f_t$ indicates the open/closed state of the forget gate;

[0061] The input gate determines how much of the input $x_t$ at the current time step is retained in the cell state $C_t$ at the current time step, and candidate cell states are generated via the tanh function:

[0062]

[0063]

[0064] In the formula, $W_i$ is the weight of the input gate; $b_i$ is the bias of the input gate; $i_t$ indicates the open/closed state of the input gate; $\tilde{C}_t$ is the candidate cell state; $b_c$ is its bias term; and $W_c$ is its weight matrix;

[0065] The output gate determines the output information: its value is multiplied element-wise with the cell state passed through the tanh layer to obtain the final output of the unit;

[0066]

[0067] In the formula, $W_o$ is the weight of the output gate; $b_o$ is the bias of the output gate; and $o_t$ indicates the open/closed state of the output gate;

[0068] The state update module updates the cell state $C_t$ and the hidden state $h_t$:

[0069]

[0070] $h_t = o_t \cdot \tanh(C_t)$

[0071] where $i_t$ represents the current open/closed state of the input gate.
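The gate structure described above can be sketched as one time step of an LSTM unit in plain Python. Only the forget gate and hidden state formulas are reproduced in the text; the input gate, candidate state, output gate and cell state update used below are the standard LSTM forms and are assumed to match the missing formulas. The parameter dictionary layout, the helper names and the all-zero demonstration weights are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, v, b):
    """Return W·v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x_t                                            # [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]   # forget gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]   # input gate
    c_hat = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # candidate state
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]   # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Demonstration with hidden size 2, input size 3 and all-zero parameters.
hidden, inputs = 2, 3
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zeros for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})

h_t, c_t = lstm_step([0.1, 0.2, 0.3], [0.0, 0.0], [1.0, -1.0], params)
```

With all-zero weights every gate is half open and the candidate state is zero, so the step simply halves the previous cell state; trained weights would instead let the gates depend on the input and the previous hidden state.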

[0072] One of the challenges in mining dynamic features of water environment data lies in the dynamic changes of this data. Water environment data may exhibit trend and seasonal changes due to climate change, hydrological changes, natural periodic changes, human activities, and so on. Therefore, identifying and mining dynamic features is of great significance when coupling time series models to predict the future environment. Currently, the following methods exist for dynamic feature decomposition of water environment data. STL (Seasonal-Trend decomposition using LOESS) is a filtering process that decomposes time series data into trend, seasonal, and residual terms. It performs well for seasonal and trend-based decomposition, is robust against noise and outliers, and computes quickly; however, its performance may be poor when seasonal variations are complex or nonlinear. Singular Spectrum Analysis (a decomposition method that shares the abbreviation SSA with the sparrow algorithm but is unrelated to it) can effectively adapt to time series data of different types and complexities, but parameter selection has a significant impact on the decomposition results and is difficult to tune; it also performs poorly when decomposing long-term trends. VMD (Variational Mode Decomposition) performs well for decomposing nonlinear and non-stationary signals and offers data adaptability and noise immunity, but it suffers from high computational complexity and low time efficiency when processing long-term and large-scale data, and it also requires a suitable mode function to be determined.

[0073] This invention proposes using the STL method to decompose the time series of the target variable into trend and seasonal terms. This technique not only helps to better understand the long-term trend and seasonality of the target variable's changes, but also has a fast computation speed. By decomposing the target variable, the predictive model's understanding of water quality changes is improved, providing a more accurate and reliable foundation for water quality prediction.

[0074] This invention utilizes an LSTM model and couples it with the SSA optimization algorithm to improve the accuracy and precision of the prediction model. This combination fully leverages the powerful learning ability of LSTM on time series data and optimizes the model parameters through the SSA optimization algorithm, enabling the model to better capture the complex features of the target variable's time series, thereby improving the accuracy and reliability of the prediction.

[0075] This invention proposes a model with good robustness, generalization ability, and stability. Through targeted training and validation, the model has been thoroughly optimized to ensure high efficiency and stable performance under different environmental conditions and data variations. This enables the model to effectively handle data noise and anomalies, maintaining a high level of predictive accuracy and robustness. The predictive model is not merely a forecasting tool, but also provides strong support for environmental water quality early warning and management. By accurately predicting the target variable, it provides important decision-making references for environmental management departments, enabling them to monitor and respond to water quality changes more promptly and formulate corresponding environmental protection strategies. This technological innovation is of great significance to water resource management and environmental protection, making a positive contribution to the development of environmental protection.

[0076] The following is an example of the construction of a prediction model for total nitrogen concentration at a river section in City B, Province A, in the Yellow River Basin. The specific implementation process is as follows:

[0077] (1) Data acquisition and preprocessing

[0078] Based on various statistical yearbooks, environmental bulletins, and publicly available data platforms, and considering various natural and anthropogenic factors affecting total nitrogen concentration in rivers, watershed environmental characteristic data were selected from aspects such as land use, normalized difference vegetation index, socio-economic factors, and climate and meteorology to construct a water quality prediction index system. After identifying missing values, a random forest regression model was used to fill in the missing values. Outliers were removed using the Winsorizing method, and high and low outliers were replaced with the 97.5% quantile and 2.5% quantile, respectively. The cleaned data was then organized according to the input data format requirements of the LSTM model to generate a dataset.
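The outlier treatment described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the quantiles are computed by linear interpolation, min-max scaling stands in for the unspecified normalization, the helper names and the sample values are hypothetical, and the random forest imputation step is left out of the sketch.

```python
import math


def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def winsorize(values, lower_q=0.025, upper_q=0.975):
    """Replace values beyond the 2.5% / 97.5% quantiles with those quantiles."""
    low, high = quantile(values, lower_q), quantile(values, upper_q)
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Scale values to [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical total nitrogen readings with one high and one low outlier.
tn = [1.2, 1.4, 1.3, 9.9, 1.5, 1.1, 0.01, 1.6, 1.4, 1.3]
clean = winsorize(tn)
scaled = min_max_normalize(clean)
```

Winsorizing keeps every time step in the series, which matters for the later decomposition and windowing steps because no gaps are introduced.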

[0079] (2) The target variable, total nitrogen, was processed using STL.

[0080] In Python, the `statsmodels.tsa` time series analysis module of the statsmodels library was used to perform STL decomposition on the total nitrogen concentration in this case. STL decomposition requires two smoothing parameters: the window widths for the trend and seasonal components. This case extracts the long-term trend and seasonal patterns of total nitrogen concentration from fluctuating observation data to aid in the assessment of total nitrogen control policies. Based on the diagnostic graphical method, the trend and seasonal window widths were determined to be 9 months and 11 months, respectively, as shown in Figure 2.

[0081] (3) Correlation of parameters and dataset construction

[0082] Grey relational analysis was used to calculate the correlation coefficients between the target variable, total nitrogen concentration, and other environmental parameters. Environmental parameters with high correlations were identified, such as cultivated land TDA, forest land TDB, grassland TDC, water area TDD, urban and rural industrial and residential land TDE, normalized difference vegetation index (NDVI), population (POP), gross domestic product (GDP), temperature (TMP), precipitation (PRE), water temperature (WT), and pH. Features with correlation coefficients greater than 0.6 were selected. The corresponding dates, the six environmental parameters, and the three environmental features (trend, seasonal, and residual items) obtained from STL decomposition were used as the input dataset for an LSTM model, with TN concentration as the output.

[0083] Table 2. Correlation between total nitrogen and other environmental parameters

[0084]
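The dataset construction in step (3) can be sketched as a sliding-window transformation in Python. The text states that the six selected environmental parameters and the three STL components form the inputs and that TN concentration is the output; the look-back length, the function name and the synthetic values below are assumptions made for illustration only.

```python
def make_windows(features, target, lookback):
    """Build LSTM samples from per-time-step feature rows.

    Each sample holds `lookback` consecutive feature rows; its label is the
    target value at the time step immediately after the window.
    """
    X, y = [], []
    for end in range(lookback, len(features)):
        X.append(features[end - lookback:end])
        y.append(target[end])
    return X, y


# Synthetic data: 10 time steps, 9 features per step
# (6 environmental parameters + trend, seasonal and remainder terms).
n_steps, lookback = 10, 3
features = [[float(t)] * 9 for t in range(n_steps)]
tn = [0.1 * t for t in range(n_steps)]

X, y = make_windows(features, tn, lookback)
```

Each element of `X` then has the (time steps, features) layout that sequence models generally expect for one sample.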

[0085] (4) Establish an SSA-LSTM prediction model

[0086] The dataset was divided into training and test sets in a ratio of 0.7:0.3. The training set data was used as input, and the Sparrow Search Algorithm (SSA) was used to optimize the parameters of the LSTM neural network, such as the learning rate, number of training iterations, number of layers, and the list of neuron counts per layer. The optimal LSTM model parameters were determined through optimization, and the specific parameter optimization range and results are shown in Table 3.

[0087] Table 3 SSA parameter tuning range and optimal parameter values

[0088]

[0089]
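The 0.7:0.3 division mentioned in step (4) can be sketched in Python as follows. The text does not say whether the split is chronological or random; the sketch assumes a chronological split, which is a common choice for time series because it keeps the test period strictly after the training period. The function name and the sample list are hypothetical.

```python
def chronological_split(samples, train_ratio=0.7):
    """Split time-ordered samples into an earlier training part and a later test part."""
    cut = round(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


samples = list(range(10))  # stand-in for time-ordered (window, label) pairs
train, test = chronological_split(samples)
```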

[0090] The LSTM model was trained using the optimal parameter combination shown in Table 3, and the MSE loss curves over the training iterations on the training and test sets were obtained, as shown in Figure 3.

[0091] The SSA-LSTM model was evaluated on the test set using the coefficient of determination ($r^2$), mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE). Table 4 shows that the SSA-LSTM model predicts total nitrogen concentration with high accuracy, with the test set $r^2$ reaching 0.8489.

[0092] Table 4 Evaluation of the total nitrogen concentration prediction model based on SSA-LSTM

[0093]
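The four evaluation metrics named above can be computed as in the following Python sketch. It assumes the coefficient of determination is defined as one minus the ratio of the residual sum of squares to the total sum of squares; the function name and the sample values are hypothetical and are not the results reported for this case.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return r2, MSE, RMSE and MAE for two equal-length lists of values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # assumes y_true is not constant
    r2 = 1.0 - (mse * n) / ss_tot
    return {"r2": r2, "mse": mse, "rmse": math.sqrt(mse), "mae": mae}


# Hypothetical measured and predicted total nitrogen values.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
metrics = regression_metrics(measured, predicted)
```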

[0094] (5) Water quality prediction

[0095] The prediction data are fed into the SSA-LSTM water quality prediction model built with the optimal parameter combination from step (4) to predict the total nitrogen concentration. The distribution of the model's simulated and measured total nitrogen concentrations is shown in Figure 4: the simulated values are generally distributed around the measured values, indicating that the SSA-LSTM model based on STL decomposition simulates river total nitrogen concentration stably.

[0096] While the invention has been described herein with reference to specific embodiments, it should be understood that these embodiments are merely examples of the principles and applications of the invention. Therefore, it should be understood that many modifications can be made to the exemplary embodiments, and other arrangements can be designed without departing from the spirit and scope of the invention as defined by the appended claims. It should be understood that different dependent claims and features described herein can be combined in ways different from those described in the original claims. It is also understood that features described in conjunction with individual embodiments can be used in other described embodiments.

Claims

1. A method for constructing a water environment time-series prediction model based on dynamic feature mining of data, characterized in that it includes: Step 1: Determine the target variables for the target watershed, collect historical data of the target variables and related variables from the water environment data of the target watershed, extract features from the historical data, clean and normalize the extracted feature data, and obtain the initial input dataset; Step 2: Using the initial input dataset, perform grey relational analysis to analyze the correlation between the target variable and related variables, and select related variables whose correlation with the target variable exceeds a threshold as the final related variables; Step 3: Use the sparrow search algorithm (SSA) to optimize the parameters of STL, the seasonal-trend decomposition method based on locally weighted regression, then use the optimized STL to decompose the historical data of the target variable into trend, seasonal, and remainder terms; Step 4: Build an LSTM model, use the historical data of the trend term, seasonal term, remainder term and the final related variables as the final input dataset, and use the SSA algorithm to train the LSTM model with the final input dataset to obtain the SSA-STL-LSTM model. When predicting actual variables, the final input dataset corresponding to the historical data at the current moment is used as the input to the obtained SSA-STL-LSTM model to predict the target variable; In step three, the formula for parameter optimization of STL using the SSA is as follows: In the formula, t is the number of iterations; $x_{i,j}^{t}$ is the value of the i-th sparrow in the j-th dimension at the t-th iteration; $t_m$ is the preset maximum number of iterations; Q is a standard normally distributed random number; L is a 1×d matrix consisting entirely of 1s; R2 is the safety threshold; and T0 is the alarm threshold. In step three, the formula for decomposing the historical data of the target variable into trend, seasonal, and remainder terms using the STL decomposition method after parameter optimization is as follows: $A_t = T_t + S_t + R_t$. In the formula, $A_t$ is the data value of the target parameter A at time t, $T_t$ is the trend value at time t, $S_t$ is the seasonal term value at time t, and $R_t$ is the remainder value at time t.

2. The method for constructing a water environment time-series prediction model based on dynamic feature mining of data according to claim 1, characterized in that, in step one, the extracted feature data is cleaned and normalized to obtain the initial input dataset as follows: a random forest regression model is used to fill in missing values in the data, outliers are removed by winsorizing (tail trimming), and erroneous values in the data are uniformly deleted.

3. A method for constructing a water environment time-series prediction model based on dynamic feature mining of data, as described in claim 1 or 2, characterized in that, in step two, the formula for using grey relational analysis to analyze the correlation between the features of the target variable and the features of related variables in the input dataset is as follows: In the formula, $X_i$ is the i-th related variable parameter; $X_{ik}$ is the value of the i-th related parameter $X_i$ at time k; n is the number of time points, each corresponding to one historical data point of the related variable; A is the target variable; $\bar{A}$ is the mean of the target variable A; $A_k$ is the value of the target variable at time k; $r_i$ is the correlation coefficient between parameter $X_i$ and parameter A; and $\bar{X}_i$ is the mean of the i-th related variable. In the formula, GRD is the grey relational degree value; $w_i$ is the weight of the i-th related parameter; and m is the number of related parameters.