A daily precipitation grade classification method based on GA-XGBoost

A GA-XGBoost model is constructed by balancing the dataset with the SMOTE method and optimizing hyperparameters with a genetic algorithm. This addresses the problems of insufficient data and the limitations of manual parameter setting in daily precipitation level classification, and improves prediction accuracy and model robustness.

CN116738322B · Active · Publication Date: 2026-06-23 · SHENYANG UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENYANG UNIVERSITY OF TECHNOLOGY
Filing Date
2023-07-13
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies for classifying daily precipitation levels suffer from insufficient mining of the temporal characteristics of precipitation data and limitations in manually setting network parameters, resulting in low prediction accuracy. Furthermore, traditional methods struggle to handle complex, multidimensional meteorological data.

Method used

The dataset was balanced using the SMOTE method, an XGBoost model was built, and its hyperparameters were optimized using a genetic algorithm to construct a GA-XGBoost model for daily precipitation classification.

Benefits of technology

The accuracy of daily precipitation classification and the robustness of the model were improved. Compared with RF, GBDT, and XGBoost, the accuracy of the GA-XGBoost model improved by approximately 32.83%, 18.18%, and 10.14% respectively, and its F1 score improved by approximately 16.54%, 8.98%, and 5.34% respectively.

Abstract

The application belongs to the technical field of deep learning and meteorological prediction, and particularly relates to a daily precipitation grade classification method based on GA-XGBoost. The method realizes autonomous learning of the time-series characteristics of precipitation, reduces the non-stationarity of precipitation data, and accurately classifies and predicts daily precipitation. The method comprises the following steps: preprocessing the original precipitation data, including data screening, data cleaning, data classification, and balancing the dataset with the SMOTE method; establishing an XGBoost model and initializing its hyperparameters; using a genetic algorithm to optimize the network parameters; and inputting each subsequence into the model for training and prediction, then comparing and analyzing the results.
Description

Technical Field

[0001] This invention belongs to the field of deep learning and meteorological forecasting technology, and in particular relates to a daily precipitation level classification method based on GA-XGBoost.

Background Technology

[0002] Precipitation is a highly complex nonlinear process characterized by randomness, suddenness, and locality. It is also easily affected by factors such as wind speed, temperature, air pressure, and terrain, resulting in low prediction accuracy. Accurate precipitation classification is of great significance for water resource utilization and storage, as well as urban construction.

[0003] Because the meteorological conditions for precipitation are highly complex and easily influenced by factors such as topography and altitude, precipitation forecasting is very difficult, and the accuracy of forecasts has always been a focus of researchers. Currently, with the rapid development of artificial intelligence technology, deep learning algorithms, which can efficiently process the feature information of massive datasets, learn the interaction patterns between meteorological elements, and more accurately describe the nonlinear changes in precipitation, have been widely applied by researchers to precipitation classification and forecasting.

[0004] In the field of precipitation forecasting, traditional forecasting methods mostly rely on mathematical models to describe meteorological changes and are therefore strongly formulaic and procedural. They fail to consider the nonlinear evolution of precipitation, lack good generalization ability, and struggle to achieve ideal forecasting results. Furthermore, traditional forecasting methods are significantly inadequate for processing massive, complex, and multidimensional meteorological data. Therefore, using artificial intelligence technology to assist or replace traditional forecasting methods has become an inevitable trend.

Summary of the Invention

[0005] To address the limitations of current methods for classifying daily precipitation levels, such as the one-sidedness and insufficiency in mining the temporal characteristics of precipitation data and the constraints of manually setting network parameters, this invention provides a GA-XGBoost-based method for daily precipitation level classification. This method enables the autonomous learning of the temporal characteristics of precipitation, reduces the instability of precipitation data, and accurately classifies and predicts daily precipitation.

[0006] This invention focuses on daily precipitation. First, the `smote` method is used to oversample the training set of precipitation data, increasing the number of minority class samples and balancing the dataset. The test set is not oversampled. Second, an XGBoost model is established, and its hyperparameters are initialized. Finally, because genetic algorithms have a mutation mechanism, they can reduce the risk of the model getting trapped in local optima during training. Furthermore, the randomness of parameter updates makes them more robust and suitable for more complex solution processes; therefore, a genetic algorithm is used to optimize the XGBoost hyperparameters. Thus, a daily precipitation classification and prediction model based on GA-XGBoost is established, and experiments are conducted to compare it with RF, GBDT, and XGBoost, summarizing the model's advantages and disadvantages.

[0007] To achieve the above objectives, the present invention adopts the following technical solution, characterized in that it includes:

[0008] Preprocess the raw precipitation data, including data filtering, data cleaning, data classification, and balancing of the dataset using the SMOTE method.

[0009] Build the XGBoost model and initialize the hyperparameters.

[0010] Genetic algorithms are used to optimize network parameters, and each subsequence is input into the model for training and prediction. The results are then compared and analyzed.

[0011] Furthermore, the data filtering includes: obtaining the contribution of each feature in the precipitation data to the precipitation amount, and discarding the features with the lowest contribution ranking.

[0012] Furthermore, the data cleaning includes: for features with missing values in the data, searching for the missing values column by column during processing and replacing them with the column average using the fillna function and mean method.

[0013] Furthermore, the data classification includes dividing precipitation into seven categories: no rain, light rain, moderate rain, heavy rain, rainstorm, heavy rainstorm, and extremely heavy rainstorm (the classification criteria are shown in Table 1). Because the dataset used is at a daily scale, it is classified and labeled according to the 24-hour time period classification criteria, with the label categories being: 0, 1, 2, 3, 4, 5, and 6.

[0014] Table 1. Precipitation Level Classification Standards (mm)

[0015]

[0016] Furthermore, the smote method for balancing the dataset includes:

[0017] Because the numbers of samples in the different precipitation classes are imbalanced, the model pays less attention to the minority classes, which affects model performance. The SMOTE oversampling method is therefore used to increase the number of minority class samples, importing the SMOTE function library to oversample the training set.

[0018] Furthermore, the step of using a genetic algorithm to optimize network parameters, inputting each subsequence into the model for training and prediction, and comparing and analyzing the results includes:

[0019] Constructing a BiLSTM network model: Initialize the network parameters of the BiLSTM model, including the number of neurons in the forward and backward layers, the learning rate, and the number of neurons in the fully connected layer; set the network structure, including the input layer, BiLSTM layer, tiling layer, and fully connected layer.

[0020] 1) In the genetic algorithm, the chromosome encoding method is defined, the population is initialized, the fitness function is set, and the genetic operations (including selection, crossover, and mutation) are performed. In the genetic algorithm optimization process, the parameters to be optimized include: number of iterations, maximum tree depth, L1 regularization, L2 regularization, model complexity penalty, and the percentage of training samples selected in each iteration. The parameter optimization space is defined, the genetic population is set, and the updated population is used to initialize the XGBoost model in each iteration.

[0021] 2) Input the divided training set from the precipitation data into the XGBoost model for training. Use the training accuracy as the fitness function value of the genetic algorithm, compare the individual fitness values, update and save the best individual.

[0022] 3) Input the pre-divided test set from the precipitation data into the GA-XGBoost model for prediction, and output the daily precipitation level classification results.
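The optimization loop in steps 1) to 3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the search-space bounds are hypothetical (the values of Table 2 are not reproduced here), the mapping of the six listed parameters onto the names `n_estimators`, `max_depth`, `reg_alpha`, `reg_lambda`, `gamma`, and `subsample` is an assumption, and `toy_fitness` is a stand-in for "train XGBoost with these parameters and return the training accuracy". For brevity the sketch uses binary tournament selection, uniform crossover, and random-reset mutation rather than the SBX and polynomial mutation operators described in the detailed implementation.

```python
import random

# Hypothetical search space; the real ranges would come from Table 2.
SEARCH_SPACE = {
    "n_estimators": (50, 500),
    "max_depth": (3, 10),
    "reg_alpha": (0.0, 1.0),
    "reg_lambda": (0.0, 1.0),
    "gamma": (0.0, 1.0),
    "subsample": (0.5, 1.0),
}
INTEGER_PARAMS = {"n_estimators", "max_depth"}


def random_individual(rng):
    """One chromosome: a real-valued gene per hyperparameter."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in SEARCH_SPACE.items()}


def decode(individual):
    """Round integer-valued genes so the result can be passed to a model."""
    return {k: int(round(v)) if k in INTEGER_PARAMS else v
            for k, v in individual.items()}


def toy_fitness(params):
    """Stand-in fitness (assumption). A real run would train the model
    with `params` and return its training accuracy instead."""
    return -abs(params["max_depth"] - 6) - abs(params["subsample"] - 0.8)


def evolve(fitness_fn, pop_size=20, generations=15, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    best = max(population, key=lambda ind: fitness_fn(decode(ind)))
    for _ in range(generations):
        scores = [fitness_fn(decode(ind)) for ind in population]

        def pick():
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(range(pop_size), 2)
            return population[a] if scores[a] >= scores[b] else population[b]

        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            # Uniform crossover: each gene is copied from one parent at random.
            child = {k: rng.choice((p1[k], p2[k])) for k in SEARCH_SPACE}
            # Random-reset mutation: occasionally resample a gene in its range.
            for k, (lo, hi) in SEARCH_SPACE.items():
                if rng.random() < mutation_rate:
                    child[k] = rng.uniform(lo, hi)
            children.append(child)
        population = children
        # Update and save the best individual seen so far.
        gen_best = max(population, key=lambda ind: fitness_fn(decode(ind)))
        if fitness_fn(decode(gen_best)) > fitness_fn(decode(best)):
            best = gen_best
    return decode(best)


best_params = evolve(toy_fitness)
```

The returned dictionary would then be used to initialize the final model before predicting on the held-out test set.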

[0023] Compared with the prior art, the present invention has the following advantages.

[0024] This invention has significant academic value and practical application reference value for the research direction of daily precipitation classification and prediction. It addresses the problem that the lack of stationarity in daily precipitation data leads to insufficient extraction of temporal features by the model, and the fact that the classic XGBoost model requires manual hyperparameter setting, which is inefficient and cumbersome. This method uses daily precipitation data from the Chaoyang District Observatory in Chaoyang City, Liaoning Province as the research object. First, the smote method is used to oversample the training set of precipitation data to increase the number of minority class samples and balance the dataset; the test set is not oversampled. Then, an XGBoost model is established, and the hyperparameters are initialized. Finally, a genetic algorithm is combined to optimize the hyperparameters, establishing a daily precipitation level classification model based on GA-XGBoost. Comparative analysis with XGBoost, RF, and GBDT shows that the genetic algorithm can obtain more suitable network parameters, improve prediction accuracy, enhance model robustness, verify the effectiveness and feasibility of the genetic algorithm in optimizing XGBoost, and improve the model's classification prediction accuracy.

[0025] The GA-XGBoost model presented in this invention achieves a higher accuracy of 0.793 compared to other models, representing improvements of approximately 32.83%, 18.18%, and 10.14% over RF, GBDT, and XGBoost, respectively. Its F1 score reaches 0.789, representing improvements of approximately 16.54%, 8.98%, and 5.34% over the other three. The GA-XGBoost model demonstrates that it improves recall without significantly reducing precision, indicating that this invention can enhance classification accuracy and provides a new approach for daily precipitation classification and prediction.

Attached Figure Description

[0026] The present invention will be further described below with reference to the accompanying drawings and specific embodiments. The scope of protection of the present invention is not limited to the following description.

[0027] Figure 1 shows the feature contribution ranking of XGBoost.

[0028] Figure 2 shows the feature contribution ranking of RF.

[0029] Figure 3 shows the feature contribution ranking of GBDT.

[0030] Figure 4 shows the confusion matrix of the model classification results for the Chaoyang District Observation Station in Chaoyang City.

Detailed Implementation

[0031] The present invention will be further described in detail below with reference to the embodiments.

[0032] The daily precipitation classification method based on GA-XGBoost is implemented in three steps.

[0033] First step: Data preprocessing: data filtering, data cleaning, data classification, and balancing the dataset using the SMOTE method.

[0034] The second step: Build the XGBoost model and initialize the hyperparameters.

[0035] The third step is to use a genetic algorithm to optimize the network parameters, input each subsequence into the model for training and prediction, and compare and analyze the experimental results.

[0036] The specific process is as follows:

[0037] The first step involved using surface-based precipitation data provided by the Liaoning Provincial Meteorological Bureau. Data from the Chaoyang District observation station in Chaoyang City was used as the experimental subject, totaling 15,313 data points. This data was divided into a training set (70%) and a test set (30%), with the training set containing 10,719 data points and the test set containing 4,594 data points. The experiment involved inputting various meteorological elements to learn the interaction mechanisms between these elements, thus conducting a daily precipitation level classification study. Preprocessing was then performed on the precipitation data.

[0038] 1) Data Filtering: Because some meteorological elements do not have a significant correlation with precipitation, inputting all of them into the model would cause information redundancy, consume computational resources, and reduce training efficiency. Therefore, the contribution of each feature to precipitation is obtained to determine which features should be discarded. This invention uses three models (XGBoost, RF, and GBDT) to output the contribution ranking of each feature in the dataset. Since Python libraries following the sklearn interface are available for these three models, and each fitted model exposes a built-in `feature_importances_` attribute giving the feature contribution ranking, this attribute is read in the experiment to output the contribution of each meteorological element to precipitation. Taking data from the Chaoyang District observation station in Liaoning Province as an example, the obtained feature contributions are shown in Figures 1-3. By comprehensively comparing the output results of the three models, it was found that the contribution of the SNDP feature was extremely small in each model, so it was removed. The same procedure was applied to the data from the other sites selected in the experiment, and based on the results the SNDP feature was removed for all of them.
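The filtering rule described above (drop a feature only if every model ranks it at the bottom) can be sketched as follows. The importance values are invented for illustration and are not the values behind Figures 1-3. The feature names other than SNDP and GUST are hypothetical, and the helper `lowest_ranked_in_all` is not part of any library. In a real run each inner dictionary would be built from a fitted model's `feature_importances_` array.

```python
def lowest_ranked_in_all(importances_by_model, bottom_n=1):
    """Return the features that fall in the bottom_n of every model's ranking."""
    bottoms = []
    for scores in importances_by_model.values():
        ranked = sorted(scores, key=scores.get)  # ascending importance
        bottoms.append(set(ranked[:bottom_n]))
    return set.intersection(*bottoms)


# Illustrative (made-up) importances for three models.
importances = {
    "XGBoost": {"TEMP": 0.30, "DEWP": 0.25, "SLP": 0.24, "GUST": 0.20, "SNDP": 0.01},
    "RF":      {"TEMP": 0.28, "DEWP": 0.27, "SLP": 0.22, "GUST": 0.21, "SNDP": 0.02},
    "GBDT":    {"TEMP": 0.33, "DEWP": 0.24, "SLP": 0.23, "GUST": 0.19, "SNDP": 0.01},
}

to_drop = lowest_ranked_in_all(importances)
```

Requiring agreement across all three models is a conservative choice: a feature that only one model considers unimportant is kept.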

[0039] 2) Data cleaning: Some features in the data contain default (placeholder) values. For example, a GUST feature value of 999.9 indicates a default value. During processing, default values are searched for column by column, and the fillna function and mean method are used to replace them with the column average.
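A dependency-free sketch of this cleaning step is given below. It assumes, as an illustration, that 999.9 is the only placeholder and that the mean is taken over the valid entries of the same column. In pandas the same effect would typically be obtained by replacing the placeholder with NaN and then calling `fillna` with the column `mean()`. The row values and the TEMP column are hypothetical.

```python
from statistics import mean

SENTINEL = 999.9  # placeholder value the text gives for a missing GUST reading


def impute_column_mean(rows, sentinel=SENTINEL):
    """Replace sentinel entries in each column with the mean of that column's valid values."""
    cleaned_columns = []
    for col in zip(*rows):  # iterate column by column
        valid = [v for v in col if v != sentinel]
        fill = mean(valid) if valid else sentinel  # leave an all-missing column untouched
        cleaned_columns.append([fill if v == sentinel else v for v in col])
    return [list(row) for row in zip(*cleaned_columns)]


# Hypothetical rows with columns (TEMP, GUST).
rows = [[20.0, 10.0], [22.0, 999.9], [24.0, 14.0]]
cleaned = impute_column_mean(rows)
```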

[0040] 3) Data Classification: According to the precipitation classification method provided by the Liaoning Provincial Meteorological Observatory, precipitation is typically divided into seven categories: no rain, light rain, moderate rain, heavy rain, rainstorm, heavy rainstorm, and extremely heavy rainstorm. Since the dataset used is at a daily scale, it is classified and labeled according to the 24-hour time period classification standard. The label categories are: 0, 1, 2, 3, 4, 5, and 6. Taking the data from the Chaoyang District observation station in Liaoning Province as an example, because the maximum precipitation at this station did not exceed 250 mm, the data is divided into six categories.
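The labeling step can be sketched as a simple threshold lookup. The thresholds below are an assumption: Table 1 is not reproduced in this text, so the sketch uses commonly cited 24-hour boundaries (0.1, 10, 25, 50, 100, and 250 mm), which are at least consistent with the statement that a station whose maximum stays below 250 mm yields only six categories. The function name `precipitation_grade` is hypothetical.

```python
from bisect import bisect_right

# Assumed lower bounds (mm per 24 h) of grades 1..6; replace with the Table 1 values.
LOWER_BOUNDS_MM = [0.1, 10.0, 25.0, 50.0, 100.0, 250.0]
GRADE_NAMES = [
    "no rain", "light rain", "moderate rain", "heavy rain",
    "rainstorm", "heavy rainstorm", "extremely heavy rainstorm",
]


def precipitation_grade(amount_mm):
    """Map a daily precipitation amount to a label 0..6."""
    return bisect_right(LOWER_BOUNDS_MM, amount_mm)


labels = [precipitation_grade(a) for a in (0.0, 5.0, 10.0, 60.0, 249.9, 300.0)]
```

`bisect_right` counts how many lower bounds the amount has reached, which is exactly the grade index under this encoding.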

[0041] 4) Balancing the dataset with the SMOTE method.

[0042] Because the numbers of samples in the different precipitation classes are imbalanced, the model may pay less attention to the minority classes, affecting model performance. This invention uses the SMOTE oversampling method to increase the number of minority class samples, importing the SMOTE function library to oversample the training set. The basic idea of this algorithm is as follows: for each minority class sample, its k nearest neighbors among the minority class samples are found using Euclidean distance as the standard, and a point is then randomly selected on the line connecting the sample and one of these neighbors as the newly generated minority class sample.
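The interpolation idea behind SMOTE can be illustrated with a small from-scratch sketch. This is not the library routine the text refers to (presumably the `SMOTE` class of the imbalanced-learn package, used through `fit_resample`). The function name, the choice of k, and the sample points are assumptions made for illustration.

```python
import math
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class feature vectors.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority)
```

Only the training set would be passed through such a step. The test set keeps its original class distribution, as the description states.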

[0043] The second step is to build the XGBoost model and initialize the hyperparameters.

[0044] XGBoost is an ensemble learning algorithm based on gradient boosting that uses decision trees as weak classifiers. It is an implementation and extension of the Gradient Boosting Decision Tree (GBDT) algorithm. The main process uses the gradient of a general loss function to fit an approximation of the residual. The core idea is: after trees $T_1$ to $T_{t-1}$ have been trained, these first t-1 trees are not adjusted; only the current t-th tree is fitted to the residual.

[0045] The loss function of the XGBoost algorithm is expressed in equation (1).

[0046] $$Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right) \tag{1}$$

[0047] In equation (1), $\sum_{i=1}^{n} l(y_i, \hat{y}_i)$ represents the sum of the training errors between the true values $y_i$ and the predicted values $\hat{y}_i$, and $\sum_{k=1}^{K} \Omega(f_k)$ represents the total complexity of the tree model, which is added to the loss function as a regularization term to prevent overfitting. Together, they constitute the overall optimization objective of the algorithm, i.e., the loss function.

[0048] Because XGBoost is a boosting algorithm, it follows the forward stagewise additive approach: the model's prediction after adding each tree is the sum of the predictions obtained from all the previous trees and the prediction of the current tree. In other words, XGBoost focuses on the optimization of the t-th tree, and the model's prediction after the t-th tree is shown in equation (2).

[0049] $$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t\left(x_i\right) \tag{2}$$

[0050] In equation (2), the prediction is the sum of $\hat{y}_i^{(t-1)}$, the predicted value obtained after training the first $t-1$ trees, and $f_t(x_i)$, where $f_t$ is the functional relationship between the current t-th tree and the input feature variable $x$, so that $f_t(x_i)$ represents the predicted value of the t-th tree. Therefore, the optimization objective of the t-th tree is expressed as shown in equation (3).

[0051] $$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t\left(x_i\right)\right) + \Omega\left(f_t\right) + \sum_{k=1}^{t-1} \Omega\left(f_k\right) \tag{3}$$

[0052] The last term in equation (3) represents the model complexity of the first t-1 trees, which is a fixed value. Therefore, it is not necessary to optimize this term in the next training step. The objective function can be approximated by using a second-order Taylor expansion, so the expression can be rewritten as shown in equation (4).

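[0053] $$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t\left(x_i\right) + \frac{1}{2} h_i f_t^2\left(x_i\right) \right] + \Omega\left(f_t\right) \tag{4}$$

[0054] In equation (4), $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ represents the first derivative of the objective function and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ represents its second derivative. Because $l(y_i, \hat{y}_i^{(t-1)})$ represents the loss between the actual values and the predicted values of the first t-1 trees, it is a fixed value, so terms that can be considered constants are removed in the subsequent derivation. The simplified version of the above equation is shown in equation (5).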
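To make the role of the first and second derivatives concrete, the following sketch evaluates the second-order Taylor approximation for a single sample. Squared error is used purely as an illustrative loss (an assumption; a multi-class classifier would use a different loss), because for a quadratic loss the expansion is exact and easy to check by hand. All function names are hypothetical.

```python
def squared_error(y, y_hat):
    """Illustrative loss l(y, y_hat) = (y - y_hat)^2."""
    return (y - y_hat) ** 2


def grad_hess(y, y_hat_prev):
    """First and second derivatives of the loss with respect to the previous prediction."""
    g = 2.0 * (y_hat_prev - y)
    h = 2.0
    return g, h


def taylor_approx(y, y_hat_prev, f_t):
    """Second-order expansion: l(y, y_prev) + g * f_t + 0.5 * h * f_t^2."""
    g, h = grad_hess(y, y_hat_prev)
    return squared_error(y, y_hat_prev) + g * f_t + 0.5 * h * f_t ** 2


y, y_hat_prev, f_t = 3.0, 2.0, 0.5
exact = squared_error(y, y_hat_prev + f_t)
approx = taylor_approx(y, y_hat_prev, f_t)
```

Here the constant term `squared_error(y, y_hat_prev)` does not depend on the new tree's output `f_t`, which is why it can be dropped when optimizing the t-th tree.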

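[0055] $$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t\left(x_i\right) + \frac{1}{2} h_i f_t^2\left(x_i\right) \right] + \Omega\left(f_t\right) \tag{5}$$

[0056] Then define the model of the t-th tree as $f_t(x) = w_{q(x)}$ and collect the samples that fall into each leaf node: $I_j = \{ i \mid q(x_i) = j \}$. Here $w$ represents the weights of the leaf nodes and $q$ represents the structure of the current tree: after inputting the feature variable $x$, $q$ maps it to a certain leaf node. The regularization term for the model complexity is defined in equation (6).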

[0057] $$\Omega\left(f_t\right) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \tag{6}$$

[0058] In equation (6), γ and λ are penalty coefficients, T is the number of leaf nodes, and $w_j$ is the output weight of leaf node j. Therefore, the objective function can be obtained as shown in equation (7).

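[0059] $$Obj^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \tag{7}$$

[0060] Define $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. The objective function is then shown in equation (8).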

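[0061] $$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^2 \right] + \gamma T \tag{8}$$

[0062] Equation (8) is a quadratic function of each $w_j$, whose minimum is obtained at $w_j^* = -\frac{G_j}{H_j + \lambda}$. Therefore, when $w_j = w_j^*$, the minimum of the loss function is obtained, as shown in equation (9).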

[0063] $$Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{9}$$
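As a small numerical check of the quadratic minimization for one leaf, the sketch below computes the minimizing leaf weight and the corresponding contribution to the objective. The gradient sums and the value of λ are arbitrary illustrative numbers, the function names are hypothetical, and the γT term is omitted because it does not depend on the weight.

```python
def optimal_leaf_weight(G, H, lam):
    """Minimizer of G*w + 0.5*(H + lam)*w^2 with respect to w."""
    return -G / (H + lam)


def leaf_objective(w, G, H, lam):
    """Contribution of one leaf to the objective, without the gamma*T term."""
    return G * w + 0.5 * (H + lam) * w ** 2


G, H, lam = -4.0, 3.0, 1.0  # illustrative sums of first/second derivatives in one leaf
w_star = optimal_leaf_weight(G, H, lam)
best = leaf_objective(w_star, G, H, lam)
closed_form = -0.5 * G ** 2 / (H + lam)
```

Evaluating `leaf_objective` slightly to either side of `w_star` gives a larger value, which is what the closed-form minimum predicts.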

[0064] A smaller loss function indicates better model performance. In actual training, when building the t-th tree, it's necessary to determine the optimal split point for the leaf nodes. A greedy algorithm is used for node splitting.

[0065] Start with a tree depth of 0 and follow these steps.

[0066] (1) Enumerate all available features for each leaf node;

[0067] (2) For each feature, the training samples of the node are sorted in ascending order according to the size of the feature value. The best split point of the feature is selected by traversal and the split gain of the feature is saved.

[0068] (3) Select the feature with the highest split gain as the splitting feature, use the best split point of that feature as the splitting position to split the node into two new leaf nodes, and assign the corresponding sample data to each new node.

[0069] (4) Return to step 1 and continue recursively until a specific condition is met.

[0070] When splitting a leaf node, the objective function before splitting is shown in equation (10).

[0071] $$Obj_{before} = -\frac{1}{2} \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda} + \gamma \tag{10}$$

[0072] In equation (10), $G_L$ and $G_R$ represent the first derivatives of the objective function corresponding to the left and right subtree nodes, and $H_L$ and $H_R$ represent the second derivatives of the objective function corresponding to the left and right subtree nodes.

[0073] The objective function after splitting is shown in equation (11).

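[0074] $$Obj_{after} = -\frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} \right] + 2\gamma \tag{11}$$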

[0075] The gain after splitting the objective function is shown in equation (12).

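[0076] $$Gain = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{\left(G_L + G_R\right)^2}{H_L + H_R + \lambda} \right] - \gamma \tag{12}$$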

[0077] This splitting gain is also a basis for judging the importance of features.
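The greedy split search over one feature can be sketched as follows, using the standard XGBoost gain expression (half the sum of the left and right scores minus the parent score, minus γ). The sample values, gradients, and the choice of the midpoint between adjacent feature values as the threshold are illustrative assumptions. A full implementation would repeat this for every feature of every leaf and stop when a depth or gain condition is met.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one leaf into a left and a right child."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma


def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan the samples in ascending feature order and return (best gain, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G_total, H_total = sum(grads), sum(hess)
    G_L = H_L = 0.0
    best_gain, best_threshold = float("-inf"), None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += grads[i]
        H_L += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between equal feature values
        gain = split_gain(G_L, H_L, G_total - G_L, H_total - H_L, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2.0  # midpoint (assumption)
    return best_gain, best_threshold


# Illustrative per-sample statistics for a single feature.
values = [1.0, 2.0, 3.0, 4.0]
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
gain, threshold = best_split(values, grads, hess)
```

In this toy data the gradients change sign between the second and third samples, so the search places the threshold there.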

[0078] The third step involves using a genetic algorithm to optimize the network parameters, inputting each subsequence into the model for training and prediction, and then comparing and analyzing the experimental results.

[0079] In the genetic algorithm, the optimized parameters are: number of iterations, maximum tree depth, L1 regularization, L2 regularization, model complexity penalty, and the percentage of training samples selected in each iteration. A parameter optimization space is defined, a genetic population is set up, and the XGBoost model is initialized with the updated population in each iteration. Table 2 shows the optimization of XGBoost hyperparameters using the genetic algorithm, a description of the hyperparameters, the range of settings during optimization, and the optimized parameter values.

[0080] Table 2 Model Parameter Settings

[0081]

[0082] The genetic operations are as follows:

[0083] In the selection operation, a tournament selection algorithm is adopted. The main idea is to randomly select k competitors from the population for each place in the mating pool; the competitor with the best fitness wins the place and obtains the right to pass on its genes.
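A minimal sketch of this selection scheme is shown below. The population entries, fitness values, and function names are placeholders chosen for illustration. The competitors are drawn without replacement, which is one common variant and an assumption here.

```python
import random


def tournament_select(population, fitness, k=3, rng=random):
    """Pick k distinct competitors at random and return the fittest one."""
    competitors = rng.sample(range(len(population)), k)
    winner = max(competitors, key=lambda i: fitness[i])
    return population[winner]


def build_mating_pool(population, fitness, k=3, seed=0):
    """Fill a mating pool of the same size as the population, one tournament per place."""
    rng = random.Random(seed)
    return [tournament_select(population, fitness, k, rng) for _ in population]


population = ["a", "b", "c", "d"]   # placeholder individuals
fitness = [0.2, 0.9, 0.5, 0.1]      # placeholder fitness values
pool = build_mating_pool(population, fitness, k=2)
```

Larger k increases selection pressure. With k equal to the population size, every tournament is won by the single best individual.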

[0084] In the crossover operation, the simulated binary crossover (SBX) operator is used, which simulates single-point crossover of binary-coded chromosomes: in single-point crossover, a position is randomly selected among the gene loci of two binary-coded chromosomes and the right-hand portions are swapped, resulting in two new chromosomes. Given two parent individuals, the SBX operator generates two offspring individuals as follows:

[0085]

[0086]

[0087]

[0088] In the formula, rand∈U(0,1), η is the distribution factor, which is a user-defined parameter. The larger η is, the greater the probability that the offspring individuals will be closer to the parent individuals.
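Since the parent and offspring symbols are not legible in this text, the following sketch uses the commonly published form of SBX with hypothetical names (`p1`, `p2` for the parents, `c1`, `c2` for the offspring, `eta` for the distribution factor). It applies the operator gene by gene and does not clip the offspring to parameter bounds, which a real hyperparameter search would need to do.

```python
import random


def sbx_crossover(p1, p2, eta=15.0, rng=random):
    """Simulated binary crossover on two real-valued parent vectors."""
    c1, c2 = [], []
    for x1, x2 in zip(p1, p2):
        r = rng.random()  # r in [0, 1)
        if r <= 0.5:
            beta = (2.0 * r) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - r))) ** (1.0 / (eta + 1.0))
        # The two children are placed symmetrically around the parents' midpoint.
        c1.append(0.5 * ((1.0 + beta) * x1 + (1.0 - beta) * x2))
        c2.append(0.5 * ((1.0 - beta) * x1 + (1.0 + beta) * x2))
    return c1, c2


parent_a = [6.0, 0.8]   # hypothetical genes, e.g. tree depth and subsample ratio
parent_b = [4.0, 0.6]
child_a, child_b = sbx_crossover(parent_a, parent_b, rng=random.Random(0))
```

A larger `eta` pushes `beta` toward 1, so the children stay close to the parents, matching the remark about the distribution factor above.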

[0089] In the mutation operation, a polynomial mutation method is chosen, which has the same probability distribution as the SBX operator. The mutation form is defined as follows:

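[0090] $$v'_k = v_k + \delta \times \left(u_k - l_k\right) \tag{14}$$

[0091] $$\delta = \begin{cases} \left[ 2u + (1 - 2u)\left(1 - \delta_1\right)^{\eta_m + 1} \right]^{\frac{1}{\eta_m + 1}} - 1, & u \le 0.5 \\ 1 - \left[ 2(1 - u) + 2(u - 0.5)\left(1 - \delta_2\right)^{\eta_m + 1} \right]^{\frac{1}{\eta_m + 1}}, & u > 0.5 \end{cases}$$

[0092] In the formula, $u \in U(0,1)$, $\delta_1 = (v_k - l_k)/(u_k - l_k)$, $\delta_2 = (u_k - v_k)/(u_k - l_k)$, and $\eta_m$ is the distribution index.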
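The mutation step can be sketched as below, following the widely used bounded form of polynomial mutation, which matches the definitions of δ1 and δ2 given above. Treating the two bounds as the lower and upper limits of the gene, the final clamping step, and the example values are assumptions added for illustration.

```python
import random


def polynomial_mutation(v, lower, upper, eta_m=20.0, rng=random):
    """Mutate a single real-valued gene v within [lower, upper]."""
    u = rng.random()
    d1 = (v - lower) / (upper - lower)
    d2 = (upper - v) / (upper - lower)
    power = 1.0 / (eta_m + 1.0)
    if u <= 0.5:
        delta = (2.0 * u + (1.0 - 2.0 * u) * (1.0 - d1) ** (eta_m + 1.0)) ** power - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)
                       + 2.0 * (u - 0.5) * (1.0 - d2) ** (eta_m + 1.0)) ** power
    mutated = v + delta * (upper - lower)
    return min(max(mutated, lower), upper)  # guard against rounding outside the bounds


mutated_depth = polynomial_mutation(5.0, 0.0, 10.0, rng=random.Random(0))
```

A larger distribution index `eta_m` concentrates the mutated value near the original gene, so it plays the same role here as the distribution factor does in SBX.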

[0093] To objectively evaluate the accuracy and effectiveness of the model in this invention, accuracy, recall, precision, and F1 score are selected to evaluate the model's performance from multiple perspectives. Recall represents the proportion of positive samples that are classified as positive out of the total number of positive samples; precision represents the proportion of true positive samples among those classified as positive; and the F1 score is the harmonic mean of precision and recall, balancing the two. Taking binary classification as an example, the definitions of the evaluation metrics are explained. For positive samples P and negative samples N, classifying a positive sample as positive is defined as True Positive (TP), classifying a positive sample as negative is defined as False Negative (FN), classifying a negative sample as positive is defined as False Positive (FP), and classifying a negative sample as negative is defined as True Negative (TN). The corresponding confusion matrix is shown in Table 3.

[0094] Table 3 Confusion Matrix

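[0096]
                     Predicted positive     Predicted negative
Actual positive      True Positive (TP)     False Negative (FN)
Actual negative      False Positive (FP)    True Negative (TN)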

[0097] Their expressions are as follows:

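[0098] $$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

[0099] $$Recall = \frac{TP}{TP + FN}$$

[0100] $$Precision = \frac{TP}{TP + FP}$$

[0101] $$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$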
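For the binary case described above, the four metrics can be computed directly from the confusion-matrix counts. The counts below are invented for illustration, and the function name is hypothetical. The sketch does not handle empty denominators, and it does not show how the per-class values would be averaged in the multi-class setting, which the text does not specify.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
```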

[0102] In the experiments of this invention, RF, GBDT, XGBoost, and GA-XGBoost models were used to train and test the above data. The comparison of model evaluation metrics is shown in Table 4, and the confusion matrix of the test set classification results is shown in Figure 4.

[0103] Experimental results show that:

[0104] As shown in Table 4, the accuracy of the GA-XGBoost model reached 0.811, which is an improvement of approximately 37.69%, 17.03%, and 9.30% compared to RF, GBDT, and XGBoost, respectively. Although the precision is slightly lower than that of GBDT and XGBoost, the F1 score reached 0.810, which is an improvement of approximately 19.29%, 8.29%, and 3.85% compared to RF, GBDT, and XGBoost, respectively. Since the recall rate is improved, the precision rate will decrease accordingly. However, the GA-XGBoost model of this invention has a better F1 score, indicating that using a genetic algorithm to optimize the hyperparameters of XGBoost can effectively obtain a suitable combination of hyperparameters, which has a certain effect on improving the model performance.

[0105] As can be seen from Figure 4, the GA-XGBoost model improves the classification accuracy for small amounts of precipitation, and its overall number of correctly classified cases is greater than that of the XGBoost, RF, and GBDT models. However, the classification accuracy for large amounts of precipitation is low for all four models. This may be because the proportion of large precipitation samples is small. Although oversampling was used to balance the dataset during training, many of the category 4 and 5 samples in the training set are interpolated, while the test set contains very few samples of these categories. These test samples have little similarity to the training samples of the same categories, which may lead to a larger test error for the model in these categories.

[0106] Table 4 Model Evaluation Indicators

[0107]

[0108] In summary, the accuracy of RF, GBDT, XGBoost, and GA-XGBoost increases sequentially, indicating that models based on the boosting ensemble strategy are more suitable for the research content of this invention. The GA-XGBoost model of this invention performs better in terms of accuracy, recall, and F1 score across various sites, resulting in better overall classification performance. This verifies the feasibility and effectiveness of using genetic algorithms to optimize XGBoost hyperparameters, as well as the generalization ability of the GA-XGBoost model in daily precipitation level classification. However, the number of category 4 and 5 samples correctly classified by GA-XGBoost decreases, and the accuracy of the other base models for these categories also needs improvement. This may be due to the large difference in sample size between classes, which affects model performance. Although data balancing was performed, oversampling the training set with the SMOTE method adds a large amount of interpolated data to the classes with few samples. This interpolated data has little correlation with the test set samples, resulting in lower classification accuracy for these categories. Further research is needed to improve the model and enhance classification accuracy.

[0109] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Therefore, these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.

Claims

1. A daily precipitation level classification method based on GA-XGBoost, characterized in that: Precipitation data is preprocessed, including data filtering, data cleaning, data classification, and dataset balancing using the SMOTE method. Build the XGBoost model and initialize the hyperparameters; Genetic algorithms are used to optimize network parameters. Each subsequence is input into the model for training and prediction, and the results are compared and analyzed. Specifically, this includes: Constructing a BiLSTM network model: Initialize the network parameters of the BiLSTM model, including the number of neurons in the forward and backward layers, the learning rate, and the number of neurons in the fully connected layer; Set the network structure, including the input layer, BiLSTM layer, tiling layer, and fully connected layer; In the genetic algorithm, the chromosome encoding method is defined, the population is initialized, the fitness function is set, and the genetic operation process is performed. In the genetic algorithm optimization process, the parameters to be optimized include: number of iterations, maximum tree depth, L1 regularization, L2 regularization, model complexity penalty, and the percentage of training samples selected in each iteration. The parameter optimization space is defined, the genetic population is set, and in each iteration, the updated population is used to initialize the XGBoost model. The training set divided from the precipitation data is input into the XGBoost model for training. The training accuracy is used as the fitness function value of the genetic algorithm. The fitness values of individuals are compared, and the best individual is updated and saved. The pre-defined test set from the precipitation data is input into the GA-XGBoost model for prediction, and the daily precipitation level classification results are output.

2. The daily precipitation level classification method based on GA-XGBoost according to claim 1, characterized in that: The data filtering includes: obtaining the contribution of each feature in the precipitation data to the precipitation amount, and discarding the features with the lowest contribution.

3. The daily precipitation level classification method based on GA-XGBoost according to claim 1, characterized in that: The data cleaning includes: for features with missing values in the data, searching for the missing values column by column during processing and replacing them with the average value using the fillna function and mean method.

4. The daily precipitation level classification method based on GA-XGBoost according to claim 1, characterized in that: The data classification includes dividing precipitation into seven categories: no rain, light rain, moderate rain, heavy rain, rainstorm, heavy rainstorm, and extremely heavy rainstorm. Because the dataset used is at a daily scale, it is classified and labeled according to the 24-hour time period classification standard. The label categories are: 0, 1, 2, 3, 4, 5, 6.

5. The daily precipitation level classification method based on GA-XGBoost according to claim 1, characterized in that: The smote method for balancing the dataset includes: using the smote oversampling method to increase the number of minority class samples, and importing the SMOTE function library to oversample the training set.