Imbalanced incremental learning-based data set classification method and system

An imbalanced incremental learning method that combines a variational autoencoder with an incremental AdaBoost classifier addresses the inefficiency and inaccurate classification caused by dynamic changes in dataset categories, enabling effective identification of minority classes and efficient classification of datasets.

WO2026129532A1 · PCT designated stage · Publication Date: 2026-06-25 · SHIHEZI UNIVERSITY

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SHIHEZI UNIVERSITY
Filing Date
2025-04-29
Publication Date
2026-06-25


Abstract

Provided are an imbalanced incremental learning-based data set classification method and system. The method comprises: acquiring an original data set, and preprocessing the original data set to obtain a training set and a test set (S10); determining an imbalanced incremental learning model, and on the basis of the training set and the test set, performing model training and model optimization on the imbalanced incremental learning model to obtain a data set classification model (S20); and acquiring a current data set to be processed, inputting said current data set into the data set classification model, and outputting a data set classification result (S30). The data set classification model is obtained by means of training and optimizing the imbalanced incremental learning model, and the classification efficiency and accuracy of the data set are effectively improved by means of the data set classification model.

Description

A dataset classification method and system based on unbalanced incremental learning

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a dataset classification method, system, terminal, and computer-readable storage medium based on unbalanced incremental learning.

Background Technology

[0002] With the development of the times, dynamic data environments are widely present in various application scenarios. Data not only flows in continuously, but is also often accompanied by the problem of category imbalance.

[0003] Traditional data processing models struggle to effectively handle dynamic changes in the categories within a dataset, especially in imbalanced data scenarios. This often results in decreased model performance and weak ability to identify minority classes, leading to low classification efficiency and inaccurate classification results.

[0004] Therefore, existing technologies still need to be improved and developed.

Summary of the Invention

[0005] The main objective of this invention is to provide a dataset classification method, system, terminal, and computer-readable storage medium based on unbalanced incremental learning. This invention aims to solve the problem that traditional data processing models in the prior art are unable to effectively cope with dynamic changes in categories within a dataset, resulting in low classification efficiency and inaccurate classification results.

[0006] To achieve the above objectives, this invention provides a dataset classification method based on imbalanced incremental learning, which includes the following steps:

[0007] Obtain the original dataset and preprocess it to obtain the training set and the test set;

[0008] An imbalanced incremental learning model is determined, and the imbalanced incremental learning model is trained and optimized based on the training set and the test set to obtain a dataset classification model.

[0009] Obtain the current dataset to be processed, input the current dataset to be processed into the dataset classification model, and output the dataset classification result.

[0010] Optionally, the dataset classification method based on unbalanced incremental learning, wherein obtaining the original dataset and preprocessing the original dataset to obtain the training set and test set specifically includes:

[0011] Obtain the original dataset and perform data cleaning on the original dataset to obtain the first dataset;

[0012] Determine whether the first dataset is a categorical feature. If so, use one-hot encoding to perform numerical conversion on the first dataset to obtain the second dataset.

[0013] The second dataset is then divided into a training set and a test set.

[0014] Optionally, the dataset classification method based on unbalanced incremental learning further includes, after determining whether the first dataset is a categorical feature:

[0015] If the first dataset is not a categorical feature, then determine whether the first dataset is a numerical feature;

[0016] If so, the first dataset is normalized using a standardization tool to obtain the third dataset;

[0017] The third dataset is then divided into training and test sets.

[0018] Optionally, in the dataset classification method based on unbalanced incremental learning, the unbalanced incremental learning model includes a variational autoencoder and an incremental adaptive adjustment classifier.

[0019] The process of determining an imbalanced incremental learning model and training and optimizing the model using the training set and the test set to obtain a dataset classification model specifically includes:

[0020] The training set is input into the variational autoencoder in the unbalanced incremental learning model, and the variational autoencoder performs feature extraction processing on the training set to obtain the mean and variance of the latent variables.

[0021] The mean and variance of the latent variables are input into the incremental adaptive adjustment classifier. The incremental adaptive adjustment classifier is used to perform classification training on the mean and variance of the latent variables to obtain the initial dataset classification model.

[0022] The initial dataset classification model is evaluated using the test set to obtain model performance evaluation results. The model performance evaluation process includes classification performance evaluation, residual analysis, robustness testing, and feature importance evaluation.

[0023] Based on the model performance evaluation results, the initial dataset classification model is optimized to obtain a dataset classification model.

[0024] Optionally, in the dataset classification method based on imbalanced incremental learning, the variational autoencoder includes an encoder and a decoder.

[0025] The step of inputting the training set into the variational autoencoder in the imbalanced incremental learning model, and performing feature extraction processing on the training set through the variational autoencoder to obtain the mean and variance of the latent variables, specifically includes:

[0026] The training set is input into the encoder in the variational autoencoder, and the encoder performs data mapping and gradient descent optimization on the training set to obtain latent variables.

[0027] The latent variables are input into the decoder, and the decoder performs remapping processing on the latent variables to obtain reconstructed data;

[0028] Calculate the reconstruction error and relative entropy between the training set and the reconstructed data, and weight the reconstruction error and the relative entropy to obtain the overall loss function;

[0029] The variational autoencoder is optimized based on the overall loss function to obtain the optimized variational autoencoder, and the mean and variance of the latent variables are output.

[0030] Optionally, the dataset classification method based on unbalanced incremental learning, wherein the step of inputting the mean and variance of the latent variables into the incremental adaptive adjustment classifier, and performing classification training on the mean and variance of the latent variables through the incremental adaptive adjustment classifier to obtain an initial dataset classification model, specifically includes:

[0031] Sample weights are initialized for the mean and variance of the latent variables to obtain the initialized sample weights;

[0032] Multiple preset first classifiers are determined and trained according to the initialized sample weights to obtain multiple second classifiers;

[0033] Calculate the weighted error rate of each second classifier on the training set, and calculate the classifier weight of each second classifier based on the weighted error rate;

[0034] The classification results of multiple second classifiers are obtained, and the classifier weights are updated according to the classification results and the weighted error rate to obtain an optimized incremental adaptive adjustment classifier.

[0035] The initial dataset classification model is obtained from the optimized variational autoencoder and the optimized incremental adaptive adjustment classifier.

[0036] Optionally, the dataset classification method based on unbalanced incremental learning, wherein obtaining the current dataset to be processed, inputting the current dataset to be processed into the dataset classification model, and outputting the dataset classification result specifically includes:

[0037] Obtain the current dataset to be processed, and input the current dataset to be processed into the dataset classification model;

[0038] The dataset classification model is used to perform academic warning analysis on the current dataset to be processed, and the dataset classification result is obtained.

[0039] Furthermore, to achieve the above objectives, the present invention also provides a dataset classification system based on imbalanced incremental learning, wherein the dataset classification system based on imbalanced incremental learning includes:

[0040] The preprocessing module is used to obtain the original dataset and preprocess the original dataset to obtain the training set and the test set;

[0041] The model training module is used to determine the unbalanced incremental learning model, and to train and optimize the unbalanced incremental learning model based on the training set and the test set to obtain a dataset classification model.

[0042] The model output module is used to obtain the current dataset to be processed, input the current dataset to be processed into the dataset classification model, and output the dataset classification result.

[0043] Furthermore, to achieve the above objectives, the present invention also provides a terminal, wherein the terminal includes: a memory, a processor, and a dataset classification program based on unbalanced incremental learning stored in the memory and executable on the processor, wherein when the dataset classification program based on unbalanced incremental learning is executed by the processor, it implements the steps of the dataset classification method based on unbalanced incremental learning as described above.

[0044] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a dataset classification program based on unbalanced incremental learning, and the dataset classification program based on unbalanced incremental learning, when executed by a processor, implements the steps of the dataset classification method based on unbalanced incremental learning as described above.

[0045] In this invention, an original dataset is acquired and preprocessed to obtain a training set and a test set. An imbalanced incremental learning model is determined, and the model is trained and optimized using the training and test sets to obtain a dataset classification model. The current dataset to be processed is acquired and input into the dataset classification model, outputting the dataset classification result. This invention obtains training and test sets by preprocessing the original dataset, and then trains and optimizes the imbalanced incremental learning model using these sets to obtain a dataset classification model. This dataset classification model improves the ability to identify minority samples and enables rapid and accurate dataset analysis, effectively improving the classification efficiency and accuracy of the dataset.

Attached Figure Description

[0046] Figure 1 is a flowchart of a preferred embodiment of the dataset classification method based on unbalanced incremental learning of the present invention;

[0047] Figure 2 is a schematic diagram of the overall structure of a preferred embodiment of the dataset classification method based on unbalanced incremental learning of the present invention.

[0048] Figure 3 is a schematic diagram of the overall implementation process of a preferred embodiment of the dataset classification method based on unbalanced incremental learning of the present invention.

[0049] Figure 4 is a schematic diagram illustrating the student academic warning situation analysis of a preferred embodiment of the dataset classification method based on unbalanced incremental learning of the present invention.

[0050] Figure 5 is a structural diagram of a preferred embodiment of the dataset classification system based on unbalanced incremental learning according to the present invention;

[0051] Figure 6 is a structural diagram of a preferred embodiment of the terminal of the present invention.

Detailed Implementation

[0052] To make the objectives, technical solutions, and advantages of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0053] With the development of the times, dynamic data environments are widely present in various application scenarios. Whether in medical, educational, or industrial process control environments, data not only flows in continuously, but is also often accompanied by the problem of category imbalance.

[0054] Traditional data processing methods struggle to effectively handle the dynamic changes in categories within data streams, especially in imbalanced data scenarios, often resulting in degraded model performance or weak ability to identify minority classes. Furthermore, information loss is a common challenge in dynamic data processing, particularly in constantly changing data streams. Maintaining a memory of old data while efficiently learning from new data has become a critical challenge that urgently needs to be addressed.

[0055] To address the aforementioned problems, this invention proposes a dataset classification method based on imbalanced incremental learning. The method uses a variational autoencoder for efficient feature extraction, minimizing information loss during the transformation process. It also uses the dynamic adaptability and adaptive weight adjustment mechanism of the incremental AdaBoost classifier to improve the recognition of minority class data when processing imbalanced datasets. By combining a variational autoencoder with an incremental AdaBoost classifier, the invention improves the processing performance on imbalanced datasets.

[0056] The dataset classification method based on imbalanced incremental learning according to a preferred embodiment of the present invention, as shown in Figures 1 and 2, includes the following steps:

[0057] Step S10: Obtain the original dataset and preprocess the original dataset to obtain the training set and the test set.

[0058] Upon obtaining the original dataset, this invention performs data cleaning and preprocessing. This includes feature discrimination of the original dataset to determine whether it is a categorical feature, a numerical feature, or another type of feature, and then dividing it into a training dataset and a test set for the model.

[0059] Specifically, the original dataset is obtained and data cleaning is performed on the original dataset to obtain the first dataset; it is determined whether the first dataset is a categorical feature. If so, one-hot encoding is used to perform numerical conversion on the first dataset to obtain the second dataset; the second dataset is divided into a training set and a test set.

[0060] Furthermore, if the first dataset is not a categorical feature, then it is determined whether the first dataset is a numerical feature; if so, a standardization tool is used to normalize the first dataset to obtain a third dataset; the third dataset is then divided into a training set and a test set.

[0061] As shown in Figure 3, after the original dataset is obtained, it is first cleaned and preprocessed. The preprocessing is as follows: unordered categorical features in the data (such as gender, name, etc.) are converted to numerical form using one-hot encoding; numerical features in the data are normalized with a standardization tool so that each feature participates in training at the same scale. The dataset is then divided in an 8:2 ratio into a training set and a test set (i.e., the training set and test set in this invention).
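The preprocessing described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names (`one_hot`, `standardize`, `split_train_test`), the toy records, and the choice to standardize a numerical score column are all hypothetical.

```python
import random
from statistics import mean, pstdev

def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector (sorted category order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def standardize(values):
    """Rescale a numerical column to zero mean and unit variance."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sd for v in values]

def split_train_test(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows and split them 8:2 by default."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(round(len(rows) * train_fraction))
    return rows[:cut], rows[cut:]

# Hypothetical toy records: one categorical column and one numerical column.
genders = ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"]
scores = [55.0, 72.0, 90.0, 61.0, 48.0, 83.0, 77.0, 66.0, 59.0, 94.0]
rows = [g + [s] for g, s in zip(one_hot(genders), standardize(scores))]
train_set, test_set = split_train_test(rows)
```

With ten records, the split yields eight training rows and two test rows, each holding two indicator columns followed by one standardized value.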

[0062] Step S20: Determine the unbalanced incremental learning model, and train and optimize the unbalanced incremental learning model according to the training set and the test set to obtain a dataset classification model, wherein the unbalanced incremental learning model includes a variational autoencoder and an incremental adaptive adjustment classifier.

[0063] In dynamic data environments, information preservation and class balance are key bottlenecks in existing data processing methods. Due to the continuous addition of new data, traditional methods often face information loss during feature extraction and transformation. This loss not only affects the overall model performance but also weakens the memory of old data categories, making it impossible to balance the influence of new and old data. Furthermore, class imbalance is particularly pronounced in real-time data streams, where minority class samples, being fewer in number, are easily overlooked, causing the model to favor the majority class. Therefore, this invention proposes a solution that dynamically adjusts the learning strategy and weight mechanism by combining a variational autoencoder and an incremental AdaBoost classifier. This maximizes information retention during feature generation and transformation, and improves sensitivity to minority class samples through real-time weight adjustment.

[0064] Understandably, class imbalance and information loss are the most critical problems in dynamic data environments, making model performance improvements more difficult compared to static data environments. Existing technologies suffer from inefficiency, insufficient adaptability, and inadequate minority class identification capabilities when facing dynamic changes in class evolution and feature distribution within data streams. Therefore, this invention proposes a data processing method to address class imbalance and information loss in dynamic data environments. By dynamically adjusting the learning strategy, it effectively improves the model's adaptability and stability in data streams while simultaneously ensuring effective identification of minority classes.

[0065] As shown in Figure 2, the present invention aims to solve the problems of class imbalance and information loss in dynamic data environments by combining a variational autoencoder (VAE) with an incremental AdaBoost classifier (i.e., the incremental adaptive adjustment classifier in the present invention), thereby adapting efficiently to the dynamic evolution of classes in data streams.

[0066] This invention is divided into the following three parts:

[0067] 1. Feature Extraction Stage: The variational autoencoder is used to extract and optimize features from the data. The latent space of the variational autoencoder is used to reduce the dimensionality and compress the information of the data, ensuring that the key information in the original data is preserved to the maximum extent during the feature generation process. At the same time, the self-attention mechanism is used to enhance the global correlation representation ability between features.

[0068] 2. Dynamic Classification Stage: An incremental Adaboost (adaptive boosting algorithm) classifier is introduced to classify the extracted features. A dynamic sample weighting mechanism is adopted to progressively update the model parameters in response to changes in the class in the data stream, enhancing the model's adaptability to new classes. Simultaneously, by calculating sample weights, the distribution of samples from different classes in the dataset is balanced, optimizing the model's classification performance on imbalanced data.

[0069] 3. Real-time Adjustment Phase: During the classification process, by dynamically analyzing the changes in the class distribution in the current data stream and combining it with an adaptive weight adjustment mechanism, the weight ratio of different classes in the loss function is adjusted to avoid the model being overly biased towards the majority class. Based on this, perturbation analysis is used to test the model's robustness, effectively verifying its stability in the face of input data fluctuations or noise.

[0070] Furthermore, this invention combines residual analysis, learning curve plotting, and feature importance assessment to comprehensively analyze and optimize the model's performance. By combining a dynamic incremental learning strategy and an unbalanced optimization mechanism, this invention significantly improves the model's adaptability and generalization ability in dynamic data flow environments, while also improving the recognition performance for minority classes.
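To make the emphasis on minority-class recognition concrete, the sketch below computes recall per class alongside overall accuracy. The text does not specify which metrics are used, so the metric choice, the function name `per_class_recall`, and the toy labels are illustrative assumptions.

```python
def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted samples / true samples of that class."""
    recalls = {}
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return recalls

# Hypothetical 8:2 imbalanced labels and a classifier that always predicts class 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
```

Here accuracy is 0.8 even though the minority class is never detected (its recall is 0.0), which is why a per-class view is useful on imbalanced data.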

[0071] Specifically, the training set is input into the encoder of the variational autoencoder, which performs data mapping and gradient descent optimization on the training set to obtain latent variables. The latent variables are then input into the decoder, which performs remapping on them to obtain reconstructed data. The reconstruction error and relative entropy between the training set and the reconstructed data are calculated, and the reconstruction error and relative entropy are weighted to obtain an overall loss function. The variational autoencoder is optimized based on the overall loss function to obtain an optimized variational autoencoder, and the mean and variance of the latent variables are output.

[0072] Feature Extraction Principle: In dynamic data environments, feature distributions often change over time, making it difficult for traditional methods to achieve feature dimensionality reduction and optimization while maintaining information integrity. This invention uses a variational autoencoder to perform preliminary feature extraction on the input data. The encoder generates the mean and variance of the data and maps it to a latent space. Features in the latent space are compact and information-complete, helping to reduce redundant information in the input data. Furthermore, a self-attention mechanism is embedded in the encoding process, enabling the model to capture global dependencies between input features, further optimizing feature representation.

[0073] As shown in Figures 2 and 3, in the feature extraction stage, this invention utilizes a variational autoencoder (VAE) to reduce the dimensionality of the input data. The encoder module of the VAE generates a latent space representation of the data, thereby ensuring efficient compression and preservation of feature information. Simultaneously, a self-attention mechanism is embedded to model the global dependencies between features, further optimizing the feature representation.

[0074] In the specific implementation process, the input data is first received. Preferably, a dataset is used (although other datasets can also be used, depending on the actual needs). The dataset is then fed into the encoder of the variational autoencoder. The encoder is a neural network that maps the input data to parameters in the latent space. Assuming the input data is a vector x, the encoder's task is to learn the distribution of the latent space. Typically, this latent space is assumed to follow a normal distribution. The encoder's output is the mean μ and variance σ² of the latent space; these parameters describe the probability distribution of the latent variables. Therefore, the encoder's output is a pair of functions of the input: $\mu(x)$ and $\sigma^2(x)$.

[0075] To effectively perform gradient descent optimization, the VAE employs a reparameterization technique, which expresses a sample of the latent variable as a deterministic function of the distribution parameters and an independent noise variable, rather than sampling directly from the latent distribution. This enables backpropagation and gradient computation throughout the process. Assume the latent variable z follows a normal distribution, $z \sim \mathcal{N}(\mu, \sigma^2)$. The reparameterization technique expresses the latent variable z as follows: $z = \mu + \sigma \odot \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, where $\sigma$ is the standard deviation (the square root of the variance), $\odot$ denotes element-wise multiplication, and $\epsilon$ is standard normal random noise, that is, a random variable sampled from a standard normal distribution with a mean of 0 and a variance of 1. $\epsilon$ is introduced in the reparameterization trick as noise to aid gradient computation. The symbol $\sim$ means "is distributed as", that is, the value is sampled from the specified probability distribution. $\mathcal{N}(0, I)$ denotes a multivariate normal distribution in which $0$ is the zero mean vector (the mean of each dimension is 0) and $I$ is the identity covariance matrix, so that the dimensions of the distribution are independent of each other and the variance of each dimension is 1. In this way, z becomes differentiable with respect to $\mu$ and $\sigma$ and can be optimized using gradient descent.
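The reparameterization step can be sketched as follows: z is computed as the mean plus the standard deviation times standard normal noise, element by element. This is a minimal illustration rather than the patented implementation. It assumes the encoder emits the log-variance (a common convention that the text does not state), and the names `reparameterize`, `mu`, and `log_var` are hypothetical.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), element-wise."""
    z = []
    for m, lv in zip(mu, log_var):
        sigma = math.exp(0.5 * lv)   # standard deviation from the log-variance
        eps = rng.gauss(0.0, 1.0)    # the only source of randomness
        z.append(m + sigma * eps)
    return z

rng = random.Random(0)
mu = [0.5, -1.0]
log_var = [0.0, math.log(0.25)]      # variances 1.0 and 0.25
samples = [reparameterize(mu, log_var, rng) for _ in range(20000)]
empirical_mean = [sum(s[d] for s in samples) / len(samples) for d in range(2)]
```

Because the randomness enters only through `eps`, z is a smooth function of `mu` and `log_var`, which is what allows gradients to flow back to the encoder. Averaged over many draws, the samples centre on `mu`.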

[0076] Furthermore, the latent variable z is passed to the decoder, which is also a neural network. The decoder maps z from the latent space back to the data space, generating new data with a similar structure to the original input data. Specifically, the decoder generates the reconstructed data as follows: $\hat{x} = g(z)$, where $g$ is the neural network function of the decoder, which maps the latent variable z to a space with the same dimension as the input data.

[0077] The training objective of the VAE is to minimize the difference between the input and reconstructed data while ensuring that the distribution of the latent space approximates a standard normal distribution. The loss therefore consists of two main parts: reconstruction error and KL divergence. The reconstruction error measures the difference between the input and the reconstructed data, typically using mean squared error (MSE) or cross-entropy; its goal is to make the generated samples as close as possible to the original input samples. The KL divergence measures the difference between the distribution of the latent variables $q(z \mid x)$ and the prior distribution $p(z)$, and is used to constrain the distribution of the latent space to approximate a standard normal distribution. This avoids overly complex latent space distributions and enables the model to generate samples with better generalization ability. The expression for the KL divergence is: $L_{\mathrm{KL}} = D_{\mathrm{KL}}\left(q(z \mid x) \,\|\, p(z)\right)$, where $L_{\mathrm{KL}}$ is the value of the KL divergence, used as part of the loss function, and $D_{\mathrm{KL}}$ is the divergence operator, which measures the difference between two distributions. The overall loss function of the VAE is a weighted sum of the reconstruction error and the KL divergence (shown here with unit weights): $L_{\mathrm{VAE}} = L_{\mathrm{recon}} + L_{\mathrm{KL}}$, where $L_{\mathrm{VAE}}$ is the overall loss function, $L_{\mathrm{recon}}$ is the reconstruction error, and $L_{\mathrm{KL}}$ is the KL divergence. During training, the VAE updates the model's parameters by optimizing this overall loss function.
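For a diagonal Gaussian posterior and a standard normal prior, the KL term has the well-known closed form $-\tfrac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$. The sketch below combines it with an MSE reconstruction error. The function names and the `kl_weight` parameter are illustrative assumptions, and the log-variance parameterization is a convention not stated in the text.

```python
import math

def reconstruction_mse(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def vae_loss(x, x_hat, mu, log_var, kl_weight=1.0):
    """Weighted sum of reconstruction error and KL divergence."""
    return reconstruction_mse(x, x_hat) + kl_weight * kl_to_standard_normal(mu, log_var)
```

The KL term is zero when the posterior already equals the prior (`mu = 0`, `log_var = 0`) and grows as the mean moves away from zero or the variance departs from one.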

[0078] After training, the VAE can be used to generate new data. For example, in image generation tasks, a latent variable z is sampled from the standard normal distribution in the latent space and then input into the decoder to generate a new image. The generated image is based on the sampled latent variable and has a distribution similar to that of the training data.

[0079] Embedding self-attention mechanisms in VAEs typically involves adding self-attention layers to the encoder or decoder. After the input features pass through several convolutional layers, a self-attention module is added. This module is used to compute the relationships between each feature and to perform weighted combinations of the input features based on these relationships, thereby generating an enhanced feature representation.

[0080] The self-attention process in a VAE encoder is as follows: 1. Input data passes through convolutional layers in the encoder. 2. Through the self-attention layer, the global dependencies of the input features are calculated to obtain a weighted feature representation. 3. Based on the weighted feature representation, the mean μ and variance σ of the latent variables are generated. 4. The latent variable z is sampled from the latent space using a reparameterization technique.

[0081] The self-attention process in the VAE decoder is as follows: 1. The decoder starts with the latent variable z and recovers the data through a series of deconvolutions. 2. A self-attention layer is added to model the global dependencies between features during the recovery process. 3. By optimizing the reconstruction error and KL divergence, reconstructed data that is as similar as possible to the input data is generated.
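The self-attention step referred to above can be illustrated with scaled dot-product attention, in which each feature vector is replaced by a weighted combination of all feature vectors. This is a simplified sketch: it uses the inputs directly as queries, keys, and values (no learned projections), and the text does not specify the attention variant, so both the form and the name `self_attention` are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X] for q in X]
    attn = [softmax(row) for row in scores]
    out = [[sum(w * v[j] for w, v in zip(row, X)) for j in range(d)] for row in attn]
    return out, attn

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three hypothetical feature vectors
out, attn = self_attention(X)
```

Each row of `attn` sums to one and gives the relative influence of every feature vector on the corresponding output row.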

[0082] As shown in Figures 2 and 3, after the feature extraction stage is completed, the dynamic classification stage begins. This invention introduces an incremental Adaboost (adaptive boosting algorithm) classifier to classify the extracted features. A dynamic sample weighting mechanism is used to progressively update the model (i.e., the incremental Adaboost classifier model) parameters in response to changes in the class distribution in the data stream, enhancing the model's adaptability to new class data. Simultaneously, by calculating sample weights, the distribution of samples from different classes in the dataset is balanced, optimizing the model's classification performance on imbalanced data.

[0083] After the feature extraction stage is completed, the real-time adjustment stage begins: During classification, by dynamically analyzing the changes in the class distribution in the current data stream and combining it with an adaptive weight adjustment mechanism, the weight ratio of different classes in the loss function is adjusted to avoid the model being overly biased towards the majority class. Based on this, perturbation analysis is used to test the robustness of the model, verifying its stability in the face of input data fluctuations or noise.
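One simple way to carry out the perturbation analysis mentioned above is to add Gaussian noise to each input and measure how often the predicted label stays the same. The text does not describe the exact procedure, so the noise model, the function name `prediction_stability`, and the stand-in classifier are assumptions.

```python
import random

def prediction_stability(predict, samples, noise_std=0.1, trials=20, seed=0):
    """Fraction of noisy copies whose predicted label matches the clean prediction."""
    rng = random.Random(seed)
    same = 0
    for x in samples:
        clean = predict(x)
        for _ in range(trials):
            noisy = [v + rng.gauss(0.0, noise_std) for v in x]
            same += predict(noisy) == clean
    return same / (len(samples) * trials)

# Hypothetical stand-in classifier: thresholds the first feature at zero.
def toy_predict(x):
    return 1 if x[0] > 0 else -1

far = [[3.0, 0.0], [-3.0, 1.0]]      # far from the decision boundary
near = [[0.01, 0.0], [-0.01, 1.0]]   # close to the decision boundary
```

A score near 1.0 (as for `far`) indicates predictions that are stable under noise, while samples close to the decision boundary (as in `near`) score lower.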

[0084] Specifically, sample weights are initialized for the mean and variance of the latent variables to obtain initialized sample weights; multiple preset first classifiers are determined and trained according to the initialized sample weights to obtain multiple second classifiers; the weighted error rate of each second classifier on the training set is calculated, and the classifier weight of each second classifier is calculated from that weighted error rate; the classification results of the multiple second classifiers are obtained, and the classifier weights are updated according to the classification results and the weighted error rate to obtain an optimized incremental adaptive adjustment classifier; an initial dataset classification model is then obtained from the optimized variational autoencoder and the optimized incremental adaptive adjustment classifier.
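The text does not give formulas for the weighted error rate or the classifier weight. As an illustration only, the sketch below uses the standard binary AdaBoost rules: the error is the weighted share of misclassified samples, the classifier weight is α = ½·ln((1 − err)/err), and sample weights are scaled by exp(−α·y·h(x)) and renormalized. The function name `adaboost_round` and the toy data are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One boosting round for labels in {-1, +1}.

    Returns (weighted error rate, classifier weight, updated sample weights).
    """
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    err = min(max(err, 1e-10), 1 - 1e-10)   # keep the logarithm finite
    alpha = 0.5 * math.log((1 - err) / err)
    new_w = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new_w)
    return err, alpha, [w / norm for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # hypothetical weak-classifier output; sample 1 is wrong
weights = [1 / 5] * 5
err, alpha, weights = adaboost_round(weights, y_true, y_pred)
```

After this round, the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.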

[0085] The dynamic classification principle set in this invention is as follows: The incremental Adaboost classifier is used to process the features extracted by VAE in this invention. Adaboost constructs a series of weak classifiers. Here, this invention selects the linear regression model, decision tree model, SVM model, XGBoost model and LightGBM model as weak classifiers (i.e., multiple preset first classifiers in this invention) to make full use of the advantages of each model. The advantages and disadvantages of the above models are compared and analyzed in Table 1.

[0086] Table 1: Comparison and Analysis of the Advantages and Disadvantages of Each Model

[0087] [Table 1 is not reproduced in this text.]

[0088] Furthermore, this invention dynamically adjusts the weights of each classifier based on the classification error rate to form a strong classifier (selecting linear regression, decision tree, SVM, XGBoost, and LightGBM models as weak classifiers, and constructing the best-performing strong classifier by adjusting different parameters of each model, i.e., the optimized incremental adaptive classifier in this invention). In an environment where data flow changes dynamically, this invention combines an incremental learning method to gradually update the classifier model to adapt to newly added data categories, while retaining the memory of old data categories. This dynamic adjustment mechanism ensures that both the majority and minority classes in the data can be effectively identified through sample weighting.

[0089] Understandably, during the classification phase, an incremental Adaboost classifier is used to classify the latent features extracted by the VAE. The classifier incorporates a dynamic sample weighting mechanism to address class imbalance, adaptively adjusting the contributions of minority and majority class samples during model training by calculating sample weights. Furthermore, the classifier model is progressively updated using an incremental learning method to adapt to new class distributions in a dynamic data flow environment.

[0090] The classification process is implemented as follows: 1. Initialize sample weights: Before the first Adaboost iteration, the weights of all training samples are initialized to equal values. The expression for initializing the training sample weights is: $w_i^{(1)} = \frac{1}{N}$ for all samples $i = 1, \dots, N$, where $w_i^{(1)}$ is the initial weight of training sample $x_i$ and $N$ is the total number of training samples. This ensures that each sample contributes equally to the training of the model in the initial stage.

[0091] The specific process of training a weak classifier is as follows:
1. Select a weak classifier: In each round, a weak classifier (linear regression model, decision tree, SVM, XGBoost, or LightGBM) is selected and trained using the current sample weights. Each weak classifier is a simple model and usually performs poorly on its own, but ensembling multiple weak classifiers yields a powerful classifier.
2. Training process: Each weak classifier is trained on the dataset weighted by the current sample weights, so that training pays more attention to samples that were misclassified in earlier rounds.
3. Calculate the weighted error rate of the weak classifier: After the weak classifier is trained, its weighted error rate on the current sample set is calculated. The weighted error rate measures the classifier's performance given the sample weights:
$$\varepsilon_t = \frac{\sum_{i=1}^{N} w_i^{(t)}\, I\big(h_t(x_i) \neq y_i\big)}{\sum_{i=1}^{N} w_i^{(t)}}$$
where $\varepsilon_t$ is the weighted error rate of the t-th round weak classifier, $h_t(x_i)$ is the t-th round weak classifier's prediction for sample $x_i$, $y_i$ is the true label of sample $x_i$, $w_i^{(t)}$ is the weight of sample $x_i$ in round t, and $I(\cdot)$ is the indicator function.
4. Calculate the weight of the weak classifier: The weight $\alpha_t$ of each weak classifier is calculated from its weighted error rate $\varepsilon_t$ in the current iteration; the lower the error rate, the greater the weight of the classifier:
$$\alpha_t = \frac{1}{2} \ln\frac{1 - \varepsilon_t}{\varepsilon_t}$$
According to this formula, if the classifier's error rate is less than 0.5, $\alpha_t$ is positive, indicating that the classifier contributes to the weighted combination. If the error rate equals 0.5, $\alpha_t$ is zero; if it is greater than 0.5, the classifier is performing poorly and $\alpha_t$ is negative. In this case, its performance can be optimized through further training or adjustments.
5. Update sample weights: Based on the current classification results and the weight of the weak classifier, the weight of each training sample is updated. The weight of a sample increases when it is misclassified and decreases when it is correctly classified:
$$w_i^{(t+1)} = w_i^{(t)} \exp\big(-\alpha_t\, y_i\, h_t(x_i)\big)$$
where $w_i^{(t)}$ is the weight of sample $x_i$ in round t, $\alpha_t$ is the weight of the weak classifier in round t, $y_i$ is the true label of sample $x_i$, and $h_t(x_i)$ is the t-th round weak classifier's prediction for sample $x_i$, with labels and predictions encoded as $\pm 1$. The updated sample weight $w_i^{(t+1)}$ indicates the importance of the sample in the next round of training. If a sample is misclassified, its weight increases, meaning that the sample is given higher priority when the next classifier is trained. Conversely, the weight of correctly classified samples decreases.
6. Incremental learning: (a) Dynamically update sample weights: each time data flows in, the sample weights are updated according to the misclassification of the new data. (b) Update weak classifiers: incremental Adaboost trains new weak classifiers based on the new data and the updated sample weights, and the weights of the weak classifiers are gradually adjusted to adapt to the new data distribution. (c) Retain memory of old data: even when new data arrives, incremental Adaboost retains the memory of the old data, ensuring a balance between the old and new data through the sample weighting mechanism.
7. Weighted combination of weak classifiers: After each iteration, Adaboost weights and combines the weak classifiers to form the final strong classifier. The final classification result is determined by the weighted predictions of all weak classifiers:
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right)$$
where $H(x)$ is the final classification result, $T$ is the total number of weak classifiers (number of iterations), $h_t(x)$ is the t-th round weak classifier's prediction for sample $x$, and $\alpha_t$ is the weight of the weak classifier in round t. The ensemble model thus makes its decision based on the prediction results of multiple weak classifiers, thereby improving classification performance.
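
The error rate, classifier weight, sample weight update, and weighted vote of this paragraph can be traced on a small example. The sketch below is illustrative only: it assumes labels and predictions encoded as -1/+1, and it renormalizes the sample weights after each update, which is common AdaBoost practice but is an assumption here. All function names are hypothetical.

```python
import math


def weighted_error(weights, y_true, y_pred):
    """Share of the total sample weight that falls on misclassified samples."""
    wrong = sum(w for w, y, p in zip(weights, y_true, y_pred) if y != p)
    return wrong / sum(weights)


def classifier_weight(err, eps=1e-10):
    """alpha = 0.5 * ln((1 - err) / err); err is clipped to keep alpha finite."""
    err = min(max(err, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - err) / err)


def update_weights(weights, alpha, y_true, y_pred):
    """Scale each weight by exp(-alpha * y * h(x)), then renormalize (assumption)."""
    new = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, y_true, y_pred)]
    total = sum(new)
    return [w / total for w in new]


def strong_predict(alphas, predictions):
    """Sign of the alpha-weighted vote; predictions[t][i] is h_t(x_i)."""
    n = len(predictions[0])
    scores = [sum(a * h[i] for a, h in zip(alphas, predictions)) for i in range(n)]
    return [1 if s >= 0 else -1 for s in scores]


y = [1, 1, -1, -1, 1]        # true labels
h1 = [1, 1, -1, -1, -1]      # one weak classifier, wrong on the last sample
w = [1 / len(y)] * len(y)    # uniform initial weights, 1/N each

err = weighted_error(w, y, h1)            # one of five equal weights is wrong
alpha = classifier_weight(err)            # 0.5 * ln(0.8 / 0.2) = ln 2
w_next = update_weights(w, alpha, y, h1)  # misclassified sample gains weight
vote = strong_predict([alpha], [h1])
```

After the update, the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.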

[0092] In summary, the incremental AdaBoost classifier's feature classification process can be summarized in the following steps: 1. Initialize sample weights: All sample weights are initialized to be equal. 2. Train weak classifiers: In each iteration, a weak classifier (linear regression, decision tree, SVM, XGBoost, or LightGBM) is selected and trained. 3. Calculate error rate and weights: The weights of the classifiers are calculated based on their error rates. 4. Update sample weights: The sample weights are updated based on the classifier's performance, increasing the weights of misclassified samples. 5. Incremental learning: When new data arrives, the classifier is updated based on the existing model, rather than retraining the model. 6. Weighted combination of weak classifiers: The outputs of all weak classifiers are weighted and combined to form the final strong classifier.
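
The six steps above can be sketched as a small incremental ensemble. This is a hedged illustration rather than the classifier of this invention: a one-feature threshold rule (`Stump`) stands in for the listed weak classifiers, labels are assumed to be -1/+1, and the rule that blends each existing classifier weight with its weight on the new batch (`blend`) is an illustrative assumption for "adjusting weak classifier weights". All class and function names are hypothetical.

```python
import math


def alpha_from_error(err, eps=1e-6):
    """Classifier weight 0.5 * ln((1 - err) / err); err is clipped to keep it finite."""
    err = min(max(err, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - err) / err)


def weighted_err(w, y, pred):
    """Share of the total sample weight that falls on misclassified samples."""
    return sum(wi for wi, yi, pi in zip(w, y, pred) if yi != pi) / sum(w)


class Stump:
    """Single-feature threshold classifier with outputs in {-1, +1}."""

    def fit(self, X, y, w):
        best = None
        for j in range(len(X[0])):
            for thr in sorted({row[j] for row in X}):
                for sign in (1, -1):
                    pred = [sign if row[j] >= thr else -sign for row in X]
                    err = weighted_err(w, y, pred)
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        _, self.j, self.thr, self.sign = best
        return self

    def predict(self, X):
        return [self.sign if row[self.j] >= self.thr else -self.sign for row in X]


class IncrementalAdaBoost:
    def __init__(self, rounds_per_batch=3, blend=0.5):
        self.rounds_per_batch = rounds_per_batch
        self.blend = blend
        self.learners = []  # earlier weak classifiers are kept as memory of old data
        self.alphas = []

    def decision(self, X):
        scores = [0.0] * len(X)
        for a, h in zip(self.alphas, self.learners):
            for i, p in enumerate(h.predict(X)):
                scores[i] += a * p
        return scores

    def predict(self, X):
        return [1 if s >= 0 else -1 for s in self.decision(X)]

    def partial_fit(self, X, y):
        n = len(X)
        uniform = [1.0 / n] * n
        # Adjust existing classifier weights toward their accuracy on the new batch.
        for t, h in enumerate(self.learners):
            a_new = alpha_from_error(weighted_err(uniform, y, h.predict(X)))
            self.alphas[t] = (1 - self.blend) * self.alphas[t] + self.blend * a_new
        # Start from weights that emphasize samples the current ensemble gets wrong.
        w = [math.exp(-yi * s) for yi, s in zip(y, self.decision(X))]
        total = sum(w)
        w = [wi / total for wi in w]
        # Add new weak classifiers without retraining the old ones.
        for _ in range(self.rounds_per_batch):
            h = Stump().fit(X, y, w)
            pred = h.predict(X)
            alpha = alpha_from_error(weighted_err(w, y, pred))
            self.learners.append(h)
            self.alphas.append(alpha)
            w = [wi * math.exp(-alpha * yi * pi) for wi, yi, pi in zip(w, y, pred)]
            total = sum(w)
            w = [wi / total for wi in w]
        return self


X1, y1 = [[0.1], [0.2], [0.8], [0.9]], [-1, -1, 1, 1]
X2, y2 = [[0.4], [0.5], [0.6], [0.7]], [-1, -1, 1, 1]

model = IncrementalAdaBoost().partial_fit(X1, y1)
before = model.predict(X2)  # the first-batch model misses the new positives
model.partial_fit(X2, y2)   # incremental update on the new batch only
```

In this toy run the second batch moves the decision boundary from 0.8 to 0.6 while the first-batch samples stay correctly classified, because the old weak classifiers are down-weighted rather than discarded.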

[0093] The initial dataset classification model is evaluated using the test set to obtain model performance evaluation results. The model performance evaluation process includes classification performance evaluation, residual analysis, robustness testing, and feature importance evaluation. Based on the model performance evaluation results, the initial dataset classification model is optimized to obtain a dataset classification model.

[0094] As shown in Figure 3, after model training is complete, the model's performance is evaluated on the test set. This mainly includes the following aspects: 1. Classification performance: Evaluate the model's classification effect on the test set using metrics such as accuracy, precision, recall, and F1-score. 2. Residual analysis: Visualize the deviation between the predicted results and the true values to evaluate the model's prediction accuracy. 3. Robustness testing: Add perturbations of different intensities to the test set to verify the model's adaptability to data fluctuations and noise. 4. Feature importance assessment: Explore which features contribute most to the classification task using the feature importance analysis provided by Adaboost.
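
The classification metrics and the robustness test above can be sketched as follows. The sketch assumes binary labels 0/1 and Gaussian noise as the perturbation, which the text does not specify. The helper names (`binary_metrics`, `perturb`, `robustness_curve`) and the toy threshold model are hypothetical.

```python
import random


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def perturb(X, sigma, rng):
    """Add zero-mean Gaussian noise of standard deviation sigma to every feature."""
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in X]


def robustness_curve(predict, X, y, sigmas, seed=0):
    """Accuracy of `predict` on the test set at each perturbation intensity."""
    rng = random.Random(seed)
    return {s: binary_metrics(y, predict(perturb(X, s, rng)))["accuracy"] for s in sigmas}


# Metrics on a small hand-made prediction vector.
m = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])

# Robustness of a toy threshold model under increasing noise.
def threshold_model(X):
    return [1 if row[0] >= 0.5 else 0 for row in X]

X_demo, y_demo = [[0.0], [0.1], [0.9], [1.0]], [0, 0, 1, 1]
curve = robustness_curve(threshold_model, X_demo, y_demo, sigmas=[0.0, 0.1, 1.0])
```

A model whose accuracy stays close to its unperturbed value as sigma grows would count as robust in the sense of item 3.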

[0095] This invention comprehensively analyzes and optimizes the model's performance by combining residual analysis, learning curve plotting, and feature importance assessment. Through the combination of dynamic incremental learning strategy and unbalanced optimization mechanism, it significantly improves the model's adaptability and generalization ability in dynamic data flow environments, while also improving the recognition effect of minority classes.

[0096] Step S30: Obtain the current dataset to be processed, input the current dataset to be processed into the dataset classification model, and output the dataset classification result.

[0097] Specifically, the current dataset to be processed is obtained and input into the dataset classification model; the dataset classification model is used to perform academic warning analysis on the current dataset to be processed to obtain the dataset classification result.

[0098] For example, in the process of early warning analysis of student academic data, the early warning results include pass or fail. However, student academic data is fluid (constantly updated) in the overall dataset, and the academic data with the early warning result of fail is a minority sample. Traditional data processing models are difficult to effectively deal with the problem of dynamic changes in categories in student academic data. Especially in imbalanced data scenarios, there is often a phenomenon of decreased model performance and weak ability to identify minority data, resulting in low efficiency and inaccurate analysis results for early warning analysis of student academic data.

[0099] As shown in Figure 4, once the dataset classification model is built, the current dataset to be processed (preferably student academic data in this invention) that needs early warning analysis can be input into the built model to analyze the students' academic warning status. The warning result is either pass or fail, and the student academic data corresponding to the fail result is the minority class; the dataset classification model built in this invention can therefore complete this warning analysis accurately and quickly. The data obtained is as follows: there are 104 samples in category 0 (i.e., majority class samples) and only 20 samples in category 1 (i.e., minority class samples). The overall accuracy is 0.95, the macro-average F1 value is 0.91, and the weighted-average F1 value is 0.95, indicating that the overall performance of the model is good. The identification ability of the minority class can be further optimized by data augmentation or adjusting the classification threshold.
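
The gap between the macro-average and weighted-average F1 values reflects the weaker minority class score. As a rough check, the per-class F1 values implied by the reported averages can be solved for, assuming both averages are taken over the same two classes with supports 104 and 20. Because the reported figures are rounded to two decimals, the result is only approximate, within a few hundredths. The function name is hypothetical.

```python
def per_class_f1_from_averages(macro_f1, weighted_f1, n0, n1):
    """Solve macro = (f0 + f1) / 2 and weighted = (n0*f0 + n1*f1) / (n0 + n1)."""
    total = n0 + n1
    s = 2 * macro_f1  # f0 + f1
    f0 = (weighted_f1 * total - n1 * s) / (n0 - n1)
    f1 = s - f0
    return f0, f1


f0, f1 = per_class_f1_from_averages(0.91, 0.95, 104, 20)
print(f"implied F1, category 0: {f0:.3f}; category 1: {f1:.3f}")
```

Under these assumptions the majority class F1 is about 0.97 and the minority class F1 about 0.85, which is consistent with the remark that minority class identification leaves room for further optimization.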

[0100] This invention proposes a dataset classification method based on imbalanced incremental learning. This method can dynamically adapt to datasets from different domains (without limiting the scope of the dataset, thus possessing universal applicability), solving the problems of imbalanced class distribution and severe forgetting during incremental learning. The method consists of three parts: data processing, model optimization, and evaluation, and each part supports user-defined extensions. Experimental results validate that this method performs excellently on multiple datasets, demonstrating its universality and scalability.

[0101] First, this invention uses a variational autoencoder to perform preliminary feature extraction on the data, retaining as much information as possible during feature generation and representation transformation. Second, it utilizes an incremental Adaboost classifier to classify the extracted features, dynamically adapting to newly added data categories and gradually updating the model to cope with constantly changing feature distributions and data flows. Finally, by combining an adaptive weight adjustment mechanism, it dynamically adjusts the weights of different categories in the loss function through real-time analysis of category changes in the current data flow, ensuring that the model does not excessively favor the majority class when dealing with imbalanced datasets and retains good minority class recognition capabilities.

[0102] In summary, to address the class imbalance and information loss problems faced by existing data processing methods in dynamic data environments, this invention proposes a novel data processing method based on imbalanced incremental learning. This method effectively copes with class evolution in data streams by dynamically adjusting the learning strategy, overcoming the inefficiency of traditional data processing methods when dealing with constantly changing data streams. This method combines variational autoencoders and incremental Adaboost classifiers, which not only improves the model's adaptability to new data but also optimizes the retention of old data class information.

[0103] Overall, this invention, by combining a variety of advanced machine learning techniques, not only improves the efficiency of processing imbalanced datasets, but also enhances the robustness of the model in the face of dynamic changes in data flow, exhibiting efficient learning capabilities and excellent class balancing performance.

[0104] This invention has a wide range of market applications, especially in real-time data analysis scenarios. Besides student academic performance monitoring, it can also be applied to real-time medical monitoring and abnormal behavior monitoring. Its significance lies in solving the problems of inefficiency and information loss in traditional methods under dynamic data environments, providing a more efficient and reliable solution for data processing and decision-making in real-time systems. By enhancing the ability to identify minority classes, this invention has profound market value and application prospects in improving the accuracy and stability of models.

[0105] Furthermore, as shown in Figure 5, based on the above-mentioned dataset classification method based on imbalanced incremental learning, the present invention also provides a dataset classification system based on imbalanced incremental learning, wherein the dataset classification system based on imbalanced incremental learning includes:

[0106] The preprocessing module 51 is used to obtain the original dataset and preprocess the original dataset to obtain the training set and the test set;

[0107] The model training module 52 is used to determine the unbalanced incremental learning model, and to train and optimize the unbalanced incremental learning model according to the training set and the test set to obtain a dataset classification model.

[0108] The model output module 53 is used to obtain the current dataset to be processed, input the current dataset to be processed into the dataset classification model, and output the dataset classification result.

[0109] Furthermore, as shown in Figure 6, based on the aforementioned dataset classification method and system based on unbalanced incremental learning, the present invention also provides a terminal, which includes a processor 10, a memory 20, and a display 30. Figure 6 only shows some components of the terminal; however, it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.

[0110] In some embodiments, the memory 20 may be an internal storage unit of the terminal, such as a hard disk or memory. In other embodiments, the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc. Further, the memory 20 may include both internal and external storage devices. The memory 20 is used to store application software and various types of data installed on the terminal, such as program code installed on the terminal. The memory 20 can also be used to temporarily store data that has been output or will be output. In one embodiment, the memory 20 stores a dataset classification program 40 based on unbalanced incremental learning, which can be executed by the processor 10 to implement the dataset classification method based on unbalanced incremental learning in this application.

[0111] In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or other data processing chip, used to run program code stored in the memory 20 or process data, such as executing the dataset classification method based on unbalanced incremental learning.

[0112] In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. The display 30 is used to display information on the terminal and to display a visual user interface.

[0113] In one embodiment, when the processor 10 executes the dataset classification program 40 based on unbalanced incremental learning in the memory 20, it implements the steps of the dataset classification method based on unbalanced incremental learning as described above.

[0114] The present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a dataset classification program based on unbalanced incremental learning, and the dataset classification program based on unbalanced incremental learning implements the steps of the dataset classification method based on unbalanced incremental learning as described above when executed by a processor.

[0115] In summary, this invention provides a dataset classification method, system, terminal, and computer-readable storage medium based on imbalanced incremental learning. The method includes: acquiring an original dataset and preprocessing the original dataset to obtain a training set and a test set; determining an imbalanced incremental learning model and training and optimizing the imbalanced incremental learning model based on the training set and the test set to obtain a dataset classification model; acquiring the current dataset to be processed and inputting the current dataset to be processed into the dataset classification model to output the dataset classification result. This invention obtains a training set and a test set by preprocessing the original dataset, and then trains and optimizes the imbalanced incremental learning model using the training set and the test set to obtain a dataset classification model. The dataset classification model improves the ability to identify minority class samples and, at the same time, analyzes the dataset quickly and accurately, effectively improving the classification efficiency and accuracy of the dataset.

[0116] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal that includes that element.

[0117] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The program can be stored in a computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The computer-readable storage medium can be a memory, magnetic disk, optical disk, etc.

[0118] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.

Claims

1. A dataset classification method based on unbalanced incremental learning, characterized in that the dataset classification method based on unbalanced incremental learning includes: obtaining an original dataset and preprocessing the original dataset to obtain a training set and a test set; determining an unbalanced incremental learning model, and training and optimizing the unbalanced incremental learning model based on the training set and the test set to obtain a dataset classification model; and obtaining a current dataset to be processed, inputting the current dataset to be processed into the dataset classification model, and outputting a dataset classification result.

2. The dataset classification method based on unbalanced incremental learning according to claim 1, characterized in that the step of obtaining the original dataset and preprocessing the original dataset to obtain the training set and the test set specifically includes: obtaining the original dataset and performing data cleaning on the original dataset to obtain a first dataset; determining whether the first dataset is a categorical feature; if so, using one-hot encoding to perform numerical conversion on the first dataset to obtain a second dataset; and dividing the second dataset into the training set and the test set.

3. The dataset classification method based on unbalanced incremental learning according to claim 2, characterized in that the step of determining whether the first dataset is a categorical feature further includes: if the first dataset is not a categorical feature, determining whether the first dataset is a numerical feature; if so, normalizing the first dataset using a standardization tool to obtain a third dataset; and dividing the third dataset into the training set and the test set.

4. The dataset classification method based on unbalanced incremental learning according to claim 1, characterized in that the unbalanced incremental learning model includes a variational autoencoder and an incremental adaptive adjustment classifier; the step of determining the unbalanced incremental learning model, and training and optimizing the unbalanced incremental learning model based on the training set and the test set to obtain the dataset classification model, specifically includes: inputting the training set into the variational autoencoder in the unbalanced incremental learning model, and performing feature extraction processing on the training set through the variational autoencoder to obtain the mean and variance of the latent variables; inputting the mean and variance of the latent variables into the incremental adaptive adjustment classifier, and performing classification training on the mean and variance of the latent variables through the incremental adaptive adjustment classifier to obtain an initial dataset classification model; evaluating the initial dataset classification model using the test set to obtain model performance evaluation results, wherein the model performance evaluation process includes classification performance evaluation, residual analysis, robustness testing, and feature importance evaluation; and optimizing the initial dataset classification model based on the model performance evaluation results to obtain the dataset classification model.

5. The dataset classification method based on unbalanced incremental learning according to claim 4, characterized in that the variational autoencoder includes an encoder and a decoder; the step of inputting the training set into the variational autoencoder in the unbalanced incremental learning model, and performing feature extraction processing on the training set through the variational autoencoder to obtain the mean and variance of the latent variables, specifically includes: inputting the training set into the encoder in the variational autoencoder, and performing data mapping and gradient descent optimization on the training set through the encoder to obtain latent variables; inputting the latent variables into the decoder, and performing remapping processing on the latent variables through the decoder to obtain reconstructed data; calculating the reconstruction error and the relative entropy between the training set and the reconstructed data, and weighting the reconstruction error and the relative entropy to obtain an overall loss function; and optimizing the variational autoencoder based on the overall loss function to obtain an optimized variational autoencoder, and outputting the mean and variance of the latent variables.

6. The dataset classification method based on unbalanced incremental learning according to claim 5, characterized in that the step of inputting the mean and variance of the latent variables into the incremental adaptive adjustment classifier, and performing classification training on the mean and variance of the latent variables through the incremental adaptive adjustment classifier to obtain the initial dataset classification model, specifically includes: initializing sample weights for the mean and variance of the latent variables to obtain initialized sample weights; determining multiple preset first classifiers, and training the multiple preset first classifiers according to the initialized sample weights to obtain multiple second classifiers; calculating the weighted error rate of each second classifier on the training set, and calculating the classifier weight of each second classifier based on the weighted error rate; obtaining the classification results of the multiple second classifiers, and updating the classifier weights according to the classification results and the weighted error rate to obtain an optimized incremental adaptive adjustment classifier; and obtaining the initial dataset classification model through the optimized variational autoencoder and the optimized incremental adaptive adjustment classifier.

7. The dataset classification method based on unbalanced incremental learning according to claim 1, characterized in that the step of obtaining the current dataset to be processed, inputting the current dataset to be processed into the dataset classification model, and outputting the dataset classification result specifically includes: obtaining the current dataset to be processed, and inputting the current dataset to be processed into the dataset classification model; and performing academic warning analysis on the current dataset to be processed through the dataset classification model to obtain the dataset classification result.

8. A dataset classification system based on unbalanced incremental learning, characterized in that the dataset classification system based on unbalanced incremental learning includes: a preprocessing module, used to obtain an original dataset and preprocess the original dataset to obtain a training set and a test set; a model training module, used to determine an unbalanced incremental learning model, and to train and optimize the unbalanced incremental learning model based on the training set and the test set to obtain a dataset classification model; and a model output module, used to obtain a current dataset to be processed, input the current dataset to be processed into the dataset classification model, and output a dataset classification result.

9. A terminal, characterized in that, The terminal includes: a memory, a processor, and a dataset classification program based on unbalanced incremental learning stored in the memory and executable on the processor. When the dataset classification program based on unbalanced incremental learning is executed by the processor, it implements the steps of the dataset classification method based on unbalanced incremental learning as described in any one of claims 1-7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a dataset classification program based on unbalanced incremental learning, which, when executed by a processor, implements the steps of the dataset classification method based on unbalanced incremental learning as described in any one of claims 1-7.