169 results about "Class imbalance" patented technology

Class imbalance means that the classes in a classification problem are not represented equally, which is quite common in practice. Examples include fraud detection, prediction of rare adverse drug reactions, and prediction of gene families (e.g. Kinase, GPCR). Failure to account for the class imbalance often causes...

Unbalanced text classification method and system combining SVM and semi-supervised clustering

The invention discloses an unbalanced text classification method and system combining SVM and semi-supervised clustering. The method comprises the following steps: preprocessing the text to be processed to obtain text data in vector format, which serves as the data set; training an SVM classifier on the training set to obtain a classification model, and using the model to predict the category and confidence of the test set; clustering the data set with a semi-supervised clustering algorithm to obtain the category and confidence of the test set; and fusing the categories and confidences obtained by the SVM classifier and the semi-supervised clustering algorithm to produce the final output. The method combines different types of methods from the field of unbalanced text classification so that their advantages complement one another, and it uses vectorization and normalization. It overcomes the defect that, when high-dimensional sparse text data are processed, classification results are inaccurate because too few texts are labeled, and it effectively solves the problem of text class imbalance.
Owner:JIANGSU UNIV

Software defect prediction method based on two-stage wrapping-type feature selection

The invention discloses a software defect prediction method based on two-stage wrapping-type feature selection, and belongs to the field of software quality assurance. The method comprises the following steps: (1) mining the version control system and the defect tracking system of a software project, extracting program modules from them, and carrying out type marking and software measurement on the program modules to generate a defect prediction data set D; (2) carrying out two-stage wrapping-type feature selection on the data set D to remove as many redundant and irrelevant features as possible, and selecting an optimal feature subset FS' from the original feature set FS; and (3) preprocessing the data set D on the basis of the optimal feature subset FS' to form a preprocessed data set D', and constructing a defect prediction model with a decision tree classifier. The method can effectively identify and remove redundant and irrelevant features in the defect prediction data set, can effectively alleviate the class imbalance problem in that data set, and thereby improves the performance of the defect prediction model.
Owner:南京瑞沃软件有限公司

Multi-feature software defect comprehensive prediction method based on unbalanced noise set

Active | CN111782512A | Solve the problem that the measurement is not comprehensive enough | Reduce complexity | Character and pattern recognition | Software testing/debugging | Data set | Network structure
The invention discloses a multi-feature software defect comprehensive prediction method based on an unbalanced noise set. The method comprises the following steps: constructing an initial data set containing code features, development process features and network structure features; performing preliminary undersampling on the data set to reduce repeated data in the majority class; searching for a k nearest neighbor sample set for the metric elements in the data set through a tendency score matching method; denoising the data set through a threshold judgment on the k nearest neighbor samples; performing sample synthesis on the minority class in the data set and the minority class in the k nearest neighbor sample set to eliminate the class imbalance problem of the data set; and adaptively constructing a plurality of machine learning models and selecting the most suitable one to perform defect prediction on the new version of the software. The method solves the class imbalance problem that commonly exists in software defect prediction, and removes noise samples through noise discrimination based on tendency score matching.
Owner:北京高质系统科技有限公司

Data equalization method based on deep learning multi-weight loss function

The invention relates to a data equalization method based on a deep learning multi-weight loss function. The method comprises the following steps: first, obtaining a target image data set for training a deep learning model; determining the number of classes C and the size Ni of each class of samples from the target data set; determining the hyper-parameters [alpha] and [gamma] and a weighting coefficient Ci for the importance of each class of samples; determining a multi-weight loss function MWLfocal(z, y); and training the neural network model iteratively, using the multi-weight loss function for error calculation and the back propagation algorithm to update the weight parameters of the model until the network converges to the expected target, which completes the training. The loss function addresses imbalance in sample numbers and imbalance in classification difficulty across data classes at the same time, and can further improve the detection accuracy of key classes. The method can be applied to data sets with a data imbalance problem and effectively relieves the influence of class imbalance.
Owner:UNIV OF SCI & TECH OF CHINA
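
The abstract above names the ingredients of the multi-weight loss (class sizes Ni, importance coefficients Ci, and the hyper-parameters alpha and gamma) but does not give the formula. The following is a minimal sketch of one way such a loss could be assembled, assuming a standard focal-loss term multiplied by an inverse-frequency weight and the importance coefficient. The function names and the exact weighting scheme are assumptions, not the patented MWLfocal definition.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_focal_loss(z, y, class_sizes, importance, alpha=1.0, gamma=2.0):
    """Focal loss for one sample with two extra per-class weights.

    z           -- logits, one per class
    y           -- index of the true class
    class_sizes -- N_i, number of training samples in each class
    importance  -- C_i, importance coefficient of each class
    The inverse-frequency weight below is an assumed choice.
    """
    p = softmax(z)[y]
    n_total = sum(class_sizes)
    freq_weight = n_total / (len(class_sizes) * class_sizes[y])
    focal_term = (1.0 - p) ** gamma  # down-weights easy, well-classified samples
    return -alpha * importance[y] * freq_weight * focal_term * math.log(p)
```

With gamma set to 0, equal class sizes, and all importance coefficients equal to 1, this sketch reduces to ordinary cross-entropy, which is a quick sanity check on the weighting.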

Unbalanced-like network traffic classification method and device and computer equipment

The invention relates to the technical field of network traffic classification, and in particular to an unbalanced-like network traffic classification method and device and computer equipment. The method comprises the steps of obtaining the network traffic data to be classified and extracting features of the network traffic; deleting irrelevant and redundant features with a feature selection algorithm, and performing dimension reduction on the remaining features to select an optimal feature subset; and inputting the optimal feature subset into a weight-based multi-classifier, performing network traffic classification training in an incremental learning mode, optimizing classifier performance, and classifying the network traffic. To address the unbalanced distribution of network traffic samples, irrelevant and redundant features are deleted, and the recognition rate of small categories is effectively improved while the overall classification accuracy is maintained. Introducing incremental learning makes model update training more flexible and shortens the model update period, and using multiple weight-based classifiers reduces the influence of concept drift.
Owner:CHONGQING UNIV OF POSTS & TELECOMM

Evaluation method for performance influence degree of classification models by class imbalance

The invention relates to an evaluation method for the degree to which class imbalance influences the performance of classification models. The evaluation method comprises the following steps: (1) building a classification model base; (2) constructing new data sets; (3) predicting the new data sets with the classification models; (4) evaluating the performance of the classification models; and (5) assigning an influence degree level. First, typical classification algorithms in machine learning are used to build the classification model base. Second, a class-imbalanced data set is selected as a reference data set, a group of new data sets with gradually increasing imbalance ratio is built from it, and different classification models are selected to classify and predict each data set in the group. Finally, the coefficient of variation is used to evaluate how much the performance of each classification model varies, and the result is divided into levels. In this way the influence of class imbalance on the performance of different classification models is evaluated, which provides guidance for research on handling class imbalance. The evaluation method is broadly applicable across different classification models.
Owner:CHINA UNIV OF MINING & TECH
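
The final evaluation step in the entry above can be illustrated with a short sketch. It assumes each model has one performance score per data set in the group of increasing imbalance ratio, and it uses the population standard deviation divided by the mean as the coefficient of variation. The function names and the ranking helper are hypothetical, and the level thresholds used by the patent are not given in the abstract, so none are shown.

```python
import statistics


def coefficient_of_variation(scores):
    """Population standard deviation of the scores divided by their mean."""
    mean = statistics.fmean(scores)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(scores) / mean


def rank_by_sensitivity(results):
    """Order model names from most to least affected by class imbalance.

    results maps a model name to its scores on data sets of increasing
    imbalance ratio; a larger coefficient of variation means the model's
    performance changes more as the imbalance grows.
    """
    return sorted(
        results,
        key=lambda name: coefficient_of_variation(results[name]),
        reverse=True,
    )
```

For example, a model scoring 0.9, 0.5, 0.1 across the group would rank as more sensitive than one scoring 0.8 on every data set.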

Network intrusion detection model SGM-CNN based on class imbalance processing

To address the data class imbalance problem, the present invention provides an effective network intrusion detection model for data flows, SGM-CNN, based on the Synthetic Minority Over-Sampling Technique (SMOTE) and a Gaussian Mixture Model (GMM). According to the technical scheme, the method comprises the steps of first obtaining the network data flow to be identified; preprocessing the data flow; inputting the preprocessed data flow into a pre-established network intrusion detection model based on a one-dimensional convolutional neural network (1D CNN); and outputting a detection result for the network data flow. The invention provides a class imbalance processing technique for large-scale data, namely SGM. SGM first uses SMOTE to oversample the minority class samples, then uses GMM to perform clustering-based downsampling on the majority class samples, and finally balances the data of each class. The SGM method avoids the expensive time and space cost caused by oversampling, avoids losing important samples through random downsampling, and remarkably increases the detection rate of minority classes.
Owner:ZHENGZHOU UNIV
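
The oversampling half of SGM relies on SMOTE, which creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified, standard-library version of that interpolation step only; it is not the patent's implementation, the function name and parameters are illustrative, and the GMM-based downsampling of the majority class is not shown.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points from a list of minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Nearest minority neighbours of x, excluding x itself.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: math.dist(x, m))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + u * (b - a) for a, b in zip(x, neighbour)])
    return synthetic
```

In practice a library implementation would normally be preferred; the sketch is only meant to make the interpolation idea concrete.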

Advertisement click rate prediction framework and algorithm based on user behaviours

Inactive | CN108830416A | Predict interest drift in real time | Predicting the probability of interest drift in real time | Forecasting | Marketing | Feature extraction | High dimensional
The invention discloses an advertisement click rate prediction framework and algorithm based on user behaviours. ID features and other features are jointly converted, at different levels, into meaningful numerical features, which reduces feature sparsity and redundancy and improves feature expressiveness. To improve expressiveness further, feature selection and feature combination are carried out with a GBDT model, and high-dimensional features are processed with an LR model. Finally, to solve the class imbalance problem, the invention provides a down-sampling algorithm based on a K-means model. In the experimental process, features are first extracted from the original features; the features are then classified heuristically; perceptual features are input into the GBDT model for feature combination; and finally the rational features and the combined features are input into the LR model with a certain weight to predict the advertisement click rate. The experimental results show that the algorithm improves on both the RMSE and R2 metrics.
Owner:SICHUAN UNIV

Data resampling method based on repeated editing nearest neighbor and clustering oversampling

Inactive | CN110942153A | Solve the class imbalance problem | Improve classification effect | Machine learning | Distance matrix | Algorithm
The invention relates to a data resampling method based on repeated editing nearest neighbor and clustering oversampling. The method comprises the steps of: calculating the Euclidean distance between each sample to be resampled and its neighbouring samples, selecting the sample with the smallest distance as its nearest neighbour, comparing whether the labels of the sample and its nearest neighbour are the same, and deleting the sample if the labels differ; dividing the remaining samples into k clusters using K-means, and filtering out the clusters in which the ratio of the number of majority class samples to the number of minority class samples is less than an imbalance rate threshold c; calculating the Euclidean distances between the minority class samples in each cluster, constructing a distance matrix for the cluster, summing all off-diagonal elements of the matrix, and dividing the sum by the number of off-diagonal elements to obtain the average distance of the cluster; calculating a sparse factor for each cluster; and calculating a resampling weight for each cluster, and determining the number of new samples to generate from the weights using the SMOTE method. The method solves the class imbalance problem in the data so that the classifier can achieve a better classification effect.
Owner:NORTHWESTERN POLYTECHNICAL UNIV
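
The average-distance step in the entry above is fully specified by the abstract: build the pairwise Euclidean distance matrix of a cluster's minority samples, sum the off-diagonal elements, and divide by their count. A minimal sketch follows; the function name is illustrative, and the sparse factor and resampling weight are left out because the abstract does not define them.

```python
import math


def average_minority_distance(points):
    """Mean of the off-diagonal entries of the pairwise distance matrix.

    points is the list of minority-class samples in one cluster; at least
    two samples are needed for any off-diagonal element to exist.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two minority samples in the cluster")
    # Full symmetric distance matrix; the diagonal is all zeros.
    matrix = [[math.dist(a, b) for b in points] for a in points]
    off_diagonal_sum = sum(
        matrix[i][j] for i in range(n) for j in range(n) if i != j
    )
    return off_diagonal_sum / (n * n - n)
```

For two points at (0, 0) and (3, 4) the matrix has two off-diagonal entries of 5, so the average distance is 5.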

Flight delay early warning method based on evolutionary sub-sampling integrated learning

The invention discloses a flight delay early warning method based on evolutionary sub-sampling integrated learning and belongs to the technical field of airport flight delay early warning. The method comprises the following steps: first, discretizing the target attributes of the measured flight delay data sets and removing noise points to obtain standardized data sets; then, using an evolutionary sub-sampling method to sub-sample the majority class of the class-imbalanced data set T times, constructing T balanced training sets; using grid search to optimize the parameters of a classification and regression tree (CART) classifier on each balanced training set, generating the classifiers; and finally, determining an optimal integration mode to combine the classifiers into an integrated system, EUS-Bag, which is the flight delay early warning model. The early warning model can provide the air traffic management department with a decision-making basis for reasonable air traffic scheduling. The method is highly intelligent and effectively improves the accuracy and reliability of flight delay early warning.
Owner:NANJING UNIV OF AERONAUTICS & ASTRONAUTICS
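
The construction of T balanced training sets described above can be sketched as follows. Note the substitution: the patent uses an evolutionary sub-sampling method whose details are not in the abstract, so this sketch uses plain random sub-sampling of the majority class as a stand-in. The function and parameter names are hypothetical.

```python
import random


def balanced_training_sets(majority, minority, t, seed=0):
    """Build t balanced training sets for a bagging-style ensemble.

    Each set contains every minority sample plus an equally sized random
    draw (without replacement) from the majority class. Random sampling
    stands in for the evolutionary selection used by the patent.
    """
    if len(majority) < len(minority):
        raise ValueError("majority class must be at least as large as minority")
    rng = random.Random(seed)
    sets = []
    for _ in range(t):
        sampled_majority = rng.sample(majority, len(minority))
        sets.append(sampled_majority + list(minority))
    return sets
```

One classifier would then be trained on each of the returned sets and the classifiers combined into an ensemble.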

Forest fire danger grade determination method and system based on one-class SVM

The invention provides a forest fire danger grade determination method and system based on a one-class SVM. The method comprises the steps of: treating a day as a sample unit, and selecting the samples on which fires occurred as modeling samples according to fire data; acquiring the meteorological factors corresponding to the modeling samples; constructing a one-class SVM model based on those meteorological factors; constructing a forest fire danger occurrence probability model by mapping the range of distances from the samples to the center of the hypersphere in the one-class SVM model, which the model outputs as an intermediate result, onto [0, 1], and taking the mapping result as the forest fire danger occurrence probability; and calculating the forest fire danger occurrence probability of a sample to be detected, and determining its forest fire danger grade from that probability. The method and system effectively overcome the class imbalance problem caused by the concentration of forest fire samples and improve the accuracy of forest fire danger determination.
Owner:上海事凡物联网科技有限公司
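
The probability model in the entry above maps a sample's distance from the hypersphere center onto [0, 1]. The abstract does not state the mapping function or its direction, so the sketch below assumes a linear min-max mapping in which samples closer to the center (more similar to the fire-day modeling samples) receive a higher probability. The function name and both assumptions are illustrative only.

```python
def distance_to_probability(distance, d_min, d_max):
    """Linearly map a distance in [d_min, d_max] to a probability in [0, 1].

    Assumed direction: distance d_min (closest to the hypersphere center)
    maps to 1.0 and distance d_max maps to 0.0. Distances outside the
    interval are clipped to its ends.
    """
    if d_max <= d_min:
        raise ValueError("d_max must be greater than d_min")
    clipped = min(max(distance, d_min), d_max)
    return 1.0 - (clipped - d_min) / (d_max - d_min)
```

A danger grade could then be read off by thresholding the returned probability, with thresholds chosen by the user.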

Job shop real-time scheduling method based on PCA-XGBoost-IRF

The invention discloses a job shop real-time scheduling method based on PCA-XGBoost-IRF. The method comprises the steps of: 1, constructing standard data samples; 2, preprocessing the sample data by performing abnormal value processing, class imbalance processing and normalization, and segmenting the data set to meet the input requirements of decision model construction; 3, carrying out feature engineering on the training set, including feature extraction, feature importance calculation and feature selection; 4, constructing a decision model based on an improved random forest, including building the random forest (RF) model, improving it to obtain the IRF model, and optimizing the hyper-parameters of the IRF model by grid search; 5, training the PCA-XGBoost-IRF decision model with the optimal parameters; and 6, using the PCA-XGBoost-IRF decision model to select and decide dynamic job shop scheduling rules in real time. The invention provides a real-time scheduling method that is more reliable, more robust, and generalizes better, in support of data-driven intelligent scheduling research.
Owner:XINJIANG UNIVERSITY

A public opinion tendency identification method for training sample category distribution imbalance

The invention discloses a public opinion tendency identification method for training sample category distribution imbalance. The method comprises the steps of first collecting vocabulary related to the public opinion field of concern as public opinion hot words to create a lexicon; crawling a comment data set from a public opinion information source and dividing it into a training set and a test set; then manually classifying the public opinion tendency of the training set and, to address the class imbalance problem, supplementing it with a bootstrap learning method; and extracting features from each type of training sample, training an algorithm model with naive Bayes, a support vector machine, a decision tree and other algorithms, classifying the test set data with the trained model, and identifying public opinion tendency from the classification result. Bootstrap learning, feature vector construction and classification model training all apply a time-sensitive weighting method, so that the public opinion tendency they reflect is more timely. The method solves the problem of inaccurate classification caused by imbalanced training data, and improves both the accuracy of public opinion tendency identification and the timeliness of public opinion analysis.
Owner:WUHAN UNIV

Multi-class unbalanced remote sensing land cover image classification method based on integrated intervals

The invention discloses a multi-class unbalanced remote sensing land cover image classification method based on integrated intervals, and mainly solves the problem of low classification precision for unbalanced images in the prior art. The implementation scheme comprises the following steps: acquiring unbalanced training samples and pre-classifying them with a random forest classification algorithm; counting the votes for the pre-classified unbalanced training samples and establishing a vote-based integrated interval model; sorting the unbalanced training samples by sample count and integrated interval value, retaining the smallest class, randomly selecting samples from the remaining classes at an undersampling rate, and constructing new balanced training subsets; and inputting each new balanced training subset into a CART decision tree and generating an ensemble learning model through majority voting to obtain the final classification result for the unbalanced remote sensing image. Through the integrated interval model, the method effectively reduces the loss of useful information during classification, has strong noise resistance and high training speed, and can be used for land cover and environment monitoring.
Owner:XIDIAN UNIV