
Unbalanced data processing method for synthesizing minority class samples based on relationship between features

An unbalanced data processing technique that synthesizes minority class samples from the relationships between features, applied in the fields of instruments, character and pattern recognition, and computational models. It addresses three problems of existing resampling methods: ignoring the relationships between the underlying data features, over-fitting, and discarding informative samples. The stated effects are resolving the sample imbalance and improving classification accuracy.

Pending Publication Date: 2022-03-08
XIAN UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

In most cases, oversampling and undersampling do not consider the relationships between the underlying data features. Oversampling can lead to overfitting, while undersampling can discard highly informative samples.



Embodiment

[0049] An unbalanced data processing method for synthesizing minority class samples based on the relationships between features, as shown in Figure 1, specifically includes the following steps:

[0050] 1) Use a Pareto-based multi-objective feature selection algorithm to select multiple Pareto features with high objective values.

[0051] 1.1) Initialize the population and set the number of population iterations to 1000. Taking the glass data set as an example, the data has 9 features in total and an imbalance ratio of 3.18. In the multi-objective optimization process, the objective function is max S(c) = (F(D,c), G(D,c), A(D,c), P(c)), with 4 objective values in total: F(D,c) is the F_score value of the selected feature set on the unbalanced data set D, G(D,c) is the G_mean value of the selected feature set on D, A(D,c) is the AUC value of the selected feature set on D, and P(c) repres...
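The Pareto selection idea in step 1 can be illustrated with a minimal sketch. It is not the patent's algorithm: it only shows how non-dominated candidates are picked once each candidate feature subset has an objective vector. The subset names and scores are hypothetical, and because the definition of P(c) is cut off in the text above, P is simply treated here as a fourth value to be maximized, which is an assumption.

```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, ...]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly
    better on at least one (all objectives are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, Objectives]) -> List[str]:
    """Return the names of the candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    ]


# Hypothetical objective vectors (F_score, G_mean, AUC, P) for four
# candidate feature subsets; the numbers are made up for illustration.
candidates = {
    "subset_a": (0.80, 0.75, 0.85, 0.50),
    "subset_b": (0.78, 0.80, 0.83, 0.60),
    "subset_c": (0.70, 0.70, 0.80, 0.40),  # worse than subset_a everywhere
    "subset_d": (0.80, 0.75, 0.85, 0.50),  # identical to subset_a
}

front = pareto_front(candidates)
print(front)
```

In this sketch, candidates with identical objective vectors do not dominate each other, so both stay on the front. A full evolutionary search would repeat this selection over many generations (1000 iterations in the text above).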


Abstract

The invention relates to an unbalanced data processing method for synthesizing minority class samples based on the relationships between features. First, because redundant features in high-dimensional data can degrade the performance of a sampling algorithm when minority class data are sampled, a multi-objective Pareto front feature selection method is introduced. Three evaluation indexes, the AUC value, the F-score value and the G-mean value, are used as fitness functions, and the performance of the sampling algorithm is improved through continuous iterative optimization; the Pareto front features that score highly on all three indexes are selected. Then, feature sampling is performed on the Pareto front features of the minority class samples, using XGBoost regression to capture the relationships between the features. The quality of each new sample is considered while it is generated: an evaluation index, DIS, based on Euclidean distance is designed to measure the quality of the new samples. Finally, the newly generated samples are added to the original data set, multiple balanced sample sets are synthesized, and the final classification result is output by majority voting ensemble.
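The final step of the abstract, combining classifiers trained on several balanced sample sets by majority voting, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `majority_vote` and the prediction lists are hypothetical, and each inner list stands in for the labels predicted by one classifier trained on one balanced set.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine per-model label predictions sample by sample.

    predictions[m][i] is the label that model m assigns to sample i.
    On a tie, the label seen first among the models wins, because
    Counter.most_common keeps insertion order for equal counts.
    """
    if not predictions:
        return []
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("every model must predict the same number of samples")
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Hypothetical predictions from three classifiers, each trained on a
# different balanced sample set (1 = minority class, 0 = majority class).
model_1 = [1, 0, 0, 1, 0]
model_2 = [1, 1, 0, 0, 0]
model_3 = [0, 1, 0, 1, 0]

final_labels = majority_vote([model_1, model_2, model_3])
print(final_labels)
```

Using an odd number of models for a two-class problem avoids ties altogether, so the tie rule above only matters for even-sized ensembles or more than two classes.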

Description

Technical field

[0001] The invention relates to the technical fields of machine learning and data mining, in particular to an unbalanced data processing method for synthesizing minority class samples based on the relationships between features.

Background technique

[0002] With the continuous expansion of data availability in many large-scale, complex and networked systems such as medical care, security, the Internet and finance, how to process data intelligently and extract valuable information from it has become a research hotspot in both theory and application. Although existing knowledge discovery and data engineering techniques have achieved great success in many real-world applications, learning from unbalanced data remains a difficult challenge. Class imbalance is a common phenomenon across many domains, misclassifying the minority class is extremely costly, and the problem has attracted increasing attention from both academia and industry. ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06V10/764, G06V10/774, G06K9/62, G06N3/00
CPC: G06N3/006, G06F18/214, G06F18/24
Inventor: 潘晓英, 贾蓉, 张国鑫, 王昊
Owner: XIAN UNIV OF POSTS & TELECOMM