Unbalanced data set preprocessing method based on neighborhood information

A technology relating to neighborhood information and data sets, applied in the fields of electrical digital data processing, special data processing applications, and digital data information retrieval. It addresses problems such as class imbalance, unsatisfactory classification results, and low recall for minority classes, with the aim of achieving higher precision, expanding information expression ability, and reducing the likelihood of over-fitting.

Pending Publication Date: 2021-05-25
ANHUI NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0003] Data identification and classification are widely used in real life, and many mature classification algorithms have emerged. Most of these traditional algorithms classify balanced data sets well, but in practical applications the vast majority of data sets are unbalanced. A data set in which the number of samples of one class is significantly smaller than the number of samples of the other classes is called an unbalanced data set; common examples include medical diagnosis data sets, network identification data sets, and harassment interception data sets. Traditional classification methods favor overall classification accuracy over the minority class. When the data distribution is uneven and the minority class overlaps with the majority class, a traditional classification algorithm assigns the samples near the classification hyperplane directly to the majority class in order to improve overall accuracy, which greatly reduces the recall of the minority class. Although the minority class has little influence on overall accuracy, it is extremely important. For example, suppose a data set contains 1000 samples, 999 from the majority class and 1 from the minority class. If a traditional classification method assigns every sample to the majority class, the overall classification accuracy reaches 99.9%, but the error rate on the minority class is 100%.
With the growing use of machine learning in real life, people place higher demands on the recognition rate of minority classes in unbalanced data sets. At the same time, traditional classification methods have various defects when identifying and classifying unbalanced data sets and cannot provide satisfactory classification results. Research on classification methods for unbalanced data sets is therefore an urgent problem with important theoretical and practical significance.
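The 999-to-1 example above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the patent text: the function name `accuracy_and_minority_recall` and the label encoding (0 for the majority class, 1 for the minority class) are assumptions made for illustration.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label=1):
    """Return (overall accuracy, recall on the minority class)."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    minority_total = sum(t == minority_label for t, _ in pairs)
    minority_hits = sum(t == minority_label and p == minority_label for t, p in pairs)
    recall = minority_hits / minority_total if minority_total else float("nan")
    return accuracy, recall


# 999 majority samples (label 0) and 1 minority sample (label 1).
y_true = [0] * 999 + [1]
# A classifier that assigns every sample to the majority class.
y_pred = [0] * 1000

acc, rec = accuracy_and_minority_recall(y_true, y_pred)
print(f"accuracy={acc:.3f}, minority recall={rec:.1f}")
# prints: accuracy=0.999, minority recall=0.0
```

The overall accuracy looks excellent while the only minority sample is missed, which is why accuracy alone is a poor measure on unbalanced data.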

Examples


Embodiment Construction

[0033] In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described in detail below in conjunction with specific embodiments.

[0034] It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of the present specification have the ordinary meanings understood by those skilled in the art to which the present disclosure belongs. "First", "second" and similar words used in one or more embodiments of the present specification do not indicate any order, quantity or importance, but are only used to distinguish different components. "Comprising", "including" and similar words mean that the elements or items appearing before the word cover the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanica...

Abstract

The invention provides an unbalanced data set preprocessing method based on neighborhood information. The method comprises the following steps: S1, randomly selecting a minority class sample in a target unbalanced data set; S2, constructing a hypersphere-shaped neighborhood centered on the selected minority class sample; S3, judging whether the constructed neighborhoods together contain all minority class samples, and if so, executing step S4, otherwise returning to step S1; S4, determining the weight of each neighborhood according to the number of minority class samples it contains; S5, determining the number of new samples to be synthesized in each neighborhood; and S6, synthesizing new samples for each neighborhood. The preprocessing method constrains the region of the synthesized samples through detection in geometric space, and uses a weighted synthesis strategy to apply different sampling rates to samples that contribute differently to subsequent classification, so that identification and classification by traditional algorithms become more convenient and more precise.
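Steps S1 to S6 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the patented procedure: the abstract does not specify the radius rule, the weight formula, or the synthesis rule. The sketch assumes that (a) the radius is the distance from the center to its nearest majority sample, (b) centers are drawn from minority samples not yet covered, so the loop terminates, (c) weights are proportional to the minority count in each neighborhood, and (d) new samples are interpolated between the center and a minority member of the same neighborhood, which keeps them inside the hypersphere. All function names are hypothetical.

```python
import math
import random


def build_neighborhoods(minority, majority, rng):
    """S1-S3: add hyperspheres until every minority sample is covered."""
    uncovered = set(range(len(minority)))
    neighborhoods = []
    while uncovered:  # S3: repeat until all minority samples are covered
        c = rng.choice(sorted(uncovered))  # S1 (assumption: draw from uncovered samples)
        center = minority[c]
        # S2 (assumption): radius reaches the nearest majority sample.
        radius = min(math.dist(center, m) for m in majority)
        members = [i for i, p in enumerate(minority) if math.dist(center, p) <= radius]
        neighborhoods.append((c, radius, members))
        uncovered -= set(members)
    return neighborhoods


def allocate(neighborhoods, n_new):
    """S4-S5: weights from minority counts, then per-neighborhood quotas."""
    counts = [len(members) for _, _, members in neighborhoods]
    total = sum(counts)
    weights = [k / total for k in counts]  # S4 (assumption: proportional weights)
    quotas = [int(w * n_new) for w in weights]  # S5: floor of each share
    # Hand the rounding remainder to the heaviest neighborhoods.
    by_weight = sorted(range(len(quotas)), key=lambda i: -weights[i])
    for i in by_weight[: n_new - sum(quotas)]:
        quotas[i] += 1
    return weights, quotas


def synthesize(minority, neighborhoods, quotas, rng):
    """S6 (assumption): interpolate between the center and a neighborhood member."""
    new_samples = []
    for (c, _radius, members), quota in zip(neighborhoods, quotas):
        center = minority[c]
        for _ in range(quota):
            other = minority[rng.choice(members)]
            t = rng.random()
            new_samples.append(tuple(a + t * (b - a) for a, b in zip(center, other)))
    return new_samples


rng = random.Random(0)
minority = [(0.0, 0.0), (0.5, 0.2), (4.0, 4.0)]
majority = [(2.0, 2.0), (2.5, 1.5), (6.0, 6.0), (1.5, 2.5), (5.0, 6.5), (3.0, 3.2)]

hoods = build_neighborhoods(minority, majority, rng)
weights, quotas = allocate(hoods, n_new=3)
synthetic = synthesize(minority, hoods, quotas, rng)
```

Because each synthetic point lies on a segment that starts at the center and ends at a member within the radius, it cannot fall outside the hypersphere. That is one simple way to realize the geometric constraint the abstract describes; the patent's actual detection rule may differ.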

Description

Technical field

[0001] The invention relates to the technical field of data mining and data preprocessing, in particular to a method for preprocessing unbalanced data sets based on neighborhood information.

Background art

[0002] With the rapid development of the mobile Internet of Things in the information age, the amount of data in various industries and fields has grown explosively. How to identify the small amount of truly meaningful data within this mass of data has become both a hotspot and a difficulty in data mining and machine learning.

[0003] Data identification and classification are widely used in real life, and many mature classification algorithms have emerged. Most of these traditional algorithms classify balanced data sets well, but in practical applications the vast majority of data sets are unbalanced. The number of samples of a certain type in a dataset is significantly less than the n...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/2458, G06K9/62
CPC: G06F16/2465, G06F18/24143
Inventors: 郭威, 王再见, 赵仁习
Owner: ANHUI NORMAL UNIV