Unbalanced data set preprocessing method based on neighborhood information

A preprocessing technique for unbalanced data sets based on neighborhood information, applied in fields such as electrical digital data processing, special data processing applications, and digital data information retrieval. It addresses class imbalance, unsatisfactory classification results, and low minority-class recall, with the aims of achieving higher precision, expanding the information expressed by the data, and reducing the likelihood of over-fitting.

Pending Publication Date: 2021-05-25

ANHUI NORMAL UNIV

Problems solved by technology

[0003] Data identification and classification are widely used in practice, and many mature classification algorithms have emerged. On balanced data sets, most of these traditional algorithms classify well, but in practical applications the vast majority of data sets are unbalanced. A data set in which the number of samples of one class is significantly smaller than the number of samples of the other classes is called an unbalanced data set; common examples include medical diagnosis data sets, network identification data sets, and harassment-interception data sets. Traditional classification methods favor overall classification accuracy over the minority class. When the data distribution is uneven and the minority class overlaps with the majority class, a traditional classification algorithm assigns the samples near the classification hyperplane directly to the majority class to improve overall accuracy, which greatly reduces the recall of the minority class. The minority class has little effect on overall accuracy, yet it is extremely important. For example, suppose a data set contains 1000 samples, 999 from the majority class and 1 from the minority class. If a traditional classification method assigns every sample to the majority class, the overall classification accuracy reaches 99.9%, but the error rate on the minority class is 100%.
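The arithmetic in the 1000-sample example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `accuracy_and_minority_recall` and the labels 0 (majority) and 1 (minority) are assumptions introduced here, not part of the patent text.

```python
def accuracy_and_minority_recall(y_true, y_pred, minority_label=1):
    """Return (overall accuracy, recall on the minority class)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    minority_total = sum(t == minority_label for t in y_true)
    minority_hits = sum(
        t == minority_label and p == minority_label
        for t, p in zip(y_true, y_pred)
    )
    recall = minority_hits / minority_total if minority_total else 0.0
    return accuracy, recall


# 999 majority samples (label 0) and 1 minority sample (label 1).
y_true = [0] * 999 + [1]
# A classifier that assigns every sample to the majority class.
y_pred = [0] * 1000

accuracy, recall = accuracy_and_minority_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, minority recall={recall:.1f}")
# prints "accuracy=0.999, minority recall=0.0"
```

The overall accuracy looks excellent while the single minority sample is missed, which is why accuracy alone is a poor measure on unbalanced data.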

With the growing popularity of machine learning in real-world applications, the recognition rate of the minority class in unbalanced data sets is held to higher requirements. Traditional classification methods have various defects when identifying and classifying unbalanced data sets and cannot provide satisfactory classification results. Research on classification methods for unbalanced data sets is therefore an urgent problem with both theoretical and practical significance.

Embodiment Construction

[0033] In order to make the purpose, technical solutions and advantages of the present disclosure clearer, the present disclosure will be further described in detail below in conjunction with specific embodiments.

[0034] It should be noted that, unless otherwise defined, the technical or scientific terms used in one or more embodiments of the present specification have the ordinary meanings understood by those skilled in the art to which the present disclosure belongs. "First", "second" and similar words used in one or more embodiments of the present specification do not indicate any order, quantity or importance, but are only used to distinguish different components. "Comprising" or "including" and similar words mean that the elements or items appearing before the word cover the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanica...

Abstract

The invention provides an unbalanced data set preprocessing method based on neighborhood information. The method comprises the following steps:
S1: randomly select a minority class sample in the target unbalanced data set;
S2: construct a neighborhood in the shape of a spatial hypersphere centered on the selected minority class sample;
S3: judge whether the constructed neighborhoods together contain all minority class samples; if yes, execute step S4, and if not, return to step S1;
S4: determine the weight of each neighborhood according to the number of minority class samples it contains;
S5: determine the number of new samples to be synthesized in each neighborhood;
S6: synthesize new samples for each neighborhood.
With this preprocessing method, the region in which new samples are synthesized is constrained through detection in geometric space, and a weighted synthesis strategy gives different sampling rates to samples that contribute differently to subsequent classification, which makes identification and classification by traditional algorithms easier and more precise.
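Steps S1 to S6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the abstract does not say how the hypersphere radius, the weights, or the synthesis rule are defined. Here the radius is assumed to be the distance from the center to its nearest majority sample, the weight is assumed to be the neighborhood's share of covered minority samples, new samples are assumed to be linear interpolations between the center and a minority sample in the same neighborhood, and S1 is restricted to not-yet-covered samples so the loop terminates. The function name `neighborhood_oversample` and all parameters are hypothetical.

```python
import math
import random


def neighborhood_oversample(minority, majority, n_new, seed=0):
    """Sketch of S1-S6; `majority` must be non-empty."""
    rng = random.Random(seed)
    uncovered = set(range(len(minority)))
    neighborhoods = []  # list of (center point, indices of minority members)

    while uncovered:  # S3: repeat until every minority sample is covered
        # S1: pick a minority sample at random (assumption: an uncovered one).
        center = minority[rng.choice(sorted(uncovered))]
        # S2: hypersphere around it (assumption: radius = distance to the
        # nearest majority sample).
        radius = min(math.dist(center, q) for q in majority)
        members = [i for i, p in enumerate(minority)
                   if math.dist(center, p) <= radius]
        neighborhoods.append((center, members))
        uncovered -= set(members)

    # S4: weight of each neighborhood (assumption: share of covered samples).
    total = sum(len(m) for _, m in neighborhoods)
    weights = [len(m) / total for _, m in neighborhoods]

    # S5: samples per neighborhood, using largest-remainder rounding so the
    # counts add up to n_new.
    exact = [w * n_new for w in weights]
    counts = [int(x) for x in exact]
    order = sorted(range(len(exact)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[: n_new - sum(counts)]:
        counts[i] += 1

    # S6: synthesize inside each hypersphere (assumption: interpolate between
    # the center and a random member; a one-member neighborhood therefore
    # just repeats its center).
    synthetic = []
    for (center, members), k in zip(neighborhoods, counts):
        for _ in range(k):
            partner = minority[rng.choice(members)]
            t = rng.random()
            synthetic.append(
                tuple(c + t * (p - c) for c, p in zip(center, partner)))
    return synthetic, neighborhoods


minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (3.0, 3.0)]
majority = [(1.5, 1.5), (2.0, 0.5), (0.5, 2.0), (4.5, 4.5)]
new_points, hoods = neighborhood_oversample(minority, majority, n_new=8)
print(len(new_points), "synthetic samples from", len(hoods), "neighborhoods")
```

Because each synthetic point lies on a segment between the center and a member of the same hypersphere, it stays inside that hypersphere. This is one simple way to keep synthesized samples within a geometrically constrained region, and neighborhoods that contain more minority samples receive proportionally more new samples.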

Description

Technical field

[0001] The invention relates to the technical field of data mining and data preprocessing, in particular to a method for preprocessing unbalanced data sets based on neighborhood information.

Background

[0002] With the rapid development of the mobile Internet of Things in the information age, the amount of data in every industry and field has grown explosively. How to identify the small amount of truly meaningful data within massive data has become a hotspot and a difficulty of data mining in machine learning.

[0003] Data identification and classification are widely used in practice, and many mature classification algorithms have emerged. On balanced data sets, most of these traditional algorithms classify well, but in practical applications the vast majority of data sets are unbalanced. The number of samples of a certain type in a dataset is significantly less than the n...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/2458, G06K9/62

CPC: G06F16/2465, G06F18/24143

Inventor: 郭威, 王再见, 赵仁习

Owner: ANHUI NORMAL UNIV
