Method for achieving class data balance while preserving the local mean

This technology, which combines local-mean preservation with data balancing, belongs to the field of information technology. It addresses the problem that balancing can destroy the local consistency of data, and it achieves a clear class balance, preserves local consistency, and improves classification accuracy.

Status: Inactive
Publication Date: 2012-06-13
SHANDONG NORMAL UNIV
2 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

This type of method balances the data not by copying but by generating new data, which avoids the overfitt...

Examples


Embodiment 1

[0033] Example 1: Balancing medical image data for lung cancer diagnosis

[0034] Explanation: Among the lung images taken of patients, most are diagnosed as non-cancerous and only a few as cancerous. These images are first used to train a classification algorithm, which is then used to diagnose newly taken images. Before this, the amounts of the two types of data must be balanced; otherwise the diagnostic accuracy for cancer will be very low.

[0035] As shown in Figure 1, the data balancing process is as follows:

[0036] 1) Collect lung images of patients undergoing medical diagnosis and label them as non-lung-cancer or lung-cancer according to the doctor's diagnosis; the cancer image data form the minority class;

[0037] 2) Use a general-purpose software tool such as MATLAB to convert the medical images into multi-dimensional vector data, and calculate the ratio of the numbers of the two types of ima...
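The first part of step 2) can be sketched roughly as follows. This is an illustrative Python stand-in for the MATLAB tooling mentioned in the text, not the patent's own code. The function names `image_to_vector` and `class_counts` are hypothetical, images are assumed to be nested lists of pixel values, and taking the floor of the ratio is an assumption, since the source only says that an integer is derived from the ratio.

```python
def image_to_vector(image):
    """Flatten a 2-D image (a list of pixel rows) into one feature vector."""
    return [float(pixel) for row in image for pixel in row]


def class_counts(labels, minority_label):
    """Return (majority count, minority count, integer ratio).

    Assumption: the integer is the floor of majority / minority; the
    source does not say how the ratio is converted to an integer.
    """
    n_min = sum(1 for y in labels if y == minority_label)
    n_maj = len(labels) - n_min
    if n_min == 0:
        raise ValueError("no minority-class samples found")
    return n_maj, n_min, n_maj // n_min


if __name__ == "__main__":
    # Toy data: 8 non-cancer images (label 0) and 2 cancer images (label 1).
    images = [[[0, 1], [2, 3]] for _ in range(10)]
    labels = [0] * 8 + [1] * 2
    vectors = [image_to_vector(img) for img in images]
    print(len(vectors), len(vectors[0]))
    print(class_counts(labels, minority_label=1))
```

The integer ratio gives a rough target for how many minority samples are needed per original minority image to match the majority class.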

Embodiment 2

[0044] Example 2: Balancing DNA data for identification of abnormal DNA strands

[0045] Explanation: Most DNA images show normal chain structures, and only a few show abnormal chain structures. Labeling them manually is time-consuming and labor-intensive, so the task needs computer assistance. Before the labeled data are used to train the classification algorithm, the amounts of the two types of data must be balanced; otherwise, when the learned classification algorithm is applied to new images, its recognition accuracy for abnormal chain structures will be very low.

[0046] The data balancing process is as follows:

[0047] 1) Collect manually labeled DNA image data, with the labels divided into normal chain image data and abnormal chain image data;

[0048] 2) Balance the data using the same method as steps 2) to 5) of Embodiment 1;

[0049] 3) Subsequent processing: the above-mentioned class-balanced DNA data is f...

Embodiment 3

[0050] Example 3: Balancing web page data for spam page identification

[0051] Explanation: Most web pages are normal pages, and only a few are spam pages. Labeling them manually is time-consuming and labor-intensive, so the task needs computer assistance. Before the labeled data are used to train the classification algorithm, the amounts of the two types of manually labeled page data must be balanced. Otherwise, when the learned classification algorithm is applied to new pages, the accuracy with which a spam page is correctly identified will be low.

[0052] The data balancing process is as follows:

[0053] 1) Represent each web page manually labeled as normal or spam using the widely used VSM (vector space model), that is, represent each page by a vector;

[0054] 2) Balance the data using the same method as steps 3) to 5) of Embodiment 1;

[0055] 3) Subsequent processing: the ab...
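Step 1) of this example, representing each page as a vector, might look like the minimal sketch below. The source only names the vector space model. The use of raw term counts, whitespace tokenization, lowercasing, and the helper names `build_vocabulary` and `page_to_vector` are all assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(pages):
    """Collect the sorted set of terms seen across all pages."""
    return sorted({token for page in pages for token in page.lower().split()})


def page_to_vector(page, vocabulary):
    """Map one page to a term-count vector aligned with the vocabulary.

    Assumption: raw term frequency is used as the weight; the source
    does not specify the weighting (for example TF-IDF).
    """
    counts = Counter(page.lower().split())
    return [counts[term] for term in vocabulary]


if __name__ == "__main__":
    pages = ["buy cheap pills buy now", "local weather report"]
    vocabulary = build_vocabulary(pages)
    for page in pages:
        print(page_to_vector(page, vocabulary))
```

Every page is mapped onto the same vocabulary, so all vectors have the same length and can be passed directly to a neighbor search in the balancing steps.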


Abstract

The invention discloses a method for achieving class data balance while preserving the local mean, comprising the following steps: (1) acquire training data and identify the minority class; count the majority-class and minority-class data, and compute an integer from the ratio of the majority-class count to the minority-class count; (2) for each datum in the minority class, find its k neighbors within the minority class and generate new data by weighting the k neighbors; (3) for each datum, repeatedly generate new data by adjusting the weight parameters and taking a weighted sum of its k neighbors; (4) label the new data as minority class and merge them with the original data to obtain class-balanced two-class data; and (5) further process the balanced data, that is, train a classification algorithm and use it to classify new unlabeled data. The invention can improve the accuracy of medical diagnosis and the recognition rates of network attacks, server failures, spam pages, and the like.
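Steps (1) to (4) of the abstract can be sketched in Python as below. This is a hedged illustration, not the patented procedure. The abstract does not give the weight formula, so random positive weights normalized to sum to 1 are an assumption, as are the Euclidean distance, the floor of the class ratio, the choice to generate `ratio - 1` new points per minority datum, and the hypothetical name `local_mean_oversample`.

```python
import math
import random


def local_mean_oversample(minority, n_majority, k=3, seed=0):
    """Generate new minority points as weighted sums of k minority neighbors.

    Assumptions (not stated in the source): Euclidean distance, random
    weights normalized to sum to 1, and ratio - 1 new points per datum.
    """
    rng = random.Random(seed)
    n_min = len(minority)
    if not 1 <= k < n_min:
        raise ValueError("k must satisfy 1 <= k < number of minority points")
    ratio = n_majority // n_min          # integer ratio of the class sizes
    copies = max(ratio - 1, 0)           # new points needed per minority datum
    new_points = []
    for i, x in enumerate(minority):
        # k nearest neighbors of x inside the minority class, excluding x.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(copies):
            # Fresh weights each round stand in for "adjusting parameters".
            raw = [1.0 - rng.random() for _ in range(k)]   # each in (0, 1]
            total = sum(raw)
            weights = [r / total for r in raw]
            new_points.append([
                sum(w * p[d] for w, p in zip(weights, neighbors))
                for d in range(len(x))
            ])
    return new_points


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    new_points = local_mean_oversample(minority, n_majority=12, k=3)
    balanced_minority = minority + new_points   # step (4): merge with originals
    print(len(new_points), len(balanced_minority))
```

Because each new point is a convex combination of nearby minority points, it stays inside the local region those neighbors span, which is one way to read the local-consistency property the source describes.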

Description

Technical field

[0001] The invention relates to a method for achieving class data balance while preserving the local mean, and belongs to the field of information technology.

Background technique

[0002] In production and daily life, we need to process all kinds of data to find useful information in them, for example: analyzing large numbers of satellite images to determine locations for oil exploration; comparing large amounts of medical image data to determine whether a patient suffers from a certain disease; distinguishing normal access from malicious access in large volumes of network login and access information; finding abnormal information in large amounts of collected server health and operation data so that necessary measures can be taken; and finding abnormal structures among large numbers of DNA structures to uncover the underlying causes of different diseases. A large number of problems similar to the above appear,...


Application Information

IPC(8): G06F17/30
Inventors: 张化祥 (Zhang Huaxiang), 张悦童 (Zhang Yuetong)
Owner: SHANDONG NORMAL UNIV