Data classification method and system

A data classification technique applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as low modeling accuracy, reduced sample classification accuracy, and inaccurate estimation of regression coefficients.

Inactive Publication Date: 2011-05-18

HEFEI JOYIN INFORMATION TECH

2 Cites, 8 Cited by

Problems solved by technology

[0004] In view of this, the object of the present invention is to provide a data classification method and system that solve a problem in the prior art: local differences in the correlation between the key variable and the target variable cause inaccurate estimation of the regression coefficient, which in turn leads to low accuracy of the constructed model and reduced sample classification accuracy.

Examples

Embodiment 1

[0034] The flow chart of the data classification method provided by the embodiment of the present invention is shown in Figure 1 and includes:

[0035] S101: Calculate the correlation coefficient between each sample variable and a preset target variable, and, conditional on the other sample variables, the partial correlation coefficient between each sample variable and the target variable;

[0036] The formula for calculating the correlation coefficient is:

[0037] \varphi_{YX_1} = \sum_{i=1}^{N} (X_{1i} - \bar{X}_1)( ...
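The formula above is truncated in this listing, so the following is only a minimal sketch of step S101 under an assumption: that the correlation coefficient is the standard Pearson coefficient, which the visible leading terms resemble. The partial correlation shown is the standard first-order form with a single control variable. The function names `pearson` and `partial_corr` and the sample data are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical data: two sample variables and a binary target variable.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
y = [0, 0, 1, 1, 1]

r = pearson(x1, y)                  # correlation of x1 with the target
r_partial = partial_corr(x1, y, x2) # same, with x2 held fixed
print(r, r_partial)
```

With more than one control variable, the partial correlation would have to be computed recursively or from the inverse correlation matrix; the sketch does not attempt that.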

Embodiment 2

[0071] Figure 2 shows a flow chart of Embodiment 2 of the data classification method of the present invention, in which the sample variables in the original sample set need to be extracted and filled before the segmentation variables are selected. The second embodiment includes the following steps:

[0072] S201: Calculate the missing ratio of each sample variable in the original sample set, and select the sample variable that meets the missing ratio condition according to the missing ratio;

[0073] S202: Calculate the mean value of each selected sample variable that meets the missing ratio condition, and fill the missing values of that variable with its mean;

[0074] The missing ratio condition is that the missing ratio of the variable is not greater than 30%. Of course, the missing ratio condition is not fixed, and it is determined according to the specific situation of the missing variable of the sample. The follo...
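Steps S201 and S202 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the representation of missing values as `None`, the function names, and the sample data are assumptions, while the 30% threshold comes from the text above.

```python
def missing_ratio(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def select_and_fill(samples, max_missing=0.30):
    """Keep variables whose missing ratio is not greater than max_missing
    and fill their missing entries with the mean of the observed entries.

    samples: dict mapping variable name -> list of values (None = missing).
    """
    filled = {}
    for name, values in samples.items():
        if missing_ratio(values) > max_missing:
            continue  # variable fails the missing ratio condition (S201)
        present = [v for v in values if v is not None]
        mean = sum(present) / len(present)
        filled[name] = [mean if v is None else v for v in values]  # S202
    return filled

# Hypothetical sample set: "age" is 25% missing, "income" is 75% missing.
samples = {
    "age": [20.0, None, 40.0, 30.0],
    "income": [None, None, None, 5.0],
}
result = select_and_fill(samples)
print(result)
```

Because the text notes that the condition is not fixed, the threshold is exposed as the `max_missing` parameter rather than hard-coded.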

Embodiment 3

[0084] After modeling the training subsets one by one to generate a model describing the data, it is also necessary to judge the prediction effect of the model to determine whether the model has achieved the best prediction effect. Therefore, after the sample variables in the test subset are classified, the method also includes a process for judging the prediction effect of the model, as shown in Figure 3, including:

[0085] S301 to S311: the same as steps S201-S211 in the second embodiment;

[0086] S312: Judging whether the model has achieved the best prediction effect, if yes, execute S313, otherwise, execute S314;

[0087] Specifically, this step includes the following sub-steps, as shown in Figure 4:

[0088] S3121: Obtain the probability value that the target variable takes a value of 1 from the probability value calculated in step S311;

[0089] S3122: Merge the probability values, and sort from large to small according to the magnitude of the values;

[0090] For example: the pr...
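Steps S3121 and S3122 can be sketched as below, assuming that "merge" means pooling the probability values computed for the separate test subsets into one list. The function name, the sample identifiers, and the values are hypothetical. What the method does with the sorted values is cut off in this listing, so the sketch stops at the sort.

```python
def merge_and_sort(prob_lists):
    """Pool (sample_id, P(target = 1)) pairs from several subsets and
    sort them from the largest probability to the smallest."""
    merged = [pair for probs in prob_lists for pair in probs]
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Hypothetical probabilities from two test subsets.
subset_a = [("a1", 0.35), ("a2", 0.90)]
subset_b = [("b1", 0.60), ("b2", 0.10)]

ranked = merge_and_sort([subset_a, subset_b])
print(ranked)
```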

Abstract

The invention discloses a data classification method and system. The data classification method comprises the following steps: selecting a segmentation variable; segmenting and layering an original sample set according to the segmentation variable and a target variable to obtain training subsets and testing subsets; selecting a key variable in each training subset, calculating a regression coefficient, and modeling the training subsets one by one by applying a regression model according to the key variable and the regression coefficient, so as to generate a model describing the data; and substituting the sample variables in the testing subset into the model, calculating the probability value of each sample, and classifying the sample according to the probability value. Because the original sample set is segmented according to the segmentation variable before the key variable is selected, the local differences of the key variable are effectively eliminated, the modeling accuracy is improved, and the sample classification accuracy is further improved.
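The segment-then-model idea in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a single key variable per segment, fits the logistic regression by plain gradient descent, omits the split into training and testing subsets, and uses a 0.5 classification threshold. All function names, hyperparameters, and data are hypothetical.

```python
import math
from collections import defaultdict

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def fit_per_segment(rows):
    """rows: (segment, key_variable, target) triples; one model per segment."""
    groups = defaultdict(list)
    for seg, x, y in rows:
        groups[seg].append((x, y))
    return {seg: fit_logistic([x for x, _ in g], [y for _, y in g])
            for seg, g in groups.items()}

def classify(models, segment, x, threshold=0.5):
    """Return (probability of target = 1, predicted class) for one sample."""
    b0, b1 = models[segment]
    p = sigmoid(b0 + b1 * x)
    return p, int(p >= threshold)

# Hypothetical data: the key variable relates to the target in opposite
# directions in segments "A" and "B".
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys_a = [0, 0, 1, 0, 1, 1]
rows = [("A", x, y) for x, y in zip(xs, ys_a)]
rows += [("B", x, 1 - y) for x, y in zip(xs, ys_a)]

models = fit_per_segment(rows)
print(classify(models, "A", 2.0), classify(models, "B", 2.0))
```

In this toy data a single model fitted to the pooled rows would see the two opposite relationships cancel out, whereas the per-segment models each recover a coefficient with a clear sign.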

Description

Technical field

[0001] The invention relates to the technical field of data mining, in particular to a data classification method and system.

Background technique

[0002] The classification system is one of the main systems of data mining. It usually extracts key variables from the original sample set and calculates the regression coefficients using existing standard software such as SAS (Statistical Analysis System) and the simulation software MATLAB. A model is then built from the key variables and regression coefficients using the logistic regression model, and the user predicts the future development trend of the data according to the resulting model, so as to make correct operations according to the trend.

[0003] Due to the local difference in the correlation between the key variable extracted from the entire original sample set and the target variable, this local difference will cause the phenomenon of "positive and neg...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventor: 储晨

Owner: HEFEI JOYIN INFORMATION TECH
