
A Feature Selection Method Based on Distribution Shift Dataset

A feature selection method for distribution drift data sets, applied in the field of machine learning. It addresses the problem that selected feature subsets or feature ranking lists become invalid under distribution drift, and it improves the operating efficiency and effectiveness of the model.

Active Publication Date: 2019-03-05
上海晶赞企业管理咨询有限公司
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

With traditional feature selection methods, whether filter or wrapper, the feature subset or feature ranking list selected on a distribution drift data set becomes invalid because the data distribution drifts.



Examples


Embodiment 1

[0042] The present invention is a feature selection method for distribution drift data sets. It takes a distribution drift data set and a feature candidate set as input, considers both the degree of correlation between features and labels and the degree to which features drift over time, and outputs the final feature candidate subset and a ranked feature list.

[0043] The feature selection method of the present invention is based on a feature evaluation index: the feature generalization ability effectiveness score (FGES). FGES is a new concept proposed by the present invention, and its calculation combines the feature correlation score (FRS) and the feature drift score (FSS). FRS measures the degree of correlation between a feature and the labels, or the importance of the feature; FSS measures the degree to which the feature distribution, or the joint distribution of feature and label, changes over time.

[00...

Embodiment 2

[0083] Calculate the feature generalization ability effectiveness score FGES:

[0084] Given a data set D and a feature candidate set F, where F={A, B, C, D, E, F, G, H, I, J}: for each feature in F, calculate the feature correlation score (FRS). This embodiment uses the "mutual information of features and labels" method to calculate FRS; the FRS of each feature is shown in the corresponding column of Table 1 below. For each feature in F, calculate the feature drift score (FSS). This embodiment uses the "KL distance of feature" method to calculate FSS; the FSS of each feature is shown in the corresponding column of Table 1 below. For each feature in F, calculate FGES using the fusion method FGES = log(FRS) / log(FSS); the FGES of each feature is shown in the corresponding column of Table 1 below.
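The three scores in this embodiment can be illustrated with a minimal sketch. It assumes discrete feature values, treats the "KL distance" as the KL divergence between a feature's value distribution in an earlier and a later time window, and adds a small smoothing constant `eps`; the function names, the windowing, and the smoothing are illustrative assumptions and are not taken from the patent text.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between a discrete feature and the labels."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def kl_divergence(old, new, eps=1e-9):
    """KL(P_old || P_new) over discrete feature values.

    `old` and `new` are the feature values seen in an earlier and a later
    time window (an assumed way of measuring drift). `eps` is an assumed
    smoothing term that keeps the ratio finite for unseen values.
    """
    values = set(old) | set(new)
    c_old, c_new = Counter(old), Counter(new)
    kl = 0.0
    for v in values:
        p = (c_old[v] + eps) / (len(old) + eps * len(values))
        q = (c_new[v] + eps) / (len(new) + eps * len(values))
        kl += p * math.log(p / q)
    return kl


def fges(frs, fss):
    """Fusion formula from the text: FGES = log(FRS) / log(FSS)."""
    if frs <= 0 or fss <= 0 or fss == 1:
        raise ValueError("FGES is undefined for FRS <= 0, FSS <= 0, or FSS == 1")
    return math.log(frs) / math.log(fss)


# Toy data (made up for illustration, not the values of Table 1).
feature_early = ["a", "a", "b", "b"]
labels_early = [1, 1, 0, 0]
feature_late = ["a", "a", "a", "b"]

frs = mutual_information(feature_early, labels_early)
fss = kl_divergence(feature_early, feature_late)
score = fges(frs, fss)
```

Note that the formula is undefined when either score is zero or when FSS equals 1, so a practical implementation would need some guard such as the one in `fges`; how the patent handles these cases is not stated in this excerpt.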

[0085] Table 1

[0086] Features

Embodiment 3

[0088] Distribution shift dataset filter feature selection method:

[0089] (1) Given data set D, feature candidate set F, the number of features to be selected N; in this embodiment, F={A, B, C, D, E, F, G, H, I, J}, N=4.

[0090] (2) Select a method to calculate the feature correlation score FRS of each feature in the feature candidate set F; in this embodiment, the "mutual information between features and labels" method is used to calculate FRS, and the specific values are listed in the corresponding column of Table 1.

[0091] (3) Select a method to calculate the feature drift score FSS of each feature in the feature candidate set F; in this embodiment, the "feature KL distance" method is used to calculate FSS, and the specific values are listed in the corresponding column of Table 1.

[0092] (4) Select a method to calculate the feature generalization ability effectiveness score FGES of each feature in the feature candidate set F; in this embodiment, t...
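The final ranking step of the filter method can be sketched as follows. The excerpt is cut off before it states how features are ordered by FGES, so the sort direction is exposed as a parameter (`higher_is_better`) rather than fixed; that parameter, the function name `select_top_n`, and the scores below are all hypothetical and are not the values of Table 1.

```python
def select_top_n(fges_scores, n, higher_is_better=True):
    """Rank features by FGES and return the first n feature names.

    fges_scores: dict mapping feature name -> FGES value.
    higher_is_better: assumed ranking direction; the source excerpt does
    not state it, so callers must choose.
    """
    # Sort by name first so that ties are broken deterministically,
    # then apply a stable sort by score.
    names = sorted(fges_scores)
    ranked = sorted(names, key=fges_scores.get, reverse=higher_is_better)
    return ranked[:n]


# Hypothetical FGES values for the ten candidate features, with N = 4.
scores = {
    "A": 0.31, "B": 0.74, "C": 0.12, "D": 0.58, "E": 0.45,
    "F": 0.93, "G": 0.27, "H": 0.66, "I": 0.05, "J": 0.81,
}
selected = select_top_n(scores, n=4)
print(selected)  # ['F', 'J', 'B', 'H']
```

The returned list is both the selected subset and its ranking, which matches the two outputs the method is described as producing.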



Abstract

The invention discloses a feature selection algorithm for distribution drift data sets, in two versions: a filter and a wrapper. The algorithm introduces a feature generalization effectiveness score (FGES) to address the feature drift problem. Given a data set D, a feature candidate set F, and a number N of features to select, it produces the N features that are most effective for the classification problem, together with their ranking. With this algorithm, filter and wrapper methods can still be used for feature selection when a machine learning classification algorithm faces a distribution drift data set, which improves the operating efficiency, scalability, and model performance of the classification algorithm.

Description

Technical field

[0001] The present invention relates to feature selection and feature ranking in the field of machine learning, and in particular to the distribution drift dataset filter feature selection method (DDFSF) and the distribution drift dataset wrapper feature selection method (DDFSW), both based on the feature generalization effectiveness score (FGES).

Background

[0002] In recent years, with the development of the big data industry, many industries have generated massive amounts of data, and data types, data scale, and data dimensions are all constantly expanding. To discover knowledge and value in large amounts of data, machine learning algorithms are increasingly used in industry. Besides the continuous growth in the number of data samples, the types and dimensions of data features are also growing rapidly, and feature dimensions can reach tens of millions or more. A large number of features will bring some problems to the ...


Application Information

Patent Type & Authority: Patents (China)
IPC (8): G06F16/28
CPC: G06F16/285
Inventors: 汤奇峰, 薛守辉
Owner: 上海晶赞企业管理咨询有限公司