Gene selection method based on feature discrimination and independence

A gene selection and identification technology, applied to special-purpose data processing applications, instruments, electrical digital data processing, etc., that addresses problems such as overfitting on small-sample data sets and high time overhead.

Inactive Publication Date: 2017-09-22
SHAANXI NORMAL UNIV


Problems solved by technology

[0005] The Wrapper method depends on the learning process: feature subsets are generated according to the performance, on a validation set, of a classification model built from the corresponding feature subset. It generally selects feature subsets with better performance and smaller size than the Filter method, but the classification model must be trained many times, which incurs a large time overhead and is prone to "overfitting" on small-sample data sets.



Examples


Embodiment 1

[0067] In this embodiment, the feature selection method based on feature discrimination and independence is implemented through the following steps:

[0068] (1) Randomly generate a first-class data set D1 following a normal distribution, denoted D1 = {X1; X2; …; X10} ∈ R^(10×50), and a second-class data set D2 following a normal distribution, denoted D2 = {X11; X12; …; X20} ∈ R^(10×50). Data sets D1 and D2 each contain 10 samples, and each sample has 50 features. Merge D1 and D2 into a data set D, expressed as D = {X1; X2; …; X20} ∈ R^(20×50), which contains 20 samples distributed over 2 classes, each sample containing 50 features. Then use the bootstrap method to partition D into a training set and a test set.
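Step (1) can be sketched as follows. The embodiment only says the data follow a normal distribution, so the class means (0 and 1) and the random seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two normally distributed classes, 10 samples x 50 features each.
# The means 0 and 1 are illustrative; the embodiment does not fix them.
D1 = rng.normal(loc=0.0, scale=1.0, size=(10, 50))   # first class
D2 = rng.normal(loc=1.0, scale=1.0, size=(10, 50))   # second class
D = np.vstack([D1, D2])                              # 20 samples x 50 features
y = np.array([0] * 10 + [1] * 10)                    # class labels

# Bootstrap partition: draw n indices with replacement for the training
# set; the out-of-bag samples form the test set.
n = D.shape[0]
boot = rng.integers(0, n, size=n)
oob = np.setdiff1d(np.arange(n), boot)
X_train, y_train = D[boot], y[boot]
X_test, y_test = D[oob], y[oob]
```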

[0069] (2) Calculate the discrimination degree of each feature

[0070] (2.1) Use the Wilcoxon rank-sum test to calculate the weight w_i of each feature in data set D; specific...
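The text of step (2.1) is truncated, so the following is only a plausible sketch: it takes the magnitude of the two-sample Wilcoxon rank-sum statistic for each feature as its weight w_i, using `scipy.stats.ranksums`:

```python
import numpy as np
from scipy.stats import ranksums

def wilcoxon_weights(X, y):
    """Weight w_i for each feature: absolute value of the Wilcoxon
    rank-sum statistic between the two classes. This reading of step
    (2.1) is an assumption, since the patent text is truncated here."""
    classes = np.unique(y)
    w = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        stat, _ = ranksums(X[y == classes[0], i], X[y == classes[1], i])
        w[i] = abs(stat)
    return w
```

A feature whose values barely overlap between the two classes gets a large weight; a feature with identical class distributions gets a weight near zero.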

Embodiment 2

[0089] In step (2) of this embodiment, the weight w_i of each feature in data set D can instead be computed by the D-Score method. D-Score is a feature-weighting method based on intra-class and inter-class distances. The specific calculation formula is as follows:

[0090] D_i = [ Σ_{j=1}^{c} (x̄_i^(j) − x̄_i)² ] / [ Σ_{j=1}^{c} (1/(n_j − 1)) Σ_{v=1}^{n_j} (x_{v,i}^(j) − x̄_i^(j))² ]

[0091] Here, D_i denotes the D-Score value of the i-th feature in data set D, i.e., the weight of the i-th feature; c is the number of classes in the data set; x̄_i and x̄_i^(j) are the mean values of the i-th feature over the entire data set and over the j-th class, respectively; n_j is the number of samples in the j-th class; and x_{v,i}^(j) is the value of the i-th feature of the v-th sample point in the j-th class.
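The D-Score ratio of inter-class to intra-class scatter can be sketched as below. Since the published formula image is unavailable here, the exact normalization (unbiased within-class variance) is an assumption consistent with the symbol definitions in [0091]:

```python
import numpy as np

def d_score(X, y):
    """D-Score weight per feature: between-class scatter of the feature
    means divided by the summed within-class variances (an F-score-style
    ratio; the exact normalization is an assumption)."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        num += (class_mean - overall_mean) ** 2
        # unbiased within-class variance of each feature
        den += ((Xc - class_mean) ** 2).sum(axis=0) / (len(Xc) - 1)
    return num / den
```

A feature whose class means are far apart relative to its within-class spread scores high; a feature with identical class means scores zero.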

[0092] The other steps are identical to those of Embodiment 1.

Embodiment 3

[0094] In step (2) of this embodiment, the weight w_i of each feature in data set D can instead be computed by a method based on mutual information. Mutual information evaluates the correlation between two features, or between a feature and the class labels. The calculation formula is as follows:

[0095] I(f_i, Y) = H(Y) − H(Y|f_i)

[0096] Here, Y denotes the class-label vector of the data set; I(f_i, Y) denotes the mutual information between feature f_i and the class-label vector Y, i.e., the weight of feature f_i; H(Y) is the information entropy of Y; and H(Y|f_i) is the conditional entropy of Y given the value of feature f_i.

[0097] Continuous features need to be discretized in advance.
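The formula in [0095], including the preliminary discretization of a continuous feature, can be sketched as follows; the equal-width binning with 5 bins is an illustrative choice, since the patent does not fix a discretization scheme:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H of a discrete label vector, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(f, y, bins=5):
    """I(f_i, Y) = H(Y) - H(Y|f_i) for one feature f, discretizing a
    continuous feature into equal-width bins first (bins=5 is an
    illustrative assumption)."""
    edges = np.histogram_bin_edges(f, bins=bins)
    f_disc = np.digitize(f, edges[1:-1])  # inner edges define the bins
    h_y_given_f = 0.0
    for v in np.unique(f_disc):
        mask = f_disc == v
        h_y_given_f += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_f
```

A feature that perfectly separates two balanced classes yields I(f_i, Y) = H(Y) = 1 bit; an uninformative feature yields a value near zero.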

[0098] The other steps are identical to those of Embodiment 1.


Abstract

The invention relates to a feature selection method based on feature discrimination degree and independence, and to its application. The method comprises the following steps: compute the importance of each feature, using the feature discrimination degree to measure its ability to distinguish between classes and the feature independence to measure its correlation with other features; sort the features by importance in descending order; and select the top k most important features to form a feature subset with high class-discrimination performance. Applied to tumor gene expression profile data, the selected differentially expressed gene subsets achieve good runtime and class-discrimination performance. The method is easy to compute, has low time complexity and high selection efficiency, and provides a useful reference for the clinical diagnosis of tumors and other diseases.
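The selection step described in the abstract can be sketched as a greedy ranking: take features in descending order of their discrimination weight, skipping any candidate that is too correlated with an already-selected feature. Using absolute Pearson correlation with a threshold of 0.95 as the independence test is our assumption; the abstract only says that independence measures the correlation between features:

```python
import numpy as np

def select_top_k(X, w, k, corr_thresh=0.95):
    """Greedy sketch: rank features by discrimination weight w and keep
    the top k, rejecting candidates whose absolute Pearson correlation
    with any already-selected feature exceeds corr_thresh (the threshold
    and the correlation measure are illustrative assumptions)."""
    order = np.argsort(w)[::-1]  # indices by descending weight
    selected = []
    for i in order:
        if len(selected) == k:
            break
        if any(abs(np.corrcoef(X[:, i], X[:, j])[0, 1]) > corr_thresh
               for j in selected):
            continue  # too dependent on an already-selected feature
        selected.append(i)
    return selected
```

For example, if feature 1 is an exact copy of feature 0, it is skipped even though its weight is nearly as high, and the next independent feature is chosen instead.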

Description

technical field

[0001] The present invention relates to a feature selection method based on feature discrimination degree and independence and to its application to tumor gene expression profile data, and specifically to gene selection methods for gene expression profiles in the technical field of preprocessing for bioinformatics tumor expression-profile gene data mining and analysis.

Background technique

[0002] The emergence of high-dimensional data with large numbers of redundant and irrelevant features has brought great challenges to machine learning and data mining algorithms. On the premise of preserving the classification ability of the data, feature selection chooses, from the original feature set, features that are strongly related to the class, are as uncorrelated with each other as possible, and carry most or all of the classification information of the original feature set, to form a feature subset. A classification model built on such a feature subset is more accurate and more interpretable,...

Claims


Application Information

Patent Type & Authority Patents(China)
IPC(8): G06F19/24, G06F19/20
CPC: G16B25/00, G16B40/00
Inventor 谢娟英, 王明钊
Owner SHAANXI NORMAL UNIV