Metagenome feature selection method based on variable importance scores and the Neyman-Pearson test

A metagenomic feature selection method based on variable importance scores and the Neyman-Pearson test. It addresses the susceptibility of random forest importance scores to noise, and it is reported to achieve good stability, classification performance, and robustness, and to make follow-up medical verification experiments more convenient.

Active Publication Date: 2019-08-30
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

Metagenomic abundance data contain a large number of irrelevant features, and the random forest feature importance score is easily affected by noise.



Examples


Embodiment Construction

[0025] The present invention is further described below with reference to the accompanying drawings and an embodiment. This embodiment uses a liver cirrhosis (CIR) metagenomic dataset. The dataset was collected from the intestinal tract and contains 232 samples: 118 cirrhosis cases and 114 controls, covering 532 operational taxonomic units (OTUs).

[0026] Referring to Figure 1, the metagenome feature selection method based on variable importance scores and the Neyman-Pearson test includes the following steps:

[0027] Step A: For the cirrhosis operational taxonomic unit (OTU) dataset, use symmetric uncertainty to calculate the correlation between each microbial feature and the sample phenotype, rank the features by this score, select the top 200 as a feature subset, and generate a sub-dataset of the original dataset for subsequent analysis.
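The symmetric-uncertainty filter of Step A can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses the standard definition SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), and it assumes abundances are discretized by equal-width binning, which the source does not specify. The function names and the `bins` parameter are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def discretize(column, bins=5):
    """Equal-width binning. ASSUMPTION: the source does not say how
    continuous abundances are discretized before computing SU."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in column]

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    joint = entropy(list(zip(x, y)))   # H(X, Y)
    mutual_info = hx + hy - joint      # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)

def select_top_features(X, y, top_n=200, bins=5):
    """Return the column indices of the top_n features ranked by SU
    against the phenotype labels y. X is a list of sample rows."""
    scores = []
    for j in range(len(X[0])):
        column = discretize([row[j] for row in X], bins)
        scores.append((symmetric_uncertainty(column, y), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest SU first, ties by index
    return [j for _, j in scores[:top_n]]
```

With the embodiment's numbers, `select_top_features(X, y, top_n=200)` would reduce the 532 OTU columns to the 200 most phenotype-correlated ones.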

[0028] Step B: First draw a sample from the sub-dataset by sampling with replacement, then calculate the variable import...
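The paragraph above is truncated, but the abstract describes this step as repeated sampling with replacement, keeping the top k features by importance in each round, and counting how often each feature appears. A rough sketch of that loop is shown below. The importance function here is a deliberately simple stand-in (absolute difference of class means) so that the example runs with the standard library only; it is not the variable importance score used in the patent, which the summary associates with random forests. All names and defaults (`n_iter`, `seed`, and so on) are hypothetical.

```python
import random
from collections import Counter

def mean_difference_importance(X, y):
    """STAND-IN importance: |mean(class 1) - mean(class 0)| per feature.
    This is NOT the variable importance score described in the source."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        if not pos or not neg:
            scores.append(0.0)  # bootstrap sample contains a single class
        else:
            scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores

def bootstrap_topk_counts(X, y, k, n_iter=100,
                          importance=mean_difference_importance, seed=0):
    """Count how often each feature lands in the top k over n_iter
    bootstrap resamples of the rows of X (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    counts = Counter()
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]  # resample row indices
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        scores = importance(Xb, yb)
        top = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
        counts.update(top)
    return counts
```

Any scorer with the same `(X, y) -> list of scores` shape can be passed as `importance`, so a random forest importance could be substituted without changing the counting loop.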


Abstract

The invention discloses a metagenome feature selection method based on variable importance scores and the Neyman-Pearson test. The method comprises the following steps: 1, for an input metagenomic operational taxonomic unit dataset, calculating the correlation between each microbial feature and the sample phenotype by means of symmetric uncertainty, screening the features according to their correlation scores, and generating a sub-dataset; 2, sampling the sub-dataset with replacement, selecting the top k features by variable importance score, iterating these steps, and, after the iterations, counting the number of times each feature was selected; and 3, calculating a threshold for a given parameter by means of the Neyman-Pearson test, taking the features whose selection count is higher than the threshold as the candidate feature set, and taking the k features with the highest selection counts as the target feature subset. The method is reported to markedly improve classification results and to achieve higher stability, and the generated candidate feature set facilitates subsequent medical experiments on the metagenome.
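The thresholding in step 3 can be illustrated with a short sketch. The source does not state the null model or which parameter is given, so the following is an assumption throughout: under the null hypothesis that a feature is irrelevant, it is taken to enter the top k in each iteration with probability k/m (m being the number of features in the sub-dataset), so its selection count over N iterations is Binomial(N, k/m). For a binomial count, a Neyman-Pearson test at false-positive level `alpha` reduces to a threshold on the count. The function names and the `alpha` parameter are hypothetical.

```python
from math import comb

def binomial_tail(n, p, t):
    """P(X > t) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(t + 1, n + 1))

def np_threshold(n_iter, k, n_features, alpha=0.05):
    """Smallest count t with P(count > t | irrelevant feature) <= alpha.
    ASSUMPTION: null selection probability per iteration is k / n_features."""
    p0 = k / n_features
    for t in range(n_iter + 1):
        if binomial_tail(n_iter, p0, t) <= alpha:
            return t
    return n_iter

def select_features(counts, n_iter, k, n_features, alpha=0.05):
    """Return (threshold, candidate set, target subset).
    Candidates are features counted more often than the threshold;
    the target subset is the k candidates with the highest counts."""
    t = np_threshold(n_iter, k, n_features, alpha)
    candidates = [f for f, c in counts.items() if c > t]
    candidates.sort(key=lambda f: (-counts[f], f))
    return t, candidates, candidates[:k]
```

Here `counts` is a mapping from feature index to selection count, such as the tally produced by the iteration in step 2. A smaller `alpha` raises the threshold and yields a shorter candidate list.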

Description

Technical field

[0001] The invention belongs to the field of metagenomic abundance data analysis, and in particular relates to a metagenome feature selection method based on variable importance scores and the Neyman-Pearson test.

Background

[0002] The core problem of metagenomic abundance data analysis is to effectively identify, from a large number of microorganisms, the small number that potentially affect a phenotype. Such analysis is widely used in medicine, biology, environmental science, food science, and other disciplines. A metagenomic dataset contains a wide variety of microorganisms, and directly studying the effect of every microorganism on the phenotype takes a great deal of work. It is therefore necessary to remove the noise in the original dataset and keep only the microorganisms that may affect the phenotype, that is, to perform feature selection on the metagenomic data.

[0003] In order to effectively ide...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/00, G16B25/00
CPC: G16B25/00, G16B40/00
Inventors: Song Yonghong (宋永红), Ding Zhiwen (丁志文), Zhang Yuanlin (张元林)
Owner: XI AN JIAOTONG UNIV