Genome-wide SNP (single nucleotide polymorphism) site analysis method based on combination of random forest and Relief-F

The method combines random forest with genome-wide analysis in the field of data processing. It addresses the difficulty of identifying rare SNP loci and the unsatisfactory results of single-locus identification, and achieves dimensionality reduction.

Active Publication Date: 2015-03-25
西安电子科技大学重庆集成电路创新研究院

AI Technical Summary

Problems solved by technology

[0004] (1) Most current GWAS models consider only the association between a single SNP locus and a complex disease, ignoring the fact that SNP loci can affect complex diseases by interacting with other SNP loci or with environmental factors.
[0005] (2) Rare SNP sites, whose frequency in the normal population is between 1% and 5%, are difficult to identify.

Examples

Detailed Description of the Embodiments

[0037] The present invention will be further described below in conjunction with the accompanying drawings. It should be noted that this embodiment is based on the technical solution and provides detailed implementation steps and specific operation methods, but the present invention is not limited to this embodiment.

[0038] Referring to Figure 1, the specific implementation steps of the present invention are as follows.

[0039] Step 1, preprocessing the SNP data:

[0040] If the sample data are in base-pair form (for example, AA), encode each SNP site as the number of copies of the minor allele: if the minor allele is a, the genotypes AA, Aa, and aa are coded as 0, 1, and 2, respectively. Remove the SNP sites whose minor allele frequency is less than the set value, which is set to 0.05. The purpose of removing the SNP sites whose minor allele frequency is less than the set value is to filter out the...
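The encoding and filtering in Step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes biallelic sites with no missing calls, and the names `encode_snp`, `preprocess`, `rs_demo1`, and `rs_demo2` are hypothetical.

```python
from collections import Counter

def encode_snp(genotypes):
    """Encode one SNP's genotype strings (e.g. "AG") as minor-allele counts 0/1/2.

    Returns (codes, maf). Assumes a biallelic site and no missing calls.
    """
    allele_counts = Counter(a for g in genotypes for a in g)
    if len(allele_counts) == 1:
        # Monomorphic site: there is no minor allele, so every code is 0.
        return [0] * len(genotypes), 0.0
    # The minor allele is the less frequent one (ties: first allele seen).
    minor, minor_n = min(allele_counts.items(), key=lambda kv: kv[1])
    maf = minor_n / sum(allele_counts.values())
    codes = [g.count(minor) for g in genotypes]
    return codes, maf

def preprocess(snp_table, maf_threshold=0.05):
    """Encode every SNP and keep those whose MAF is not below the threshold."""
    kept = {}
    for name, genotypes in snp_table.items():
        codes, maf = encode_snp(genotypes)
        if maf >= maf_threshold:
            kept[name] = codes
    return kept

# Toy data: rs_demo1 is common (G is the minor allele), rs_demo2 is too rare.
table = {
    "rs_demo1": ["AA", "AG", "GG", "AA", "AG"],
    "rs_demo2": ["CC"] * 19 + ["CT"],
}
encoded = preprocess(table)
```

In this toy table only `rs_demo1` survives the filter, because the single T allele in `rs_demo2` gives a minor allele frequency of 1/40, below the 0.05 threshold.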

Abstract

The invention discloses a genome-wide SNP (single nucleotide polymorphism) site analysis method based on a combination of random forest and Relief-F. The method includes: preliminarily screening SNP sites with a generalized linear model; evaluating SNP interactions with Relief-F and placing the interacting SNP sites at the front of a queue; ranking the SNP sites at the rear of the queue with random forest to identify the marginal effect of each single SNP site, yielding an SNP ranking queue; removing the SNP sites at the tail of the queue; processing again with Relief-F and random forest; and iterating to obtain the final ranking of the SNP sites. The method considers both the effect of each single SNP site and the interactions among SNP sites, and can process genome-wide SNP data to find the sites related to complex diseases. It is significant for research on the pathogenesis of complex diseases, disease risk prediction, the development of biological drugs, and the like.
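The queue-based iteration in the abstract can be sketched roughly as below. This is only one possible reading under stated assumptions: the scoring functions are hypothetical callables standing in for Relief-F (`interaction_score`) and random forest importance (`marginal_score`), the parameters `front_fraction`, `drop_fraction`, and `min_keep` are illustrative choices that the source does not specify, and the generalized linear model pre-screening step is omitted.

```python
def iterative_rank(snps, interaction_score, marginal_score,
                   front_fraction=0.5, drop_fraction=0.2, min_keep=2):
    """Rank SNPs by repeatedly re-scoring and trimming the tail of a queue.

    interaction_score(snps) and marginal_score(snps) are hypothetical
    callables returning {snp: score}; a higher score means more relevant.
    """
    remaining = list(snps)
    dropped = []  # SNPs removed so far, most recently removed first
    while True:
        # Interaction pass (Relief-F in the source): best scorers go in front.
        inter = interaction_score(remaining)
        queue = sorted(remaining, key=lambda s: inter[s], reverse=True)
        n_front = max(1, int(len(queue) * front_fraction))
        front, rear = queue[:n_front], queue[n_front:]
        # Marginal pass (random forest in the source): reorder only the rear.
        marg = marginal_score(rear) if rear else {}
        rear.sort(key=lambda s: marg[s], reverse=True)
        queue = front + rear
        n_drop = int(len(queue) * drop_fraction)
        if n_drop == 0 or len(queue) - n_drop < min_keep:
            # Nothing more to trim: survivors first, then removed SNPs.
            return queue + dropped
        # Trim the tail; SNPs removed later outrank those removed earlier.
        dropped = queue[-n_drop:] + dropped
        remaining = queue[:-n_drop]

# Toy fixed scores in place of real Relief-F / random forest output.
toy_inter = {"s1": 0.9, "s2": 0.8, "s3": 0.1, "s4": 0.2, "s5": 0.05}
toy_marg = {"s1": 0.1, "s2": 0.2, "s3": 0.7, "s4": 0.3, "s5": 0.6}
ranking = iterative_rank(
    list(toy_inter),
    lambda group: {s: toy_inter[s] for s in group},
    lambda group: {s: toy_marg[s] for s in group},
)
```

Passing the scorers as callables keeps the queue logic separate from the choice of Relief-F and random forest implementations, which in practice would be re-fitted on the surviving SNPs at each iteration.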

Description

Technical field

[0001] The invention belongs to the technical field of data processing. It conducts genome-wide association studies based on genome-wide single nucleotide polymorphism (SNP) data and identifies disease-related SNP sites, and can be used to explain the pathogenic mechanism of complex diseases, predict disease risk, and develop biopharmaceuticals.

Background art

[0002] Bioinformatics is an emerging discipline combining life science and computer science. It studies the collection, processing, storage, dissemination, analysis and interpretation of biological information, and reveals the biological mysteries hidden in complex biological data through the comprehensive use of biology, computer science and information technology. The basic principle of genome-wide association studies (GWAS) is to select a certain number of samples from the case group and the control group in the s...

Application Information

IPC(8): G06F19/18
Inventor: 杨利英, 黎成, 殷黎洋, 张军英, 袁细国
Owner 西安电子科技大学重庆集成电路创新研究院