
Rapid feature selection method based on whole genome sequence SNP data

A feature selection method and genome-wide technology, applied in genomics, proteomics, instruments, and related fields. It addresses the low accuracy and low efficiency of feature selection on high-dimensional samples, and the limited gains in analysis efficiency and accuracy of earlier methods, with the aim of improving the accuracy, efficiency, and credibility of SNP feature selection.

Pending Publication Date: 2021-08-06
HANGZHOU DIANZI UNIV

AI Technical Summary

Problems solved by technology

However, current SNP feature selection methods usually suffer from low accuracy and low efficiency when selecting features from high-dimensional samples.
For example, Chinese patent document CN102629305B discloses a feature selection method for SNP data that improves and combines the Relief algorithm and the SVM-RFE algorithm. Compared with using the traditional Relief algorithm or the SVM-RFE algorithm alone, it improves analysis efficiency and accuracy, but a problem remains: combining the two algorithms only pools their respective advantages and leaves their existing problems unsolved, so the gain in analysis efficiency and accuracy is limited.



Examples


Embodiment 1

[0045] A fast feature selection method based on whole-genome sequence SNP data takes the process shown in Figure 2 as its basic framework. The specific process is shown in Figure 1 and includes the following steps:

[0046] (A) Preprocess the whole genome sequence data to obtain all SNP sites. The specific process is shown in Figure 3 and includes the following steps:

[0047] A1) Obtain reference sequences and test sequences from databases;

[0048] A2) Use MUMmer 3 software to compare the reference sequence and the test sequence to obtain all SNP sites.
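
As a rough illustration of what the output of step A2 might look like downstream, the sketch below parses SNP calls from tab-delimited text of the kind MUMmer's show-snps utility can emit. The column layout (reference position, reference base, query base, and query position in the first four columns) and the sample rows are assumptions made for illustration and are not taken from the patent; parse_snp_table is a hypothetical helper.

```python
from typing import List, NamedTuple


class Snp(NamedTuple):
    ref_pos: int   # position in the reference sequence
    ref_base: str  # base in the reference ("." is assumed to mark an indel)
    qry_base: str  # base in the test sequence
    qry_pos: int   # position in the test sequence


def parse_snp_table(text: str) -> List[Snp]:
    """Parse tab-delimited rows, skipping anything that is not a data row."""
    snps = []
    for line in text.splitlines():
        cols = line.split("\t")
        # Data rows are assumed to start with: P1, ref base, query base, P2.
        if len(cols) < 4 or not cols[0].isdigit() or not cols[3].isdigit():
            continue
        snps.append(Snp(int(cols[0]), cols[1], cols[2], int(cols[3])))
    return snps


# Made-up sample rows, for illustration only.
sample = (
    "[P1]\t[SUB]\t[SUB]\t[P2]\n"
    "1021\tA\tG\t1019\n"
    "2550\tC\tT\t2548\n"
    "3100\t.\tA\t3099\n"
)
snps = parse_snp_table(sample)
```

The header row is skipped because its first column is not numeric, so only the three data rows become Snp records.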

[0049] (B) Use the FFS algorithm to filter irrelevant SNP features and retain the required SNP features. The specific process is shown in Figure 4 and includes the following steps:

[0050] Pass all SNP sites obtained in step (A) as the string argument to the regular-expression call re.findall(pattern, string, flags=0), which outputs the required SNP features.
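
The filtering call in step (B) can be sketched with Python's standard re module. This excerpt does not give the pattern or the layout of the input string, so the "position:ref>alt" encoding and the pattern that keeps only single-base substitutions are assumptions chosen for illustration.

```python
import re

# Hypothetical encoding: one "position:ref>alt" token per SNP site, space-separated.
snp_sites = "1021:A>G 2550:C>T 3100:.>A 4200:G>N 5075:T>C"

# Hypothetical pattern: keep only substitutions between unambiguous bases,
# dropping indels (".") and ambiguous calls ("N").
pattern = r"\b\d+:[ACGT]>[ACGT]\b"

# With no capture groups, re.findall returns the full text of every match.
required_features = re.findall(pattern, snp_sites, flags=0)
print(required_features)  # ['1021:A>G', '2550:C>T', '5075:T>C']
```

A single pass of re.findall over one string avoids a Python-level loop over sites, which is presumably where the method's speed comes from; the actual pattern would depend on which SNP features the study needs.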

[0051] (C) Sorting the criticality of the SNP featu...

Embodiment 2

[0057] A fast feature selection method based on whole genome sequence SNP data, comprising the following steps:

[0058] (A) Preprocess the whole genome sequence data to obtain all SNP sites, the specific process is as follows:

[0059] A1) Obtain reference sequences and test sequences from databases;

[0060] A2) Use MUMmer 3 software to compare the reference sequence and the test sequence to obtain all SNP sites.

[0061] (B) Use the FFS algorithm to filter irrelevant SNP features and retain the required SNP features. The specific process is as follows:

[0062] Pass all SNP sites obtained in step (A) as the string argument to the regular-expression call re.findall(pattern, string, flags=0), which outputs the required SNP features.

[0063] (C) Sorting the criticality of the SNP features retained in step (B), the specific process is as follows:

[0064] Add the SNP features retained in step (B) to the criticality-ranking data set S, and pass the data set S as the iterable input to the expression So...
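
The text here is truncated before it names the sorting expression or defines the criticality measure. The sketch below assumes, purely for illustration, that criticality is the number of test sequences in which a SNP feature occurs, and it ranks the data set S with Python's built-in sorted. Both choices are assumptions, not the patent's stated method.

```python
from collections import Counter

# Hypothetical retained features: one list per test sequence.
retained = [
    ["1021:A>G", "2550:C>T"],
    ["1021:A>G", "5075:T>C"],
    ["1021:A>G", "2550:C>T"],
]

# Data set S: each distinct feature with its assumed criticality (occurrence count).
S = Counter(feature for sample in retained for feature in sample)

# Rank by descending count; ties are broken by feature name for determinism.
ranked = sorted(S.items(), key=lambda item: (-item[1], item[0]))
```

Here ranked lists "1021:A>G" first (3 occurrences), then "2550:C>T" (2), then "5075:T>C" (1).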


Abstract

The invention relates to the technical field of SNP feature selection and discloses a rapid feature selection method based on whole genome sequence SNP data, comprising the following steps: (A) preprocessing the whole genome sequence data to obtain all SNP sites; (B) filtering irrelevant SNP features with an FFS algorithm and retaining the required SNP features; (C) ranking the SNP features retained in step (B) by criticality; (D) screening out significant SNPs according to the criticality ranking. The method screens significant SNPs out of all SNPs through SNP calling, rapid feature selection, and criticality ranking. It offers high accuracy and efficient feature selection on high-dimensional samples, and can be used to locate pathogenic genes, drug-resistance genes, and the like.
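
Step (D) screens significant SNPs from the criticality ranking. How the cutoff is chosen is not stated in this excerpt; the sketch below assumes a simple top-k cut over (feature, score) pairs. The value of k, the scores, and the helper select_significant are all hypothetical.

```python
from typing import List, Tuple


def select_significant(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the k highest-scoring features from (feature, score) pairs."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [feature for feature, _ in ordered[:k]]


# Made-up criticality scores, for illustration only.
scores = [("2550:C>T", 2.0), ("1021:A>G", 3.0), ("5075:T>C", 1.0)]
significant = select_significant(scores, k=2)
```

A score threshold would work equally well in place of a fixed k; the choice depends on how the criticality values are distributed in a given data set.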

Description

Technical field

[0001] The invention relates to the technical field of SNP feature selection, in particular to a rapid feature selection method based on whole genome sequence SNP data.

Background technique

[0002] In the field of data mining, feature selection is the process of screening out the minimum feature subset for identifying targets from the original feature set. Feature selection for data with high-dimensional and small-sample characteristics is one of the research hotspots in this field. Feature selection can reduce the dimensionality of the feature space, thereby speeding up the overall calculation speed of the model; screening out redundant "noise" data; increasing the classification accuracy of the classification model; and helping to reveal the information contained in the data. Commonly used data analysis methods have the characteristics of sample tendency, and the efficiency and accuracy of sample data analysis are low.

[0003] Single nucleotide polymorp...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B20/20
CPC: G16B20/20
Inventor: 刘文佳, 应南娇
Owner: HANGZHOU DIANZI UNIV