Method for clustering single nucleotide polymorphism (SNP) data

A clustering method and data technology, applied in the computer field, that addresses the poor fit of existing clustering algorithms to categorical data, supports convenient and efficient addition and deletion operations, and improves execution efficiency.

Inactive Publication Date: 2013-04-03
SHANGHAI UNIV

AI Technical Summary

Problems solved by technology

Existing subspace clustering algorithms are suitable only for continuous data, not for categorical data.

Examples

Embodiment 1

[0041] Referring to Figure 1, this clustering method for SNP data comprises the following steps:

[0042] A. Preprocess the original SNP data and convert it into a data format that can be processed by the clustering method;

[0043] B. Divide the preprocessed SNP data into a grid;

[0044] C. Calculate the density of the divided grid to obtain the subspace containing the clusters;

[0045] D. Cluster the subspaces obtained in step C to obtain the classified SNP data;

[0046] E. Save the clustering results to a file.

Embodiment 2

[0048] This embodiment is essentially the same as Embodiment 1, with the following specifics:

[0049] Referring to Figures 2 to 4, in step A the original SNP data is preprocessed and converted into a data format that the clustering method can process, as follows:

[0050] A1) Data coding: In the data exported from SNP chip detection, each SNP site holds one genotyping result. There are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure mark NC. AA is coded as 0, AB as 1, and BB as 2;

[0051] A2) Data cleaning: if an entire row of data is NC, the row is deleted; if a row contains only a few NC values, each is replaced with the value at the same position in the next sample; if more than 10% of a row is NC, the entire row is deleted.
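
The coding and cleaning rules above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes each row is one sample and each column one SNP site, and the names `CODES`, `encode`, and `clean` are hypothetical. The text does not say what happens when the next sample is also NC at that position or when there is no next sample; the sketch assumes it keeps looking through the following kept samples, wrapping around to the first.

```python
from typing import List, Optional

# Coding from the text: AA -> 0, AB -> 1, BB -> 2. "NC" is absent from the
# map, so CODES.get("NC") yields None, which marks a failed genotype call.
CODES = {"AA": 0, "AB": 1, "BB": 2}


def encode(rows: List[List[str]]) -> List[List[Optional[int]]]:
    """Convert genotype strings to integer codes; NC becomes None."""
    return [[CODES.get(call) for call in row] for row in rows]


def clean(rows: List[List[Optional[int]]],
          max_nc_fraction: float = 0.10) -> List[List[Optional[int]]]:
    """Drop rows with too many NC values, then fill the remaining NC values."""
    # Rows that are entirely NC, or more than 10% NC, are deleted.
    kept = [r for r in rows
            if r and sum(v is None for v in r) / len(r) <= max_nc_fraction]
    cleaned = [list(r) for r in kept]
    n = len(kept)
    for i, row in enumerate(cleaned):
        for j, value in enumerate(row):
            if value is None:
                # Assumption: take the same position from the next kept
                # sample; if that is also NC, keep looking (wrapping around).
                for step in range(1, n):
                    candidate = kept[(i + step) % n][j]
                    if candidate is not None:
                        row[j] = candidate
                        break
    return cleaned
```

Dropping rows before filling means a replacement value is never taken from a sample that is itself discarded; that ordering is a design choice of this sketch, not something the text specifies.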

[0052] The...

Embodiment 3

[0070] Referring to Figures 1 to 4, the clustering method for SNP data of the present invention is illustrated using the clustering of SNP data from hypertensive patients. The specific steps are as follows:

[0071] (1) Preprocess the original SNP data and convert it into a data format that can be processed by the clustering method, as shown in Figure 2. The specific steps are as follows:

[0072] a) Data coding: In the data exported from SNP chip detection, each SNP site holds one genotyping result. There are four possible results: wild-type homozygous AA, mutant heterozygous AB, mutant homozygous BB, and the genotyping-failure mark NC. AA is coded as 0, AB as 1, and BB as 2;

[0073] b) Data cleaning: if an entire row of data is NC, the row is deleted; if a row contains only a few NC values, each is replaced with the value at the same position in the next sample; i...

Abstract

The invention discloses a method for clustering single nucleotide polymorphism (SNP) data. The method comprises the following steps: first, pre-processing the original SNP data and converting it into a data format that the method can process; second, performing grid division on the pre-processed SNP data, dividing each dimension of the SNP data into three grids according to the expression value of each SNP site in each sample; third, calculating the density of the divided grids to obtain the subspaces that contain clusters; fourth, clustering the obtained subspaces to obtain the classified SNP data, where each cluster is a set of co-expressed SNP sites; and finally, storing the clustering result in a file. The method addresses the problem of clustering high-dimensional categorical data, and the SNP data can be clustered quickly and with high quality.
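
The grid-division and density steps might look roughly like the sketch below. It assumes rows are SNP sites and columns (dimensions) are samples, so each dimension splits into three cells, one per coded value 0, 1, 2. The density definition (fraction of SNP sites falling in a cell), the threshold `tau`, and the function name `dense_units` are illustrative assumptions; the text available here does not define them.

```python
from typing import Dict, List, Tuple


def dense_units(data: List[List[int]],
                tau: float) -> Dict[Tuple[int, int], List[int]]:
    """Return {(dimension, cell): [row indices]} for every dense cell.

    A cell is the set of rows whose value in one dimension equals 0, 1, or 2.
    It counts as dense when it holds at least a fraction `tau` of all rows
    (hypothetical criterion, chosen for illustration).
    """
    units: Dict[Tuple[int, int], List[int]] = {}
    n_rows = len(data)
    if n_rows == 0:
        return units
    n_dims = len(data[0])
    for dim in range(n_dims):
        for cell in (0, 1, 2):  # three grids per dimension
            members = [i for i, row in enumerate(data) if row[dim] == cell]
            if len(members) / n_rows >= tau:
                units[(dim, cell)] = members
    return units


# Four SNP sites measured in two samples.
example = [[0, 1], [0, 2], [0, 1], [2, 1]]
print(dense_units(example, tau=0.75))
```

With `tau=0.75`, only cell 0 of the first dimension and cell 1 of the second dimension are dense, each holding three of the four SNP sites.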

Description

Technical field

[0001] The invention relates to technologies for clustering large-scale, high-dimensional categorical data, and in particular to a clustering method for SNP data. It belongs to the field of computer technology.

Background technique

[0002] High-dimensional data clustering has become an important research direction in data mining. As technology advances, data collection becomes ever easier, producing larger and more complex databases, such as trade transaction data, Web documents, and gene expression data, whose dimensions (attributes) commonly number in the hundreds or thousands, or even more.

[0003] SNP is the abbreviation of single nucleotide polymorphism, which mainly refers to DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. Current research indicates that there are an estimated 3 milli...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventors: 吴悦, 贾敏, 雷州, 刘宗田
Owner: SHANGHAI UNIV