High efficiency clustering method based on locality-sensitive hashing and non-parametric Bayes method

A high-efficiency clustering method based on locality-sensitive hashing and a non-parametric Bayesian technique, applied in the fields of electrical digital data processing, special data processing applications, instruments, etc. It addresses the problems that greedy algorithm results are inaccurate and depend on the input order of the sequences, and it achieves high accuracy and accurate estimation.

Active Publication Date: 2016-12-14
TSINGHUA UNIV
3 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

[0005] The technical problem to be solved by the present invention is to provide an efficient and accurate gene sequence clustering method that overcomes the problems that the results of greedy algorithms are inaccurate and that the result...

Examples

Embodiment 1

[0034] The clustering algorithm of this embodiment, which is based on locality-sensitive hashing (LSH) and a non-parametric Bayesian method (DP-means), is described in detail below. The pseudocode of the core algorithm of the present invention is shown in Figure 1; the flow chart of the proposed clustering algorithm is shown in Figure 2.
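The text above names LSH but does not say which hash family is used. As a minimal sketch only, the block below assumes random-hyperplane (sign of random projection) hashing, a common LSH scheme for cosine similarity. The function name `lsh_buckets` and the parameters `n_bits` and `seed` are hypothetical and are not taken from the patent.

```python
from collections import defaultdict

import numpy as np


def lsh_buckets(vectors, n_bits=8, seed=0):
    """Group row vectors by the sign pattern of random projections.

    Assumption: random-hyperplane LSH. Vectors separated by a small angle
    tend to share a sign pattern and land in the same bucket, so pairwise
    comparisons can be restricted to sequences within one bucket.
    """
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # One random hyperplane (normal vector) per hash bit.
    planes = rng.standard_normal((vectors.shape[1], n_bits))
    bits = (vectors @ planes) >= 0
    buckets = defaultdict(list)
    for idx, row in enumerate(bits):
        buckets[tuple(row.tolist())].append(idx)
    return dict(buckets)
```

Because the hash depends only on the direction of a vector, two count vectors that differ by a positive scale factor always fall into the same bucket; dissimilar vectors may still collide, so bucket membership is a candidate filter rather than a final cluster assignment.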

[0035] S1. Remove all duplicate sequences from the data set, and convert each gene sequence to a k-mer count vector of dimension 4^k.

[0036] Since a typical metagenomic 16s rRNA sample may contain a large number of identical gene sequences, the abundance of a microorganism can be inferred from the number of these sequences. At the same time, within a cluster, a sequence with higher abundance is more likely to be the cluster center. Therefore, we set a weight for each sequence, and the weight is initialized to the number of times the sequence is repeated...
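Step S1 can be illustrated with a short sketch: collapse identical reads into unique sequences whose weights are their repeat counts, then map each unique sequence to a count vector of length 4^k. The helper names `dedupe_with_weights` and `kmer_count_vector` are hypothetical, and skipping k-mers that contain characters other than A, C, G, T is an assumption, not something the text specifies.

```python
from collections import Counter
from itertools import product

import numpy as np


def dedupe_with_weights(sequences):
    """Return the unique sequences and, for each, how often it occurred."""
    counts = Counter(sequences)
    unique = list(counts)  # first-seen order is preserved
    weights = np.array([counts[s] for s in unique])
    return unique, weights


def kmer_count_vector(seq, k):
    """Count every overlapping k-mer of seq into a vector of length 4**k."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = np.zeros(4 ** k, dtype=np.int64)
    for start in range(len(seq) - k + 1):
        slot = index.get(seq[start:start + k])
        if slot is not None:  # assumption: ignore k-mers with ambiguous bases
            vec[slot] += 1
    return vec
```

For example, `dedupe_with_weights(["ACGT", "ACGT", "AACC"])` keeps two unique sequences with weights 2 and 1, and `kmer_count_vector("ACGT", 2)` has length 16 with a count of one each for AC, CG, and GT.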

Embodiment 2

[0056] Example 2: Cluster Analysis of a Simulated 16s rRNA Gene Data Set

[0057] In order to compare the clustering results of the algorithm with the ground truth of the data set, we conducted comparative experiments on simulated data sets. The simulated data sets were generated with the software Grinder. There are 5 groups in total. The parameters of each group are shown in Table 1 below:

[0058] Table 1: Generation parameters of the simulated data sets

[0059]

[0060] The clustering result of the method of Embodiment 1 on Sim5 is visualized in Figure 3-A. The area of each circle in Figure 3-A is proportional to the size of the cluster (the number of sequences in the cluster). In each pair of circles, the left circle represents the ground truth, the right circle represents the clustering result of this algorithm, and the overlapping area represents the intersection of the two. The figure compares the largest 60 clusters in Sim5 (acc...
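The quantities behind each pair of circles (ground-truth cluster size, predicted cluster size, and the size of their intersection) can be computed as in the sketch below. The function name `cluster_overlap` is hypothetical, and the Jaccard ratio is an added convenience that the text does not mention.

```python
def cluster_overlap(truth_ids, predicted_ids):
    """Compare one ground-truth cluster with one predicted cluster."""
    truth, pred = set(truth_ids), set(predicted_ids)
    shared = truth & pred
    union = truth | pred
    return {
        "truth_size": len(truth),
        "predicted_size": len(pred),
        "overlap": len(shared),
        # Assumption: Jaccard index as a single-number summary of agreement.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }
```

With sequence identifiers `[1, 2, 3, 4]` in the ground-truth cluster and `[3, 4, 5]` in the predicted cluster, the overlap is 2 and the Jaccard ratio is 0.4.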

Embodiment 3

[0061] Example 3: Cluster Analysis of the Taihu Lake Microbial 16s rRNA Metagenome Data Set

[0062] This data set is a 16s rRNA metagenomic data set collected for the study of water pollution in Taihu Lake. It contains 81 water surface microbial samples collected in 9 different months of 2012. The entire data set contains 316,153,464 raw sequences, the sequence length is 80 bp, and the file size is 30 GB.

[0063] Figure 4-A shows the time required by the method of Embodiment 1 (DACE) to process data of various scales with different numbers of CPU cores. By using the Message Passing Interface and multi-threading, the algorithm makes effective use of computing resources, and its running time drops substantially as the number of CPU cores increases. In particular, when the data size is large, the running time decreases in proportion to the number of CPU cores. It can be seen that the scalability (Scalability...

Abstract

The invention relates to a high-efficiency clustering method based on locality-sensitive hashing (LSH) and a non-parametric Bayesian method. The method can effectively process massive sequence data, such as 16s rRNA and 18s rRNA data. Its efficient partition-and-iterate scheme avoids comparing large numbers of dissimilar sequences, so clustering results for large-scale data sets can be produced quickly. The method is described as the most efficient existing method for large-scale clustering problems in the field of bioinformatics. In addition, the DP-means algorithm estimates the cluster centers more accurately, so the clustering results of the method have very high accuracy.
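For readers unfamiliar with DP-means, the following is a minimal sketch of the generic algorithm, not the patent's exact procedure: a point whose squared distance to every existing center exceeds a penalty opens a new cluster, and centers are then recomputed as means. The penalty name `lam`, the optional `weights` argument (standing in for sequence abundance), and the use of squared Euclidean distance are assumptions made for illustration.

```python
import numpy as np


def dp_means(points, lam, weights=None, max_iter=100):
    """Generic DP-means with optional per-point weights (illustrative only)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    # Start with a single cluster centered at the weighted global mean.
    centers = [np.average(points, axis=0, weights=w)]
    labels = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(points):
            d2 = [float(np.sum((x - c) ** 2)) for c in centers]
            best = int(np.argmin(d2))
            if d2[best] > lam:
                # Too far from every center: open a new cluster at this point.
                centers.append(x.copy())
                best = len(centers) - 1
            if labels[i] != best:
                labels[i] = best
                changed = True
        # Drop empty clusters and recompute centers as weighted means.
        used = sorted(set(labels.tolist()))
        centers = [
            np.average(points[labels == c], axis=0, weights=w[labels == c])
            for c in used
        ]
        remap = {c: j for j, c in enumerate(used)}
        labels = np.array([remap[c] for c in labels.tolist()])
        if not changed:
            break
    return labels, np.array(centers)
```

Unlike k-means, the number of clusters is not fixed in advance; it grows whenever a point is farther than `lam` (in squared distance) from all current centers, so a larger `lam` yields fewer, broader clusters.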

Description

Technical field

[0001] The invention belongs to the field of computer applications (bioinformatics), and in particular relates to a high-efficiency clustering method based on locality-sensitive hashing and a non-parametric Bayesian method.

Background art

[0002] In recent years, with the rapid development of second-generation sequencing technology, a large number of high-quality DNA/RNA sequencing fragments can be obtained quickly and cheaply from environmental samples. The Earth Microbiome Project obtained 1.3 billion biological 16s rRNA gene sequences from 15,000 environmental samples around the world; the Human Gut Microbiome Project team obtained 1.1 billion 16s rRNA gene sequences from 531 test samples. Such large-scale sequencing yields a great deal of useful information for environmental management, disease research, drug development, and so on. When analyzing these data, the most basic task is to cluster the sequences, clustering the single gene seq...

Application Information

IPC(8): G06F19/24
CPC: G16B40/00
Inventor: 陈宁, 陈挺, 蒋林浩
Owner: TSINGHUA UNIV