Sequence clustering method and device

A sequence clustering method and device in the field of sequence clustering. It addresses the poor clustering quality, the performance limits of single-threaded execution, and the low computational efficiency of existing methods such as ESPRIT, and achieves good scalability.

Inactive Publication Date: 2015-04-22
SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

However, efficient parallelization of hierarchical clustering is inherently difficult. HPC-Clust[2] and ESPRIT[3] are currently the only two parallel methods for sequence clustering. HPC-Clust adopts profile-based alignment to reduce the computational burden, but its clustering quality is poor when dealing with sequences of unknown classification. ESPRIT is accurate, but its computational efficiency is low; it takes almost a w

Examples

Embodiment Construction

[0049] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0050] Refer to Figure 1, which is a flowchart of the first embodiment of the sequence clustering method proposed by the present invention. As shown in the figure, the method in this embodiment includes:

[0051] S101, each computing node reads the entire sequence data and builds a PBP tree independently.

[0052] In a specific implementation, the PBP tree contains multiple levels, and the balance interval from 0....

Abstract

An embodiment of the invention discloses a sequence clustering method. In the method, each computing node reads the entire sequence data and independently builds a PBP tree; the nearest neighbour (NN) of each sequence is searched for concurrently; all pairs (a, b) satisfying NN(a)=b and NN(b)=a are selected; each such pair is deleted from the PBP tree and merged into a new cluster, and all newly built clusters are inserted into the PBP tree; the NN lists of the new clusters and of all affected clusters are then updated concurrently by NN search. These steps repeat until only one cluster is left or the cluster distance exceeds a given threshold. An embodiment of the invention further discloses a sequence clustering device. With this method and device, hierarchical clustering of millions of sequences can be processed efficiently and accurately, the clustering result is equivalent to that of standard hierarchical clustering, and scalability is good.
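The merge loop described in the abstract can be illustrated with a minimal Python sketch. Several things here are assumptions made only for illustration and are not taken from the patent: a brute-force search stands in for the PBP-tree NN search, the distance is a normalized Hamming distance on equal-length strings with average linkage between clusters, everything runs sequentially rather than concurrently, and every NN is recomputed each round instead of only the affected ones. The names `hamming`, `cluster_distance`, `nearest_neighbor`, and `rnn_cluster` are hypothetical.

```python
def hamming(a, b):
    """Fraction of mismatched positions (assumed distance, equal-length strings)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2, seqs):
    """Average-linkage distance between two clusters of sequence indices (assumed)."""
    total = sum(hamming(seqs[i], seqs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def nearest_neighbor(c, clusters, seqs):
    """Brute-force NN search; a stand-in for the PBP-tree search.

    Returns (distance, neighbor). Ties are broken by the smaller cluster tuple,
    which guarantees that the globally closest pair is reciprocal.
    """
    return min((cluster_distance(c, o, seqs), o) for o in clusters if o != c)


def rnn_cluster(seqs, threshold):
    """Merge reciprocal nearest-neighbor pairs until one cluster remains
    or no reciprocal pair lies within the distance threshold."""
    clusters = {(i,) for i in range(len(seqs))}
    merges = []
    while len(clusters) > 1:
        # NN of every current cluster (the patent performs this search concurrently).
        nn = {c: nearest_neighbor(c, clusters, seqs) for c in clusters}
        # Select all pairs (a, b) with NN(a) = b and NN(b) = a.
        pairs = [
            (d, a, b)
            for a, (d, b) in nn.items()
            if a < b and nn[b][1] == a and d <= threshold
        ]
        if not pairs:
            break  # the closest remaining pair exceeds the threshold
        for d, a, b in sorted(pairs):
            # Delete the pair and insert the merged cluster.
            clusters -= {a, b}
            clusters.add(tuple(sorted(a + b)))
            merges.append((a, b, d))
    return sorted(clusters), merges
```

For example, calling `rnn_cluster(["AAAA", "AAAT", "GGGG", "GGGC"], threshold=0.3)` merges the two close pairs in a single round and then stops, because the remaining two clusters are farther apart than the threshold. Merging every reciprocal pair per round, rather than only the single closest pair, is what lets the NN searches within a round proceed independently of each other.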

Description

Technical field

[0001] The invention relates to the field of data mining, and in particular to a sequence clustering method and device.

Background

[0002] Genome sequencing is a commonly used tool in biological and biomedical research. Over the past few years, the data generation capacity of high-throughput sequencing has improved significantly, at a pace exceeding Moore's Law. The rapid accumulation of genetic information is a valuable resource for biological knowledge mining, but it also poses a severe challenge to data analysis and requires new computing algorithms for efficient data processing. Amplicon sequencing, currently one of the two main genome sequencing methods, is a powerful tool for in-depth analysis of phylogenetic and evolutionary details, especially for viruses and microbial groups (including bacteria, fungi, and plankton). A critical step in the analysis of amplicon sequencing data is the clustering of sequences into taxonomic or genoty...

Application Information

IPC(8): G06F19/24
Inventor: 蔡云鹏, 杨玉洁, 樊小毛
Owner: SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI