Large-scale biological data clustering method and system based on spanning tree

A biological data clustering technology, applied in the field of data processing, that addresses the problem that applications cannot fully exploit the parallel computing capabilities of multi-core platforms, and achieves the effect of avoiding storage

Active Publication Date: 2022-04-29
SHANDONG UNIV

AI Technical Summary

Problems solved by technology

[0009] On current multi-core processors, many applications cannot take full advantage of the platform's parallel computing capabilities.

Examples

Embodiment 1

[0043] This embodiment discloses a large-scale biological data clustering method, including:

[0044] Step 1: Sketch construction. This step generates k-mers from the original genome sequence with a sliding window to obtain a k-mer set, maps each k-mer in the set to a hash value with a hash function, and uses the minHash method to select a fixed number (1000) of the smallest hash values, which are saved as the sketch. This fixed number is the dimension of the sketch. Because the hash function that maps k-mers to hash values is uniform, that is, the hash values of the k-mers are evenly distributed over the hash value space, selecting the fixed number of smallest hash values is equivalent to randomly selecting a fixed number of k-mers from all k-mers. All genome sequences use the same hash function, which ensures that the same k-mer always produces the same hash value, so t...
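The sketch construction described in Step 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the default k-mer length `k=21`, and the use of BLAKE2b as the hash function are assumptions chosen for illustration; only the sliding window, the shared hash function, and the sketch size of 1000 come from the text.

```python
import hashlib


def kmers(sequence, k):
    """Yield every k-mer of `sequence` using a sliding window of width k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer.

    BLAKE2b is an illustrative choice (assumption); any uniform hash
    shared by all sequences would serve the same purpose.
    """
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(sequence, k=21, sketch_size=1000):
    """Return the `sketch_size` smallest distinct k-mer hashes, ascending."""
    hashes = {hash_kmer(kmer) for kmer in kmers(sequence, k)}
    return sorted(hashes)[:sketch_size]
```

Because every sequence is hashed with the same function, identical k-mers in different genomes contribute identical values to their sketches, which is what makes the sketches comparable.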

Embodiment 2

[0062] The purpose of this embodiment is to provide a large-scale biological data clustering system, including:

[0063] A similarity estimation module for estimating the similarity between genome sequences;

[0064] A minimum spanning tree generation module for computing the distance matrix between the genome sequences from the estimated similarities and constructing the minimum spanning tree, where the minimum spanning tree is generated by dividing the distance matrix into subgraphs and constructing sub-minimum spanning trees;

[0065] A clustering module for pruning edges that exceed a given threshold length in the minimum spanning tree to generate the clustering result.
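The minimum spanning tree and clustering modules can be sketched as follows. This is a hedged illustration, not the patented system: it assumes a small in-memory distance matrix, splits its upper triangle into row blocks (the hypothetical parameter `rows_per_block`), builds a sub-minimum spanning forest per block with Kruskal's algorithm, and merges the surviving edges. Merging is valid because an edge rejected inside a block is the heaviest edge on some cycle and therefore is not needed in the global tree. All function names are illustrative.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def kruskal(n, edges):
    """Return a minimum spanning forest of (weight, u, v) edges on n nodes."""
    parent = list(range(n))
    forest = []
    for w, u, v in sorted(edges):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            forest.append((w, u, v))
    return forest


def mst_from_blocks(dist, rows_per_block=2):
    """Build the MST by processing the distance matrix one row block at a time."""
    n = len(dist)
    candidates = []
    for start in range(0, n, rows_per_block):
        stop = min(start + rows_per_block, n)
        # Subgraph: upper-triangle edges whose first endpoint is in this block.
        block = [(dist[i][j], i, j)
                 for i in range(start, stop)
                 for j in range(i + 1, n)]
        candidates.extend(kruskal(n, block))  # keep only the sub-MST edges
    return kruskal(n, candidates)


def clusters_from_mst(n, mst, threshold):
    """Drop MST edges longer than `threshold`; return connected components."""
    parent = list(range(n))
    for w, u, v in mst:
        if w <= threshold:
            parent[find(parent, u)] = find(parent, v)
    groups = {}
    for node in range(n):
        groups.setdefault(find(parent, node), []).append(node)
    return sorted(groups.values())
```

For example, with five items at positions 0, 1, 2, 10, and 11 on a line and a threshold of 2, the long edge between items 2 and 3 is pruned and two clusters remain. Only the sub-forest edges of each block are retained, so the full set of pairwise distances never has to be held at once in a streaming variant of this idea.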

[0066] Those skilled in the art should understand that each module or step of the present invention described above can be implemented by a general-purpose computing device; optionally, they can be implemented as program code executable by the computing device, th...

Abstract

The invention provides a spanning-tree-based large-scale biological data clustering method and system, belongs to the technical field of large-scale genome data processing, and addresses the current problem of low computational efficiency. The method computes a distance matrix between the genome sequences in a streaming manner and constructs a minimum spanning tree based on the estimated similarity between the sequences, generating the minimum spanning tree by dividing the distance matrix into subgraphs and constructing sub-minimum spanning trees; edges exceeding a given threshold length are then cut from the minimum spanning tree to generate the clustering result. The method uses a sketch algorithm to estimate the similarity between sequences. Because the dimension of the k-mer set in a sketch is far smaller than that of the original sequence, the computation time and memory footprint of sketch-based similarity analysis are far smaller than those of exact comparison on the original data.
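As a rough illustration of how similarity can be estimated from sketches instead of full sequences, the snippet below computes the standard bottom-s MinHash estimate of the Jaccard index. It assumes each sketch is a list of integer hash values; the function names and the use of `1 - Jaccard` as a distance are assumptions for illustration, since the text does not state which distance formula the invention uses.

```python
def estimate_jaccard(sketch_a, sketch_b, sketch_size):
    """Estimate the Jaccard index of two k-mer sets from their sketches.

    Takes the `sketch_size` smallest hashes of the merged sketches and
    returns the fraction of them that occur in both sketches.
    """
    union_smallest = sorted(set(sketch_a) | set(sketch_b))[:sketch_size]
    if not union_smallest:
        return 0.0
    common = set(sketch_a) & set(sketch_b)
    shared = sum(1 for h in union_smallest if h in common)
    return shared / len(union_smallest)


def sketch_distance(sketch_a, sketch_b, sketch_size):
    """Illustrative distance (assumption): one minus the estimated Jaccard index."""
    return 1.0 - estimate_jaccard(sketch_a, sketch_b, sketch_size)
```

The cost of each comparison depends only on the sketch size, not on the genome length, which is the source of the time and space savings described above.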

Description

Technical field

[0001] The invention belongs to the technical field of data processing, and in particular relates to a spanning-tree-based large-scale biological data clustering method and system.

Background

[0002] The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.

[0003] With the development of gene sequencing technology and the reduction of sequencing costs, the scale of genome data keeps increasing, and the overall scale is growing exponentially. The reference genome data in NCBI's RefSeq database, a well-known genome database, has reached the TB level and may reach the PB level or higher in the near future. Data sets of this size require algorithms designed for large-scale data.

[0004] Clustering is a commonly used algorithm in biological big data processing. Its main approach is to...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B40/30, G16B30/10, G16B50/30
CPC: G16B40/30, G16B30/10, G16B50/30
Inventor: 刘卫国, 徐晓明, 殷泽坤
Owner: SHANDONG UNIV