Large-scale biological data clustering method and system based on spanning tree

A biological data clustering method, applied in the field of data processing, that addresses the failure to fully exploit the parallel computing capability of multi-core platforms and achieves the effect of avoiding storage

Active Publication Date: 2022-04-29

SHANDONG UNIV


AI Technical Summary
Problems solved by technology

[0009] On current multi-core processors, many applications cannot take full advantage of the platform's parallel computing capability.

Examples

Embodiment 1

[0043] This embodiment discloses a large-scale biological data clustering method, including:

[0044] Step 1: Construction of the sketch. This step generates k-mers from the original genome sequence with a sliding window to obtain a k-mer set, maps each k-mer in the set to a hash value with a hash function, and then uses the MinHash method to select the smallest fixed number (1000) of hash values, which are saved as the sketch; this fixed number is the dimension of the sketch. Because the hash function that maps k-mers to hash values is uniform, that is, the hash values of the k-mers are evenly distributed over the hash value space, selecting the smallest fixed number of hash values is equivalent to randomly selecting that number of k-mers from all k-mers. All genome sequences use the same hash function, which ensures that the same k-mer always yields the same hash value, so t...
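The sketch construction described in Step 1 can be illustrated with a minimal Python sketch. The k-mer length (`k=21`), the choice of BLAKE2b as the hash function, and all function names are assumptions made for illustration; the text above specifies only the sliding window, a uniform hash function, and a sketch dimension of 1000. The `jaccard_estimate` function shows the standard bottom-k MinHash comparison of two sketches; the source text is truncated before it describes the comparison, so that part is likewise an assumption. Reverse-complement (canonical) k-mers are not handled here.

```python
import hashlib


def kmers(sequence, k):
    """Yield every length-k substring using a sliding window of step 1."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (assumed hash: 8-byte BLAKE2b digest)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(sequence, k=21, sketch_size=1000):
    """Return the sketch_size smallest distinct hash values, sorted ascending."""
    hashes = {hash_kmer(km) for km in kmers(sequence, k)}
    return sorted(hashes)[:sketch_size]


def jaccard_estimate(sketch_a, sketch_b, sketch_size=1000):
    """Bottom-k estimate: fraction of the smallest union hashes found in both sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    union_bottom = sorted(set_a | set_b)[:sketch_size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in set_a and h in set_b)
    return shared / len(union_bottom)
```

Because every sequence is hashed with the same function, two sketches can be compared directly without revisiting the original sequences, for example `jaccard_estimate(build_sketch(seq_a), build_sketch(seq_b))`.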

Embodiment 2

[0062] The purpose of this embodiment is to provide a large-scale biological data clustering system, including:

[0063] A similarity estimation module for estimating the similarity between genome sequences;

[0064] The minimum spanning tree generation module is used to calculate the distance matrix between the genome sequences and construct the minimum spanning tree based on the estimated similarity between the genome sequences, generating the minimum spanning tree by dividing the distance matrix into subgraphs and constructing sub-minimum spanning trees;

[0065] The clustering module is used to prune edges exceeding a given threshold length in the minimum spanning tree to generate clustering results.
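The last two modules can be illustrated with a small Python sketch, assuming the distance matrix is already available as a symmetric list of lists. Kruskal's algorithm is used here only as one common way to build a minimum spanning tree; the excerpt does not say which algorithm the invention uses, and the function names are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Kruskal's algorithm on a symmetric distance matrix; returns (weight, i, j) edges."""
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two components, so it belongs to the tree
            parent[root_i] = root_j
            tree.append((weight, i, j))
    return tree


def cluster_by_threshold(n, tree, threshold):
    """Prune tree edges longer than threshold; return the remaining components."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for weight, i, j in tree:
        if weight <= threshold:  # edges exceeding the threshold are cut
            parent[find(i)] = find(j)

    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())
```

Each connected component that survives the pruning is one cluster, so a smaller threshold yields more, tighter clusters.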

[0066] Those skilled in the art should understand that each module or step of the present invention described above can be implemented on a general-purpose computing device; optionally, they can be implemented as program code executable by the computing device, th...

Abstract

The invention provides a spanning tree-based large-scale biological data clustering method and system, belonging to the technical field of large-scale genome data processing, and addresses the current problem of low computational efficiency. The method comprises: computing the distance matrix between genome sequences in a streaming manner and constructing a minimum spanning tree based on the estimated similarity between the genome sequences, where the minimum spanning tree is generated by dividing the distance matrix into sub-graphs and constructing sub-minimum spanning trees; and cutting edges that exceed a given threshold length in the minimum spanning tree to generate the clustering result. The method uses a sketch algorithm to estimate the similarity between sequences; because the dimension of the k-mer set in a sketch is far smaller than that of the original sequence, the computation time and memory footprint of sketch-based similarity analysis are far smaller than those of direct exact comparison of the original data.
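One way to picture building the tree from sub-graphs of a streamed distance matrix is the following Python sketch. It assumes the matrix is produced in blocks of rows and that each block's edges are merged with the edges of the tree kept so far; the actual partitioning used by the invention is not shown in this excerpt, and `kruskal`, `row_blocks`, `streaming_mst`, and the `distance` callback are hypothetical names. The merge is safe because an edge rejected while processing a sub-graph is the heaviest edge on some cycle, so a minimum spanning tree of the full graph exists that does not use it.

```python
def kruskal(n, edges):
    """Minimum spanning forest of n vertices from a list of (weight, i, j) edges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, i, j in sorted(edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((weight, i, j))
    return tree


def row_blocks(distance, n, block_rows):
    """Yield the upper-triangle edges of the distance matrix one block of rows at a time."""
    for start in range(0, n, block_rows):
        stop = min(start + block_rows, n)
        yield [(distance(i, j), i, j)
               for i in range(start, stop)
               for j in range(i + 1, n)]


def streaming_mst(n, edge_blocks):
    """Fold each block into the running tree so the full matrix is never stored."""
    tree = []
    for block in edge_blocks:
        # Only the at most n - 1 surviving edges are carried to the next block.
        tree = kruskal(n, tree + block)
    return tree
```

With this arrangement the peak number of edges held at once is one block plus at most n - 1 tree edges, rather than all n(n - 1)/2 pairwise distances.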

Description

Technical field

[0001] The invention belongs to the technical field of data processing, and in particular relates to a spanning tree-based large-scale biological data clustering method and system.

Background

[0002] The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.

[0003] With the development of gene sequencing technology and the reduction of sequencing costs, the scale of biological genome data keeps increasing, and the overall scale is growing exponentially. The reference genome data in NCBI's RefSeq database, a well-known genome database, has reached the TB level, and in the near future it may reach the PB level or higher. Data sets of this scale require algorithms designed for large-scale data.

[0004] Clustering is a commonly used algorithm in biological big data processing. Its main approach is to...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G16B40/30, G16B30/10, G16B50/30

CPC: G16B40/30, G16B30/10, G16B50/30

Inventors: 刘卫国, 徐晓明, 殷泽坤

Owner: SHANDONG UNIV
