Method for clustering metagenome sequences by using double-layer probability model

A double-layer probability model for clustering metagenomic sequences. It addresses high contamination, low genome completeness, and probability models that fit the sequence features poorly, thereby improving clustering quality.

Pending Publication Date: 2022-05-06
XI AN JIAOTONG UNIV

AI Technical Summary

Problems solved by technology

[0004] Existing clustering methods do not use the above three features reasonably and effectively: the weights assigned to different features are poorly chosen, the probability models do not fit the features well, and the overall method construction is incomplete.
Although existing methods have achieved good results on simulated data and real samples, a large gap from the ideal results remains, mainly because the genomes produced by clustering have low completeness and high contamination.

Examples

Embodiment Construction

[0031] The present invention will be further described in detail below in conjunction with specific embodiments, which are explanations of the present invention rather than limitations.

[0032] A method for clustering metagenomic sequences using a double-layer probability model, comprising the following steps:

[0033] P1, cluster all sequences in the initial metagenome using the first-layer probability model (a Dirichlet process Gaussian mixture model, DPGMM) to obtain multiple primary clusters.

[0034] P11, obtain the k-mer frequency feature vector and the coverage feature vector of every sequence in the initial metagenome, where a k-mer is an oligonucleotide of length k, and k = 4 for the k-mer frequency feature vector;

[0035] P12, merge the k-mer frequency feature vector and the coverage feature vector into a single feature vector;

[0036] P13, use the Dirichlet process to construct the DPGMM model, and use the feature vectors merged in P12 as the input for clustering;

[0037] P14, use the variati...
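
Steps P11 and P12 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names `kmer_frequencies` and `merge_features`, the normalization of counts to frequencies, the skipping of windows that contain characters other than A, C, G, T, and the example coverage values are all assumptions made for illustration.

```python
from itertools import product

K = 4  # k-mer length stated in step P11
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # all 256 possible 4-mers
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(sequence):
    """Return the normalized counts of all 256 4-mers in one sequence.

    Assumption: windows containing symbols other than A, C, G, T are skipped.
    """
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    for i in range(len(seq) - K + 1):
        idx = KMER_INDEX.get(seq[i:i + K])  # None for windows with N etc.
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def merge_features(sequence, coverage):
    """Concatenate the k-mer frequency vector and the coverage vector (step P12)."""
    return kmer_frequencies(sequence) + list(coverage)


# Hypothetical input: one short sequence with coverage values from two samples.
features = merge_features("ACGTACGTAC", coverage=[12.5, 3.0])
print(len(features))  # 258 = 256 k-mer frequencies + 2 coverage values
```

The merged vectors, one per sequence, would then serve as the clustering input described in step P13.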


Abstract

The invention relates to the technical field of metagenome sequence clustering, in particular to a method for clustering metagenome sequences by using a double-layer probability model. The method comprises the following steps: P1, clustering all sequences in an initial metagenome by using a first-layer probability model to obtain a plurality of primary clusters; P2, clustering each primary cluster again by using a second-layer probability model to obtain the final clusters. The second-layer model comprises a seed selection model, a k-mer frequency probability model and a coverage probability model. Because all sequences in the initial metagenome are processed by the double-layer model, features of different dimensions of the metagenome sequences can be used effectively, and the method is suitable for all metagenome sequencing data, including intestinal, soil and water microorganism data.

Description

Technical field

[0001] The invention relates to the field of metagenomic sequence clustering technology, in particular to a method for clustering metagenomic sequences by using a double-layer probability model.

Background art

[0002] As the oldest existing life form on Earth, prokaryotes play an important role in biological evolution, material cycling and environmental change. Mapping the complete genomes of bacteria and archaea is the basis for studying their systematic classification, and it also provides key data for studying microbial evolution, ecology, and community structure and function. With the development of sequencing technology and metagenomic assembly technology, it is possible to recover microbial sequences from sequencing data, but because real environmental samples are complex, the metagenomic sequences obtained by sequence assembly are highly fragmented. It is an effective way to match the sequences in...

Application Information

IPC(8): G16B40/00
CPC: G16B40/00
Inventor 杨铁林, 刘聪聪, 郭燕, 董珊珊
Owner XI AN JIAOTONG UNIV