
Metagenome sequence deep clustering method based on reference species tag constraint

A deep clustering method for metagenomic sequences based on reference species label constraints, in the field of metagenomic sequence clustering. It addresses the inaccurate clustering caused by the high similarity of closely related species and is reported to achieve excellent clustering performance.

Pending Publication Date: 2022-02-18
JILIN UNIV

AI Technical Summary

Problems solved by technology

[0005] In view of the above deficiencies in the prior art, the purpose of the present invention is to provide a deep clustering method for metagenomic sequences constrained by reference species labels. It aims to solve the inaccurate clustering that arises when prior-art methods cluster metagenomic DNA sequences, which is caused by the high similarity of closely related species within the same genus.

Examples

Embodiment 1

[0178] In this embodiment, the test set comprises two data sets: the Sharon simulated data set and the Strain simulated data set. The Sharon simulated data set contains 37628 contigs from 101 species, simulated from the first 96 data sets of the HMP project. The Strain simulated data set tests the ability to resolve closely related taxa; it contains 9401 contigs from 20 species, including strains of the same species. The 20 species consist of 5 different strains of Escherichia coli, 5 species of Bacteroides, 5 strains from different Clostridium species, and 5 strains of other typical intestinal bacteria.

[0179] The metagenome deep clustering method provided by the present invention (label-constrained deep clustering, LCDC) was used for the analysis, with the comparable existing methods COCACOLA, CONCOCT, MetaBAT and MaxBin2.0 as controls. The results are shown in Figure 5 and Figure 6....

Abstract

The invention provides a deep clustering method for metagenomic sequences based on reference species label constraints, built around a deep learning pre-training model that uses those constraints. A pre-training database of known species is established for different communities. When the database is built, the 4-mer feature vectors are divided into three cases: same species, different species of the same genus, and species of different genera. The relationships among the 4-mer features of sequences are studied separately for each case. A label-constraint error function is established for the pre-training model, pre-training is performed on the community's database of known labels, and a separate pre-trained model is constructed for each microbial community. To use the method, the user only needs to load the pre-trained model for the community of interest; once the model is loaded, the clustering result is obtained after a few iterations of the fine-tuning step. The method is reported to show excellent clustering performance.
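As a rough illustration of the two ingredients named in the abstract, the sketch below computes a 4-mer frequency vector for a sequence and assigns a pair of labelled sequences to one of the three cases. It is not the patented implementation: the function names `tetramer_frequencies` and `pair_condition`, the `(genus, species)` label format, the plain 256-dimensional vector (no merging of reverse complements), and the reading of the three cases as a relationship between two sequences are all assumptions made for this example.

```python
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetramer_frequencies(seq):
    """Return a 256-dimensional normalized 4-mer frequency vector.

    Windows containing characters other than A, C, G, T are skipped.
    (Hypothetical helper; the source does not specify the exact feature.)
    """
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = KMER_INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def pair_condition(label_a, label_b):
    """Classify two (genus, species) labels into one of the three cases.

    (Hypothetical helper; the label format is an assumption.)
    """
    genus_a, species_a = label_a
    genus_b, species_b = label_b
    if genus_a != genus_b:
        return "different_genus"
    if species_a == species_b:
        return "same_species"
    return "same_genus_different_species"


if __name__ == "__main__":
    vec = tetramer_frequencies("ACGTACGT")
    print(len(vec), vec[KMER_INDEX["ACGT"]])
    print(pair_condition(("Escherichia", "coli"), ("Escherichia", "coli")))
```

In a pre-training database of the kind described, each pair of labelled contigs could be tagged this way so that a label-constraint error term treats the three cases differently; how the error function itself is defined is not stated in this excerpt.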

Description

Technical field

[0001] The invention relates to the field of bioinformatics analysis, in particular to a method for deep clustering of metagenomic sequences based on reference species label constraints.

Background art

[0002] Microorganisms are the largest, most numerous, and most widely distributed group of organisms on earth. Research on microorganisms was mainly based on pure culture, but it was later found that more than 99% of microorganisms cannot be cultured. To study microorganisms that cannot be cultured, a new concept, metagenomics, came into being. Metagenomics uses next-generation sequencing technology to obtain most of the genetic material in an environment without laboratory cultivation. Unlike traditional sequencing methods, the raw data obtained by metagenomic sequencing are a large number of short DNA fragments derived from a variety of microorganisms. According to the overlapping relationship between DNA fragments, rese...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06N3/00, G06N3/04, G06N3/08
CPC: G06N3/084, G06N3/006, G06N3/045, G06F18/23, G06F18/214, Y02A90/10
Inventor: 刘富, 刘威, 刘云, 苗岩, 侯涛, 宋文智, 余芳宇
Owner: JILIN UNIV