
A method to improve the assembly efficiency of metagenomic nanopore sequencing data strains

A technology for metagenomic sequencing data, applied in the field of bioinformatics analysis, which addresses the problems of increased assembly difficulty and long assembly run time, so as to improve assembly efficiency and identification efficiency while ensuring validity and accuracy.

Active Publication Date: 2022-08-05
SIMCERE DIAGNOSTICS CO LTD +2

AI Technical Summary

Problems solved by technology

[0012] However, in practice, the volume of sequencing read data is large, the assembly run time is long, and the utilization rate of reads is low.
Specifically, because metagenomic sequencing targets all microbial sequences in a complex environment, the diversity of species and the high sequence similarity of closely related species make assembly more difficult and thereby increase the assembly run time.


Examples


Embodiment 1

[0069] Example 1 Construction of this patented method

[0070] The core of this patent is that the metagenomic reads are pre-grouped and each group of reads is then assembled separately, which improves the assembly efficiency.

[0071] The process of method optimization

[0072] Two aspects need to be explained first: how the 5-mer frequency matrix is derived from the read sequences, and how a cluster label is obtained for each read.

[0073] In the specific calculation,

[0074] 1. First calculate the 5-mer frequency matrix from the read sequences:

[0075] - The number of 5-mer sequence types is 4*4*4*4*4 / 2 = 512;

[0076] - Calculate the frequency of these 512 5-mers in each read;

[0077] - Obtain a 5-mer frequency matrix;

[0078] 2. Then reduce the dimensionality of the frequency matrix with UMAP, and use HDBSCAN to assign a cluster label to each read.

[0079] 3. Then assemble the reads of each cluster separately with the Canu / meta-Flye software.

[0080] 4. Finally, use blast to compare the assembly ...
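
Step 1 above can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation. It assumes that the division by 2 in [0075] reflects counting a 5-mer and its reverse complement as one type (which gives exactly 512 types, since no 5-mer is its own reverse complement), that each row is normalized to relative frequencies, and that windows containing non-ACGT characters are skipped. All function names are hypothetical.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(k=5):
    """List all canonical k-mers in sorted order (512 entries for k=5)."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def kmer_frequencies(read, index, k=5):
    """Return one matrix row: the relative frequency of each canonical k-mer in a read."""
    counts = [0] * len(index)
    total = 0
    seq = read.upper()
    for i in range(len(seq) - k + 1):
        column = index.get(canonical(seq[i:i + k]))
        if column is not None:  # assumption: windows with N or other symbols are skipped
            counts[column] += 1
            total += 1
    if total == 0:  # read shorter than k, or no valid window
        return [0.0] * len(index)
    return [c / total for c in counts]


def frequency_matrix(reads, k=5):
    """Return (column k-mers, matrix) with one row per read and one column per k-mer type."""
    kmers = canonical_kmers(k)
    index = {kmer: col for col, kmer in enumerate(kmers)}
    return kmers, [kmer_frequencies(read, index, k) for read in reads]
```

Each row of the resulting matrix describes one read, which is the form of input that the dimensionality reduction in step 2 would operate on.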

Embodiment 2

[0093] Example 2 UMAP grouping effect of the patented method

[0094] By means of pre-grouping, the present invention groups the official Zymo ONT sequencing data at different time / data volume gradients, and reads from the same species tend to be grouped into the same cluster.

[0095] For the dimensionality reduction results after UMAP grouping, see Figures 2-6: Figure 2 is the dimensionality reduction clustering result graph for 1h, Figure 3 for 2h, Figure 4 for 3h, Figure 5 for 4h, and Figure 6 for 5h. It can be seen that all reads are divided into different clusters by pre-clustering.
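
Once every read carries a cluster label, the reads have to be separated by label before each group can be assembled. The following Python sketch shows one way this could be done; it is an illustration under stated assumptions, not the patent's code. The function names and file naming scheme are hypothetical. HDBSCAN marks reads that it cannot assign to any cluster with the label -1; the source does not say how such reads are handled, so here they are simply written to their own file.

```python
from collections import defaultdict
from pathlib import Path


def group_reads(read_ids, sequences, labels):
    """Map each cluster label to the list of (read_id, sequence) pairs assigned to it."""
    if not (len(read_ids) == len(sequences) == len(labels)):
        raise ValueError("read_ids, sequences and labels must have the same length")
    clusters = defaultdict(list)
    for read_id, seq, label in zip(read_ids, sequences, labels):
        clusters[int(label)].append((read_id, seq))
    return dict(clusters)


def write_cluster_fastas(clusters, out_dir):
    """Write one FASTA file per cluster and return the file paths keyed by label."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {}
    for label, reads in sorted(clusters.items()):
        # Assumption: unassigned reads (label -1) are kept in a separate file.
        name = "unclustered" if label == -1 else f"cluster_{label}"
        path = out / f"{name}.fasta"
        with path.open("w") as handle:
            for read_id, seq in reads:
                handle.write(f">{read_id}\n{seq}\n")
        paths[label] = path
    return paths
```

Each file written this way could then be passed to the assembler as an independent, smaller input.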

Embodiment 3

[0096] Example 3 Evaluation of the assembly efficiency of the patented method

[0097] By means of pre-grouping, the present invention significantly improves the assembly efficiency of the official Zymo ONT sequencing data at different time / data volume gradients, such as the base data volumes of 1h-5h. The specific implementation follows the flow of Example 1.

[0098] The assembly time results are shown in Table 1. It can be seen that UMAP pre-grouping shortens the assembly time by more than half at every time point: by roughly 68% at 1h and by 86% to 93% at 2h-5h.

[0099] Table 1

[0100]
time   base (bp)        Assembly time (no UMAP)   Assembly time (UMAP)
1h     458,473,600      45m47.655s                14m36.602s
2h     919,961,649      503m13.250s               36m54.974s
3h     1,375,306,551    749m23.833s               65m43.655s
4h     1,796,485,159    1126m10.946s              154m36.229s
5h     2,205,881,698    1359m9.468s               179m0.873s
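
The relative reduction in assembly time can be checked directly from the values in Table 1. The short Python sketch below converts each duration to seconds and computes the percentage reduction per time point; the helper names are hypothetical and the data are copied from the table.

```python
import re

# Assembly times from Table 1: (time point, without pre-grouping, with pre-grouping).
TABLE_1 = [
    ("1h", "45m47.655s", "14m36.602s"),
    ("2h", "503m13.250s", "36m54.974s"),
    ("3h", "749m23.833s", "65m43.655s"),
    ("4h", "1126m10.946s", "154m36.229s"),
    ("5h", "1359m9.468s", "179m0.873s"),
]


def to_seconds(text):
    """Convert a duration such as '45m47.655s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def reductions(rows):
    """Return {time point: percentage reduction in assembly time}."""
    result = {}
    for label, before, after in rows:
        t_before, t_after = to_seconds(before), to_seconds(after)
        result[label] = 100.0 * (t_before - t_after) / t_before
    return result


if __name__ == "__main__":
    for label, pct in reductions(TABLE_1).items():
        print(f"{label}: assembly time reduced by {pct:.1f}%")
```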


Abstract

The present invention provides a method for improving the assembly efficiency of metagenomic sequencing data through dimensionality reduction clustering. The method uses k-mer frequency statistics to pre-group the reads by dimensionality reduction before assembly, which can significantly improve metagenomic assembly efficiency: the assembly time is reduced by more than half, while the validity and accuracy of the bioinformatics identification are ensured.

Description

Technical field

[0001] The invention relates to the field of bioinformatics analysis, in particular to a method for improving the assembly efficiency of metagenomic nanopore sequencing data strains through dimensionality reduction.

Background technique

[0002] Metagenomics is the genomic study of microorganisms in their original habitats. Metagenomics directly extracts the DNA or RNA of all microorganisms from environmental samples, constructs a metagenomic library and sequences it, and systematically analyzes the genetic diversity and functional diversity of microorganisms in the environment in order to study their taxonomy, function and evolution. Metagenomics allows us to go beyond the limitations of culturability and taxonomic properties to directly investigate the genetic composition of microbial communities such as bacteria, viruses and fungi. The analysis content of metagenomics mainly includes species composition and differen...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G16B20/20
CPC: G16B20/20
Inventor: 李振中, 陈莉, 李珊, 戴岩, 李诗濛, 任用
Owner: SIMCERE DIAGNOSTICS CO LTD