Method for rapidly and accurately identifying high-throughput genome data pollution sources

A high-throughput genomic technique, applied in the field of molecular biology, that addresses the impact of contamination on data analysis, the unavoidable deviations of sampling-based checks, and the inaccurate assessment of contamination.

Active Publication Date: 2016-07-06
Guangxi Key Open Laboratory of Crop Genetic Improvement and Biotechnology (广西作物遗传改良生物技术重点开放实验室)

AI Technical Summary

Problems solved by technology

[0004] However, exogenous contamination of sequencing samples has always been a problem that cannot be ignored, and it seriously hampers subsequent data analysis.
The sampling-based method reduces the time cost of contamination identification, but it has obvious problems.
Because the sampling is random, an identification analysis based on sampled data can hardly reflect the overall contamination of the sequencing data accurately.
This is especially true for projects with very deep sequencing and a very large amount of data: the sampled data make up only a small proportion of the total, so it is almost inevitable that the conclusions of contamination identification deviate from the actual contamination, or are even completely wrong. For example, a contaminant species may really be present, but because the sequencing data set is large, the sample may not cover any of the contaminating reads, so the contamination caused by that species cannot be identified correctly.
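The risk described above can be put into numbers with a small sketch. It assumes simple random sampling without replacement; the function name, the contaminant read count, and the sample size are hypothetical and are not taken from the patent.

```python
import math

def prob_sample_misses(total_reads, contaminant_reads, sample_size):
    """Chance that a random sample (without replacement) holds no contaminant read."""
    clean_reads = total_reads - contaminant_reads
    if sample_size > clean_reads:
        return 0.0  # the sample is too large to avoid every contaminant read
    # Product of (1 - contaminant / remaining) over the draws, summed in log space.
    log_p = sum(
        math.log1p(-contaminant_reads / (total_reads - i))
        for i in range(sample_size)
    )
    return math.exp(log_p)

# 89743255 is the read total from Embodiment 1; the other two numbers are hypothetical.
p_miss = prob_sample_misses(89743255, 1000, 10000)
print(f"P(sample contains no contaminant read) = {p_miss:.3f}")
```

With these illustrative numbers the sample misses the contaminant entirely in roughly 89% of draws, which is the failure mode the paragraph describes.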
[0010] Whether or not sampling is used, both methods share a common problem: the reads obtained by next-generation sequencing are very short, generally 100-250bp. To keep contamination identification accurate, the thresholds set in the comparison parameters are generally relatively high (mainly the two parameters identity and evalue, with thresholds of 90% and 1e-05 respectively), and any sequence in the comparison result that falls below these thresholds is considered not to come from a contamination source.
At frequently mutating sites, where genetic diversity is inherently high, contaminant sequences can fall below these thresholds, so the contamination is underestimated in many cases.
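The threshold step can be illustrated with a minimal sketch. It assumes the comparison results are in BLAST's default 12-column tabular layout (percent identity in column 3, evalue in column 11); the function name and the two example hits are hypothetical.

```python
def filter_hits(lines, min_identity=90.0, max_evalue=1e-5):
    """Keep hits that meet both thresholds (identity >= 90%, evalue <= 1e-05)."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        identity = float(fields[2])   # percent identity
        evalue = float(fields[10])    # expectation value
        if identity >= min_identity and evalue <= max_evalue:
            kept.append((fields[0], fields[1], identity, evalue))
    return kept

# Hypothetical hits: read_2 is a divergent match at 85% identity.
hits = [
    "read_1\tsubjA\t98.0\t100\t2\t0\t1\t100\t501\t600\t1e-45\t180",
    "read_2\tsubjB\t85.0\t100\t15\t0\t1\t100\t201\t300\t1e-20\t95",
]
kept = filter_hits(hits)
print([hit[0] for hit in kept])
```

The second hit has a convincing evalue but is discarded on identity alone, which is how a short read from a diverged contaminant ends up uncounted.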
[0011] In short, the impact of contamination on analysis work is currently not well understood. Among the commonly used methods of contamination identification and removal, comparing all sequencing data against the NT database occupies a large amount of CPU resources and takes a long time; the sampling method risks an inaccurate or even wrong assessment of the actual contamination; and both methods risk underestimating contamination because of overly strict thresholds, which in turn affects contamination removal and subsequent bioinformatics analysis.



Examples

Embodiment 1

[0036] The genome of a pathogenic fungus (Plasmopara halstedii) was sequenced de novo on the second-generation Illumina platform. Two libraries, of 180bp and 500bp, were sequenced to depths of 35X and 34X respectively, with a read length of 100bp. The libraries contain 46308070 and 43435185 reads respectively, 89743255 reads in total, for a total data volume of 8.36G. Contamination sources were identified as follows:

[0037] (1) The reads were assembled with ABySS (k-mer parameter set to k=50, all other parameters at the software defaults). The assembly contains 30428 scaffolds in total, with an N50 of 10506, a longest scaffold of 479848, and a size of 80M. It is easy to see that: ① the total number of assembled sequences is 30428, only 0.03% of the original 89743255 reads; ② the total data volume is 118M, only 1.38% of the original 8.36G; ③ the sequence length increases from 100bp to 10506 ...
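The reduction figures quoted above can be checked with a short sketch, which also shows how an N50 value is usually computed. The helper name `n50` and the toy scaffold lengths are hypothetical, and treating 1G as 1024M is an assumption about the units used in the text.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

# Figures quoted in Embodiment 1.
total_reads = 46308070 + 43435185
scaffold_pct = 30428 / total_reads * 100
volume_pct = 118 / (8.36 * 1024) * 100  # assumes 1G = 1024M

print(total_reads)
print(f"{scaffold_pct:.2f}% of the sequences, {volume_pct:.2f}% of the data volume")
print(n50([2, 3, 4, 5, 6]))  # toy lengths, not the real scaffolds
```

Under the stated unit assumption, both percentages round to the values given in the text, so the comparison step has far fewer and far longer sequences to handle than the raw reads.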

Abstract

The invention discloses a method for rapidly and accurately identifying contamination sources in high-throughput genome data. The raw sequencing data of a de novo genome project are first assembled; genes are predicted on the assembly, and the amino acid sequences of the proteins encoded by those genes are obtained by translation. The assembled genomic sequences and the amino acid sequences are then compared by BLAST against the NT and NR databases of the NCBI respectively, and the homologous sequences found serve as the original comparison databases. The species information for each sequence is extracted from these databases, the species are ranked by number of sequences from most to least, and the nucleotide and amino acid results are combined to judge whether exogenous contamination is present. The method minimizes the impact of exogenous contamination on high-throughput genome sequencing data and on subsequent bioinformatics analysis in de novo genome projects, and improves the speed and efficiency of contamination source identification.
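The ranking and combined judgment can be sketched as follows. The abstract does not state the exact decision rule, so the rule here is an assumption made for illustration: a species is flagged when it lies outside the expected genus and reaches a minimum share of hits in both the NT and NR results. All names, the threshold, and the hit lists are hypothetical.

```python
from collections import Counter

def species_shares(species_per_hit):
    """Rank species by hit count, most to least, with each species' share of hits."""
    counts = Counter(species_per_hit)
    total = sum(counts.values())
    return [(sp, n, n / total) for sp, n in counts.most_common()]

def flag_suspects(nt_species, nr_species, expected_genus, min_share):
    """Assumed rule: flag unexpected species that are frequent in BOTH result sets."""
    def frequent_foreign(species_per_hit):
        return {
            sp for sp, _, share in species_shares(species_per_hit)
            if share >= min_share and not sp.startswith(expected_genus)
        }
    return sorted(frequent_foreign(nt_species) & frequent_foreign(nr_species))

# Hypothetical species labels, one per best hit.
nt_hits = ["Plasmopara halstedii"] * 7 + ["Escherichia coli"] * 2 + ["Homo sapiens"]
nr_hits = ["Plasmopara halstedii"] * 6 + ["Escherichia coli"] * 3 + ["Bacillus subtilis"]

print(species_shares(nt_hits)[0][0])
print(flag_suspects(nt_hits, nr_hits, "Plasmopara", min_share=0.15))
```

Requiring agreement between the nucleotide and protein rankings is one way to read "combining" the two results; closely related species would still need manual review, since hits to relatives of the target are expected.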

Description

Technical field
[0001] The invention belongs to the technical field of molecular biology and relates to a method for quickly and accurately identifying contamination sources in high-throughput genome data.
Background technique
[0002] High-throughput sequencing technology (High-throughput sequencing), also known as "next generation" sequencing technology, can sequence hundreds of thousands to millions of DNA molecules at a time.
[0003] In recent years, as the throughput of high-throughput sequencing has kept increasing, run times have kept shortening, sequenced fragments have kept lengthening, and costs have kept falling, the range of applications of the technology has become wider and wider. More and more teams choose to carry out scientific research and assisted breeding through high-throughput sequencing. With the sequencing of massive genetic data, more and more species have published whole-genome data, ...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/18
CPC: G16B20/00
Inventor: 曲俊杰, 尹玲, 卢江
Owner: 广西作物遗传改良生物技术重点开放实验室