Splicing method and system combining second-generation and third-generation genomic sequencing data

A genome sequencing and splicing technology, applied to the field of splicing methods and systems for second-generation and third-generation genome sequencing data. It addresses the difficulty of splicing prokaryotic and eukaryotic genomes, reduces sequencing time and cost, and resolves difficult repetitive sequence regions, among other issues.

Inactive. Publication Date: 2015-09-30
INST OF COMPUTING TECH CHINESE ACAD OF SCI
4 Cites, 40 Cited by

AI Technical Summary

Problems solved by technology

[0003] DNA sequencing technology has gone through three main stages of development: first-generation, second-generation, and third-generation sequencing. First-generation sequencing is the dideoxy chain-termination method invented by Sanger in 1977. Using an improved Sanger method, researchers completed almost all of the sequencing for the Human Genome Project (HGP, 1990-2003). Second-generation sequencing emerged in the early 21st century, when 454, Illumina, and ABI successively launched next-generation (that is, second-generation) sequencers. These instruments run a large number of sequencing reactions in parallel, which greatly reduces sequencing time and cost. Compared with traditional sequencing methods, the most obvious advantage of next-generation sequencing is its high throughput; for example, the SOLiD3 sequencer can produce 20 GB of sequencing data in a single run. The disadvantage is that its reads are much shorter than those of Sanger sequencing: Sanger reads can reach 900 bp, whereas 454 reads are 250-400 bp and Solexa reads are 50-75 bp. Short reads make it difficult for splicing algorithms to resolve repetitive sequence regions, which leads to fragmented splicing results. In addition, the error rate of second-generation sequencing is also higher.
Third-generation sequencing began in 2008 and is characterized by a "single-molecule sequencing" strategy. It mainly includes the HeliScope single-molecule sequencing technology of Helicos BioSciences, the single-molecule real-time sequencing technology of Pacific Biosciences, and the nanopore single-molecule sequencing technology of Oxford Nanopore Technologies Ltd. The distinguishing feature of single-molecule sequencing is that the sample is no longer amplified, which preserves the uniform coverage of the sequencing data (reads) across the genome as far as possible. Reads produced by single-molecule sequencing are 3 kb to 20 kb long. Their potential advantage is that they can resolve the splicing of long repetitive sequences; their disadvantage is a high error rate (about 5% to 15%).
Because of the short read length of second-generation sequencing data and the high error rate of third-generation sequencing data, completely assembling prokaryotic and eukaryotic genomes remains difficult.

Examples

Embodiment Construction

[0047] The concrete steps of the present invention are as follows:

[0048] Step 1: Build a de Bruijn graph from the second-generation sequencing data. Second-generation sequencing data generally include quality information for the reads. First, this quality information is used to preprocess the sequencing data and remove low-quality fragments. The reads are then broken into k-mers of equal length (a k-mer is a nucleotide sequence of length k, obtained by cutting a read continuously, advancing one base at a time), and a de Bruijn graph is constructed. While reading the reads, ARCS23 generates k-mers according to the k-mer length k supplied by the user, saves them in a hash table, and records the number of occurrences of each k-mer. The SOAPdenovo2 implementation uses 1 byte to represent the number of occurrences of a k-mer, so it can only count up to 255. In the second-generation sequence splicing, the general sequencing depth will be r...
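The k-mer counting described in Step 1 can be illustrated with a minimal Python sketch. It is not the ARCS or SOAPdenovo2 code: the function names, the prefix-trimming rule, and the `min_q` threshold are assumptions made for illustration. A plain dictionary stands in for the hash table, and the optional saturation at 255 mimics what a 1-byte counter can hold.

```python
MAX_COUNT = 255  # largest value that fits in a 1-byte counter


def trim_low_quality(read, quals, min_q=20):
    """Keep the prefix of the read before the first base with quality < min_q.

    Hypothetical preprocessing rule; the source only says that
    low-quality fragments are removed using the quality information.
    """
    for i, q in enumerate(quals):
        if q < min_q:
            return read[:i]
    return read


def kmers(read, k):
    """Yield every length-k substring, advancing one base at a time."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def count_kmers(reads, k, saturate=True):
    """Count k-mer occurrences in a hash table (a Python dict).

    With saturate=True the count stops at 255, as a 1-byte counter would.
    """
    counts = {}
    for read in reads:
        for km in kmers(read, k):
            c = counts.get(km, 0) + 1
            if saturate and c > MAX_COUNT:
                c = MAX_COUNT
            counts[km] = c
    return counts
```

Calling `count_kmers` on a homopolymer read of 300 bases with `k=1` shows the limitation: the saturating counter stops at 255, while `saturate=False` records all 300 occurrences.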

Abstract

The invention relates to the fields of biological information technology and computational biology, and in particular to a splicing method and system that combine second-generation and third-generation genomic sequencing data. The method comprises the following steps: second-generation genomic sequencing data are obtained, the data are preprocessed using the quality information of the partial base sequences (reads) they contain, and a de Bruijn graph is built; sequencing errors in the de Bruijn graph are processed to generate a new de Bruijn graph, the new graph is compressed to generate a compressed de Bruijn graph, and the sequence multiplicity of each compressed edge of the compressed de Bruijn graph is obtained; third-generation genomic sequencing data are obtained and mapped onto the graph built from the second-generation genomic sequencing data, the compressed de Bruijn graph is untangled according to the optimal configuration, and the gaps in the optimal configuration are filled to complete the splicing of the genomic sequencing data.
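The graph-building and compression steps named in the abstract can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented method: the function names are hypothetical, nodes are taken to be (k-1)-mers, and compression is reduced to merging non-branching paths into single labelled edges. Error processing, multiplicity estimation, and the use of third-generation reads are omitted.

```python
from collections import defaultdict


def build_de_bruijn(kmer_list):
    """Build successor and predecessor maps; nodes are (k-1)-mers.

    Each k-mer contributes one edge from its prefix to its suffix.
    """
    succ = defaultdict(set)
    pred = defaultdict(set)
    for km in kmer_list:
        u, v = km[:-1], km[1:]
        succ[u].add(v)
        pred[v].add(u)
    return succ, pred


def compress(succ, pred):
    """Merge each non-branching path into one edge (start, end, sequence).

    A node is 'inner' when it has exactly one predecessor and one
    successor. Paths are started only from non-inner nodes, so isolated
    cycles made solely of inner nodes are not reported by this sketch.
    """
    nodes = set(succ) | set(pred)

    def is_inner(n):
        return len(pred[n]) == 1 and len(succ[n]) == 1

    edges = []
    for u in sorted(nodes):
        if is_inner(u):
            continue
        for v in sorted(succ[u]):
            seq = u + v[-1]
            # Walk through inner nodes, extending the edge label by one base.
            while is_inner(v):
                v = next(iter(succ[v]))
                seq += v[-1]
            edges.append((u, v, seq))
    return edges
```

For example, the 3-mers `ACG`, `CGT`, `GTT` form a single unbranched path, so `compress` returns one edge labelled `ACGTT`. Adding a branch (for instance `CGA` alongside `CGT`) stops the merge at the branching node.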

Description

Technical field

[0001] The invention relates to the fields of biological information technology and computational biology, and in particular to a splicing method and system for the joint use of second-generation and third-generation genome sequencing data.

Background technique

[0002] A genome is all of the genetic information of an organism, contained in its DNA (or RNA for some viruses). DNA is a complementary double strand composed of four bases: A, C, T, and G. According to the "central dogma" of biology, the base sequence of DNA guides the transcription of RNA and the subsequent translation and synthesis of proteins. Understanding the base sequence of DNA is therefore an important basis for understanding biological laws. Partial base sequences (reads) of DNA are obtained through sequencing technology and are used to assemble a complete genome sequence for further analysis and research.

[0003] D...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/22
Inventor: 卜东波, 张仁玉, 陈挺, 李帅成, 孙世伟, 刘兴武, 许情, 郑全刚, 王超
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI