Method for performing full genome sequence hole filling by means of long sequencing read segment

A whole-genome hole-filling technique based on long sequencing reads, applied in the fields of genomics and bioinformatics, that addresses the long running time, large memory requirements, and limited applicability of existing approaches

Active Publication Date: 2019-03-01
CHINESE ACAD OF FISHERY SCI

AI Technical Summary

Problems solved by technology

Aligning long sequencing reads to the genome sequence and performing local assembly in hole regions is time-consuming and memory-intensive, which limits the use of existing hole-filling methods, especially on large genomes.



Examples


Embodiment 1

[0153] Using C. elegans PacBio sequencing data to fill holes in the C. elegans genome

[0154] Materials: The next-generation genome sequencing data of C. elegans (NCBI SRA accession numbers DRR023912 and DRR023913) were downloaded from the website of the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/). PacBio raw sequencing data of C. elegans were downloaded from http://datasets.pacb.com.s3.amazonaws.com/2014/c_elegans/list.html. The data were in fastq format, about 7.4G in volume, with an average read length of 10,958 bases. The downloaded PacBio data were preprocessed by converting the fastq files to fasta format. First, the next-generation sequencing data were assembled with Platanus (platanus.bio.titech.ac.jp) to obtain the nematode genome. The genome size was 95.5 Mb, including 4,256 holes with a total length of 3.6 Mb. In order to fill the unknow...
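The fastq-to-fasta preprocessing mentioned above can be sketched as follows. This is a minimal illustration, not the tool used in the patent: it assumes the common four-line-per-record fastq layout (no wrapped sequence lines), and the function name `fastq_to_fasta` is hypothetical.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the number of records written.

    Assumption: each record occupies exactly four lines
    (@header, sequence, '+', qualities).
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:].rstrip("\n") + "\n" + sequence + "\n")
        count += 1
    return count


# Usage with in-memory text; real files would be opened with open(path).
source = io.StringIO("@read1\nACGT\n+\nIIII\n")
target = io.StringIO()
fastq_to_fasta(source, target)
```

Quality values are discarded because the fasta format has no field for them.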

Embodiment 2

[0164] Using C. elegans Nanopore sequencing data to fill holes in the C. elegans genome

[0165] Materials: As in Embodiment 1, the next-generation genome sequencing data of C. elegans were downloaded and assembled into a genome; the genome size and number of holes are the same as in Embodiment 1. The Nanopore raw sequencing data of C. elegans (NCBI SRA accession numbers ERR2092776 and ERR2092777) were downloaded from the NCBI website (https://www.ncbi.nlm.nih.gov/). The data volume is about 9.9G, and the average read length is 11,537 bases. The unknown bases in the holes were filled by following the steps below, with three iterations of hole filling.

[0166] 1. Break each Nanopore long sequencing read into equal-length tag fragments of 300 bases.

[0167] 2. Align the tag fragments to the genome with bwa-mem, run with the parameters -k 14 -W 20 -r 10 -A 1 -B 1 -O 1 -E 1 -L 0.

[0168] 3. The remaining steps are the sam...
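Steps 1 and 2 above can be sketched in Python. This is an illustration under stated assumptions rather than the patent's implementation: the fragment naming scheme `read_id/index`, the choice to drop a trailing remainder shorter than 300 bases (so that all fragments have equal length), and the file names are all hypothetical. The bwa options are the ones listed in step 2; the command is only constructed here, not executed, and the reference would need to be indexed with `bwa index` beforehand.

```python
def split_read(read_id, sequence, fragment_length=300):
    """Cut one long read into consecutive, non-overlapping fragments.

    Returns (name, offset_in_read, fragment_sequence) tuples.
    Assumption: a trailing piece shorter than fragment_length is dropped.
    """
    fragments = []
    last_start = len(sequence) - fragment_length
    for index, start in enumerate(range(0, last_start + 1, fragment_length)):
        name = f"{read_id}/{index}"  # hypothetical naming scheme
        fragments.append((name, start, sequence[start:start + fragment_length]))
    return fragments


def bwa_mem_command(genome_fasta, fragments_fasta):
    """Build the bwa mem argument list with the parameters given in step 2."""
    options = ["-k", "14", "-W", "20", "-r", "10",
               "-A", "1", "-B", "1", "-O", "1", "-E", "1", "-L", "0"]
    return ["bwa", "mem", *options, genome_fasta, fragments_fasta]


# Usage: a 700-base read yields two 300-base fragments at offsets 0 and 300.
example = split_read("read1", "A" * 700)
command = bwa_mem_command("genome.fa", "fragments.fa")  # hypothetical file names
```

Keeping the offset of each fragment within its read makes it possible to map an aligned fragment back to a position on the original long read.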

Embodiment 3

[0173] Using yeast Nanopore sequencing data to fill holes in the yeast genome

[0174] Materials: The next-generation genome sequencing data of yeast (NCBI SRA accession numbers ERR225691, ERR225692, and SRR507778) were downloaded from the website of the National Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov/). The original Nanopore sequencing data of yeast were downloaded from the same website (SRA accession numbers ERR1883389, ERR1883402, ERR1883399, ERR1883400, and ERR1883401). The data were in fastq format, about 440M in volume, with an average read length of 8,000 bases. The downloaded Nanopore data were preprocessed by converting the fastq files to fasta format. First, the next-generation sequencing data were assembled with Platanus (platanus.bio.titech.ac.jp) to obtain the yeast genome. The genome size was 11.27 Mb, including 472 holes, and the total length of the holes was 290.8...


Abstract

The invention discloses a method for filling holes in a whole genome sequence by means of long sequencing reads. The method comprises the following steps: 1, dividing each long sequencing read into a plurality of consecutive tag fragments, then aligning the tag fragments to the whole genome sequence that requires hole filling; 2, determining the alignment direction and alignment position on the whole genome sequence of each tag fragment that matches it; 3, establishing an association between the corresponding long sequencing read and a hole according to the positional relationship between the alignment positions of the tag fragments and the hole; and 4, filling the unknown sequence of the hole with the long sequencing read according to that association. The method applies a short-read alignment approach to the alignment of long sequencing reads against the genome: the short fragments, rather than the whole long reads, are aligned to the genome sequence, which shortens running time and lowers memory requirements, improving the speed and memory efficiency of hole filling with long reads.
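Step 3 of the abstract can be illustrated with a small sketch. The exact association rule is not given in the text available here, so the rule below is an assumption: a read is linked to a hole when it has tag fragments aligned on both flanks of the hole, on the same sequence and strand, within a hypothetical `max_distance`. The tuple layouts and names are likewise hypothetical, and coordinates are taken to be 0-based and half-open.

```python
from collections import defaultdict


def associate_reads_with_holes(alignments, holes, max_distance=20000):
    """Link long reads to holes via the positions of their aligned fragments.

    alignments: (read_id, ref_name, start, end, strand) per aligned fragment
    holes:      (ref_name, start, end) per hole
    Assumed rule: a read is associated with a hole if it has fragments on
    both sides of the hole, on the same strand, within max_distance bases.
    """
    spans_by_read = defaultdict(list)
    for read_id, ref_name, start, end, strand in alignments:
        spans_by_read[(read_id, ref_name)].append((start, end, strand))

    links = defaultdict(list)
    for hole in holes:
        ref_name, hole_start, hole_end = hole
        for (read_id, aligned_ref), spans in spans_by_read.items():
            if aligned_ref != ref_name:
                continue
            # Strands of fragments lying entirely to the left of the hole.
            left = {s for a, b, s in spans
                    if b <= hole_start and hole_start - b <= max_distance}
            # Strands of fragments lying entirely to the right of the hole.
            right = {s for a, b, s in spans
                     if a >= hole_end and a - hole_end <= max_distance}
            if left & right:
                links[hole].append(read_id)
    return dict(links)
```

A read linked in this way covers the hole between two of its fragments, so the portion of the read between those fragments is a candidate sequence for step 4.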

Description

Technical field

[0001] The invention belongs to the field of genomics and biological information technology, and relates to a method for filling holes in DNA assembly, in particular to a method for filling holes in a whole genome sequence by using long sequencing reads.

Background

[0002] Next-generation sequencing technologies allow for the low-cost and rapid construction of genome sequences by de novo assembly. However, factors such as sequencing errors, repetitive regions, heterochromatin, genomic polymorphisms, and the sequencing bias of next-generation platforms make some genomic regions difficult to assemble. These regions appear as gaps between genomic sequences and are generally replaced by a string of Ns, where the length of the run of Ns represents the size of the hole. The process of filling these holes in the genome is known as hole filling. To fill the gaps between genome sequences and construct a complete genome, several methods have been developed to use ...
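Because holes are represented as runs of N, their positions and sizes can be read directly from an assembled sequence. The following is a minimal sketch; the function name `find_holes` is hypothetical, and treating lowercase `n` as part of a hole is an assumption.

```python
import re


def find_holes(sequence):
    """Return (start, end) for every run of N in the sequence.

    Coordinates are 0-based and half-open, so end - start is the hole size.
    """
    return [(match.start(), match.end())
            for match in re.finditer(r"[Nn]+", sequence)]


# Usage: count the holes and their total length in one scaffold.
holes = find_holes("ACGTNNNNACGNNA")
hole_count = len(holes)
total_hole_length = sum(end - start for start, end in holes)
```

Summing these values over all scaffolds gives assembly-level figures of the kind reported in the embodiments (number of holes and their total length).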


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G16B30/00
Inventor: 李炯棠, 徐桂彩, 朱锐, 张研, 李尚琪, 孙晓晴
Owner: CHINESE ACAD OF FISHERY SCI