Third-generation sequencing data error correction method, third-generation sequencing data error correction device and computer readable storage medium

A sequencing data error correction technology, applied in the field of bioinformatics, that addresses problems of existing approaches such as data loss, inapplicability to low-depth data, and the lack of a quality value system

Active Publication Date: 2018-10-09
BGI TECH SOLUTIONS

AI Technical Summary

Problems solved by technology

[0007] In summary, the existing technical solutions have the following defects: in typical application scenarios they usually cause a large amount of data loss; reads are shortened, which prevents full use of the read-length advantage of third-generation data; the error correction result is in a pure sequence format with no quality value system, so the error rate of each base in the result cannot be evaluated; and self-correction requires a certain depth of third-generation data, so it is not applicable to low-depth data.



Examples


Embodiment 1

[0082] Example 1: Hybrid error correction of human chromosome 22 PacBio sequencing data

[0083] The raw data in this example are 68× HiSeq PE250 sequencing data and 90× PacBio sequencing data. In this embodiment, 15× of the PacBio sequencing data was extracted as example data to test the performance of the invention on low-depth data.
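The source does not say how the 15× subset was drawn from the 90× data. As a minimal sketch, assuming simple random subsampling of reads, the fraction to keep is the ratio of target depth to full depth. The function names and the toy read list below are hypothetical.

```python
import random


def subsample_fraction(full_depth: float, target_depth: float) -> float:
    """Fraction of reads to keep so the expected depth drops to target_depth."""
    if not 0 < target_depth <= full_depth:
        raise ValueError("target depth must be positive and no larger than full depth")
    return target_depth / full_depth


def subsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (assumed scheme)."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < fraction]


# 90x PacBio data reduced to roughly 15x: keep about one read in six.
fraction = subsample_fraction(90, 15)

# Toy stand-in for a read set; real input would be FASTA/FASTQ records.
toy_reads = [f"read_{i}" for i in range(6000)]
kept = subsample_reads(toy_reads, fraction)
```

Keeping reads independently gives the target depth only in expectation; selecting reads until a base-count budget is reached would be an alternative when read lengths vary widely.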

[0084] 1. Using the 68× HiSeq PE250 sequencing data and the 15× PacBio sequencing data, perform hybrid assembly with the DBG2OLC software to obtain a preliminary reference genome.

[0085] 2. Statistics on the assembly show that the contig N50 is 289 kb, and self-alignment shows no long, highly repetitive sequences, so the assembly is suitable for use as a reference sequence for alignment without any further processing.
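For readers unfamiliar with the N50 statistic quoted in step 2, the following is a small sketch of the standard definition: the largest contig length L such that contigs of length at least L together cover at least half of the assembled bases. The contig lengths are made-up toy values, not data from the patent.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly: total 290 bases, 80 + 70 = 150 covers half.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
```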

[0086] 3. Use the bwa software to align the HiSeq sequencing data and the PacBio sequencing data to the reference sequence.

[0087] 4. Use the maximum a posteriori model to inf...
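The text of step 4 is truncated, so the patent's actual model is not available here. The sketch below only illustrates the general idea of a maximum a posteriori base call with a quality value. It rests on several assumptions: a uniform prior over the four bases, independent observations, errors spread evenly over the three wrong bases, and a Phred-scaled quality capped at 60. The function name and the example error rates are hypothetical.

```python
import math

BASES = "ACGT"


def map_base_call(observations, prior=None, max_q=60.0):
    """Pick the most probable base at one position and a Phred-style quality.

    observations: list of (observed_base, error_probability) pairs from the
    reads covering this position. Assumed model, not the patent's own.
    """
    if prior is None:
        prior = {b: 0.25 for b in BASES}
    log_post = {}
    for true_base in BASES:
        lp = math.log(prior[true_base])
        for obs_base, err in observations:
            if not 0.0 < err < 1.0:
                raise ValueError("error probabilities must lie strictly between 0 and 1")
            # A wrong call is assumed equally likely to be any of the 3 other bases.
            lp += math.log(1.0 - err if obs_base == true_base else err / 3.0)
        log_post[true_base] = lp
    # Normalise in log space to avoid underflow with many observations.
    top = max(log_post.values())
    weights = {b: math.exp(lp - top) for b, lp in log_post.items()}
    z = sum(weights.values())
    best = max(BASES, key=lambda b: weights[b])
    p_wrong = sum(w for b, w in weights.items() if b != best) / z
    quality = max_q if p_wrong <= 0.0 else min(max_q, -10.0 * math.log10(p_wrong))
    return best, quality


# Three accurate short-read observations of A outvote one noisy long-read C.
base, qual = map_base_call([("A", 0.01)] * 3 + [("C", 0.15)])

# A single noisy observation yields a low quality value.
lone_base, lone_qual = map_base_call([("C", 0.15)])
```

The quality value is what lets a downstream tool judge each corrected base individually, which is the gap in pure-sequence output noted in paragraph [0007].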



Abstract

The invention discloses a third-generation sequencing data error correction method comprising the following steps: assembling a reference genome from second-generation sequencing data and/or third-generation sequencing data; aligning the second-generation sequencing data and the third-generation sequencing data to the reference genome; for each base position on each aligned segment in the third-generation alignment result, inferring the most probable base type and assigning it, together with a quality value, to that position; and, where a read contains multiple aligned segments and/or unaligned segments, integrating those segments into the same read. The method places no restriction on the depth of the third-generation sequencing data, so low-depth third-generation data can be corrected; it introduces no additional data loss or read-length loss; and it introduces a quality value system for the error correction result, so the quality of each individual base in the result can be evaluated.
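The final step of the abstract, merging corrected segments and unaligned stretches back into one read, can be pictured with the following sketch. It is an illustration under assumptions, not the patent's procedure: segments are given as half-open intervals in raw-read coordinates, they do not overlap, unaligned stretches keep their raw bases, and those raw bases receive a placeholder quality of 0. The function name and data layout are hypothetical.

```python
def integrate_segments(raw_seq, corrected, raw_quality=0):
    """Rebuild a full-length read from corrected segments plus raw gaps.

    corrected: list of (start, end, seq, quals) where [start, end) is the
    interval of the raw read that the corrected segment replaces. A corrected
    segment may differ in length from its interval after indel fixes.
    """
    out_seq, out_qual = [], []
    cursor = 0
    for start, end, seq, quals in sorted(corrected, key=lambda s: s[0]):
        if start < cursor or start > end or end > len(raw_seq):
            raise ValueError("segments must be in range and non-overlapping")
        if len(seq) != len(quals):
            raise ValueError("each corrected base needs one quality value")
        # Keep the unaligned stretch before this segment, so no length is lost.
        out_seq.append(raw_seq[cursor:start])
        out_qual.extend([raw_quality] * (start - cursor))
        out_seq.append(seq)
        out_qual.extend(quals)
        cursor = end
    # Keep any unaligned tail of the read as well.
    out_seq.append(raw_seq[cursor:])
    out_qual.extend([raw_quality] * (len(raw_seq) - cursor))
    return "".join(out_seq), out_qual


# Toy read: two corrected segments, one of which gained a base.
merged_seq, merged_qual = integrate_segments(
    "AAACCCGGGTTT",
    [(3, 6, "CCAC", [40, 40, 38, 40]), (9, 12, "TTT", [50, 50, 50])],
)
```

Because unaligned stretches are retained rather than trimmed, the output read is never shorter than the corrected segments alone, which matches the abstract's claim of no read-length loss.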

Description

Technical field

[0001] The invention relates to the technical field of bioinformatics, and in particular to a method, a device and a computer-readable storage medium for error correction of third-generation sequencing data.

Background

[0002] Third-generation sequencing platforms, represented by PacBio, offer long reads (10-15 kb on average) and no GC bias, and are therefore widely used in genome assembly and other applications. However, their high error rate (15%-20%) greatly increases the complexity of assembly algorithms. For any assembly strategy, the error rate, length and sequencing depth of the input reads are the main factors that determine assembly quality. Therefore, when assembling third-generation sequencing data, an important data preprocessing step is to make full use of the characteristics of the data to correct errors in the third-generation reads and reduce the data error rate.

[0003] The second-generation seq...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/22, G06F19/24
CPC: G16B30/00, G16B40/00
Inventor: 徐煜李治鑫林哲高强霍守江肖黎
Owner: BGI TECH SOLUTIONS