Scaffolding method based on long reads and contig classification

A scaffolding method based on long reads and contig classification, applied in the field of sequence assembly in bioinformatics. It addresses problems such as the high sequencing error rate of long reads and noisy alignment information, both of which reduce the accuracy of scaffolding.

Inactive Publication Date: 2018-11-16
HENAN POLYTECHNIC UNIV

AI Technical Summary

Problems solved by technology

[0007] The long reads generated by third-generation sequencing technology can reach tens of thousands of bases, so they can span most repeat regions, but their sequencing error rate is high.
This increases the connectivity of the scaffold graph, which in turn affects the accuracy of scaffolding.
[0011] (2) Because the sequencing error rate of long reads is relatively high, the alignment information between long reads and contigs contains considerable noise.
[0012] (3) Existing scaffolding methods often assume that each contig can appear only once in a scaffold.
[0013] These problems prevent existing scaffolding methods from achieving more satisfactory results.

Embodiment Construction

[0061] As shown in Figure 1, the specific implementation process of the present invention is as follows:

[0062] 1. Generate a local scaffold set

[0063] 1.1 This method takes a contig file and a long-read file as input. First, the alignment tool BWA is used to align the long reads to the contigs and obtain the alignment results. Only long reads longer than L_r and contigs longer than L_c are considered, where L_r = 500 and L_c = 3000.
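The length thresholds in step 1.1 can be illustrated with a minimal Python sketch. It assumes sequences are already loaded into dictionaries keyed by name; the function name `filter_by_length` and the sample data are hypothetical, and only the values L_r = 500 and L_c = 3000 come from the text. Running BWA itself is not shown.

```python
from typing import Dict

L_R = 500   # minimum long-read length (L_r in the text)
L_C = 3000  # minimum contig length (L_c in the text)


def filter_by_length(seqs: Dict[str, str], min_len: int) -> Dict[str, str]:
    """Keep only sequences strictly longer than min_len (hypothetical helper)."""
    return {name: seq for name, seq in seqs.items() if len(seq) > min_len}


# Toy sequences, invented for illustration only.
contigs = {"c1": "A" * 3500, "c2": "C" * 3000, "c3": "G" * 120}
reads = {"lr1": "T" * 501, "lr2": "T" * 500}

kept_contigs = filter_by_length(contigs, L_C)
kept_reads = filter_by_length(reads, L_R)
print(sorted(kept_contigs), sorted(kept_reads))
```

The sketch reads "greater than" as a strict comparison, so a contig of exactly 3000 bases is dropped; whether the original method treats the boundary this way is an assumption.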

[0064] 1.2 If a long read and a contig can be aligned, the position and direction of the alignment interval can be obtained. If the j-th long read (lr_j) and the i-th contig (c_i) can be aligned, an interval on lr_j aligns to an interval on c_i. This method uses SPR(c_i, lr_j) to denote the start position of the alignment interval on lr_j, EPR(c_i, lr_j) to denote the end position of the alignment interval on lr_j, and SPC(c_i, lr_j) to represent the starting position of...
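As a rough illustration of how such alignment records might be grouped into local scaffolds (the contigs aligned to the same long read), the following Python sketch stores SPR and EPR per alignment and orders the contigs of each read by SPR. The `Alignment` class, the boolean direction flag, the ordering rule, and the sample coordinates are all assumptions made for this example; the source text is truncated before it specifies the remaining fields or the exact construction.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alignment:
    contig: str    # c_i
    read: str      # lr_j
    spr: int       # SPR(c_i, lr_j): start of the aligned interval on the read
    epr: int       # EPR(c_i, lr_j): end of the aligned interval on the read
    forward: bool  # alignment direction (hypothetical encoding)


def local_scaffolds(alns: List[Alignment]) -> Dict[str, List[str]]:
    """Group alignments by read and order each group's contigs by SPR.

    Ordering by SPR is an assumption of this sketch, not a rule
    quoted from the source.
    """
    by_read: Dict[str, List[Alignment]] = defaultdict(list)
    for a in alns:
        by_read[a.read].append(a)
    return {
        read: [a.contig for a in sorted(group, key=lambda a: a.spr)]
        for read, group in by_read.items()
    }


# Invented coordinates for illustration only.
alns = [
    Alignment("c2", "lr1", 4000, 7200, True),
    Alignment("c1", "lr1", 100, 3600, True),
    Alignment("c3", "lr2", 50, 3300, False),
]
scaffolds = local_scaffolds(alns)
print(scaffolds)
```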

Abstract

The invention discloses a scaffolding method based on long reads and contig classification. First, long reads are aligned to a contig set, and a local scaffold set is generated from the alignment results; each local scaffold is composed of the contigs aligned to the same long read. Based on the position information of each contig in the local scaffolds, all contigs are divided into two categories: repeat contigs and non-repeat contigs. A scaffold graph consisting only of non-repeat contigs is then constructed, in which each node represents a non-repeat contig. Next, a linear programming method is used to eliminate orientation and order conflicts in the scaffold graph, so that the graph contains only simple paths, each of which corresponds to one scaffold. Finally, the repeat contigs are inserted into the scaffolds to form the final scaffolding result. The method is simple and easy to use, gives good scaffolding results on different real data sets, and achieves higher accuracy and continuity than other scaffolding methods.
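The step in which each simple path of the scaffold graph becomes one scaffold can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph is given as an undirected adjacency map, conflicts have already been removed so every node has at most two neighbours and there are no cycles, and contig orientation is ignored. The function name `simple_paths` and the toy graph are hypothetical; the linear programming step and the insertion of repeat contigs are not shown.

```python
from typing import Dict, List, Optional, Set


def simple_paths(adj: Dict[str, Set[str]]) -> List[List[str]]:
    """Return one node list per component of a graph made only of simple paths.

    Assumes every node has degree <= 2 and the graph has no cycles;
    a cycle has no endpoint and would be skipped by this sketch.
    """
    seen: Set[str] = set()
    paths: List[List[str]] = []
    for start in sorted(adj):
        # Start walking only from path endpoints (degree 0 or 1).
        if start in seen or len(adj[start]) > 1:
            continue
        path: List[str] = []
        prev: Optional[str] = None
        node: Optional[str] = start
        while node is not None:
            path.append(node)
            seen.add(node)
            nxt = [n for n in adj[node] if n != prev and n not in seen]
            prev, node = node, (nxt[0] if nxt else None)
        paths.append(path)
    return paths


# Toy scaffold graph over non-repeat contigs, invented for illustration.
graph = {
    "c1": {"c2"},
    "c2": {"c1", "c3"},
    "c3": {"c2"},
    "c4": {"c5"},
    "c5": {"c4"},
    "c6": set(),
}
scaffolds = simple_paths(graph)
print(scaffolds)
```

Starting each walk at the lexicographically smallest endpoint makes the output deterministic; a real scaffolder would also need to track the orientation of each contig along the path.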

Description

Technical field

[0001] The present invention relates to the field of sequence assembly in bioinformatics, in particular to a scaffolding method based on long reads and contig classification.

Background

[0002] A genome generally refers to all coding and non-coding deoxyribonucleic acid (DNA) sequences, which are composed of four bases: adenine (A), thymine (T), cytosine (C) and guanine (G). A genome sequence is therefore a string over the four characters A, T, G and C. In practice, genome sequences also contain the character N, which indicates that the base at that position cannot be determined. Genomic DNA sequences contain the genetic and regulatory information that guides biological development and the functioning of life. Complete and correct genomic DNA sequences have become indispensable in basic biological research and in numerous application fields such as diagnostics, biotechnology, forensic biology and biosystematics. ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/26, G06F19/22, G16B45/00
Inventors: 罗军伟, 王俊峰, 张波, 张霄宏, 贾利琴
Owner HENAN POLYTECHNIC UNIV