Genome de novo sequence assembly method based on read and distance distribution

A sequence assembly and genome technology, applied in the field of bioinformatics, that addresses scoring functions which ignore sequencing depth, unbalanced sequencing depth, and complex repetitive regions, thereby reducing the influence of repetitive regions on the assembly.

Active Publication Date: 2014-12-10
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

However, this method does not consider the sequencing depth when designing the scoring function.
[0008] At present, although existing methods use paired-end reads for sequence assembly, two problems remain to be resolved: (1) Existing methods can eliminate some repetitive regions by using the partial read sets generated from paired-end reads, but cannot overcome the influence of some complex repeat regio...

Examples

Embodiment Construction

[0036] As shown in Figure 1, the invention is implemented as follows:

[0037] 1. Construction of the De Bruijn graph

[0038] A read library in FASTA format is read in. Every read in the library has the same length, and the left and right reads of each paired-end read appear consecutively in the library, corresponding to the forward and reverse strands, respectively. Only the four bases {A, T, G, C} appear in any library. Each read is a string of fixed length, and each k-mer is a string of length k, so a read of length r contains r-k+1 k-mers in total. Each node in the initial De Bruijn graph corresponds to one k-mer.
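The k-mer count in paragraph [0038] can be checked with a small Python sketch. The function name `kmers` and the input validation are illustrative assumptions, not part of the patent text; the sketch only shows that a read of length r yields r-k+1 overlapping k-mers.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return the k-mers of a read in left-to-right order.

    A read of length r yields r-k+1 k-mers (none if r < k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # The source states that only the bases A, T, G, C occur in a library.
    if set(read) - set("ATGC"):
        raise ValueError("read contains a base outside {A, T, G, C}")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Hypothetical example read of length r = 6 with k = 3.
example = kmers("ATGCAT", 3)
```

With r = 6 and k = 3 the list has 6-3+1 = 4 entries, matching the formula in the text.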

[0039] Each read is processed in turn: for each of its r-k+1 k-mers, the position of the k-mer in the De Bruijn graph is looked up, and if the k-mer does not yet exist, it is added to the De Bruijn graph as a new node. Every two k-mers that are adjacent in this read, that is, the last k-1 bases of one k-mer are the same as the firs...
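The node-insertion step in paragraph [0039] can be sketched in Python as follows. Because the source text is truncated, two things here are assumptions: that adjacent k-mers in a read are joined by a directed edge (the usual De Bruijn construction), and that a per-node occurrence count is kept. The name `build_de_bruijn` and the dictionary layout are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int):
    """Build a k-mer De Bruijn graph from a list of reads.

    Returns (nodes, edges): nodes maps each k-mer to how often it was seen
    (an assumed bookkeeping detail), and edges maps a k-mer to the set of
    k-mers that directly follow it in at least one read.
    """
    nodes: dict[str, int] = {}
    edges: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        prev = None
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Add the k-mer as a node if it is new, then count the occurrence.
            nodes[kmer] = nodes.get(kmer, 0) + 1
            # Consecutive k-mers in a read overlap by k-1 bases by construction;
            # linking them with a directed edge is an assumption here.
            if prev is not None:
                edges[prev].add(kmer)
            prev = kmer
    return nodes, dict(edges)


# Two hypothetical overlapping reads, k = 3.
nodes, edges = build_de_bruijn(["ATGCA", "TGCAT"], 3)
```

A hash map keyed by the k-mer string makes the "find the position of the k-mer" lookup constant time on average; a production assembler would more likely use a compact 2-bit encoding.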


Abstract

The invention discloses a genome de novo sequence assembly method based on read and distance distribution. In the method, the overlap relations between reads are stored in a De Bruijn graph, and a new scoring function based on read distribution is proposed and applied to contig construction, scaffolding, gap filling, and related steps. The scoring function takes full account of sequencing depth, k-mer frequency, and insert-size deviation in complex repeat regions. The method is simple and easy to implement, assembles well on a range of simulated and real sequencing data, and achieves higher continuity and completeness than other sequence assembly methods.

Description

Technical field

[0001] The invention relates to the field of bioinformatics, in particular to a genome de novo sequence assembly method based on read counts and distance distribution.

Background art

[0002] A genome generally refers to all coding and non-coding deoxyribonucleic acid (DNA) sequences of an organism. The DNA sequence is the carrier of genetic information and the basis for synthesizing protein amino acid sequences, and it guides the development of organisms and the operation of life functions. The DNA sequence is therefore a decisive factor in the existence and development of life, and everything that happens in life activities is closely linked to it. DNA sequences have become indispensable knowledge in basic biological research and in numerous applied fields such as diagnostics, biotechnology, forensic biology, and biosystematics. Genome sequencing refers to the analysis of the base sequence of a specific DNA fragment, that is, the arrangement of adenine ...

Application Information

IPC(8): G06F19/18
Inventors: 王建新, 罗军伟, 李敏
Owner: CENT SOUTH UNIV