Parallel accelerating method for BWT index construction for multiple sequences

A multi-sequence indexing technique, applied in the field of bioinformatics, that addresses slow and inefficient BWT index construction. It reduces construction time, improves the construction process, and is easy to port and deploy.

Inactive Publication Date: 2015-09-09
NAT UNIV OF DEFENSE TECH

AI Technical Summary

Problems solved by technology

[0008] The technical problem to be solved by the present invention is that in the existing large-scale sequence set BWT index construction, the sequence set is sorted in


Examples


Embodiment Construction

[0029] With reference to Figure 1, the present invention is described in further detail below, taking as an example the construction of a BWT index for 1 billion DNA sequences of length 100 on a cluster with 64 GB of memory per node (hereinafter referred to as this example). The alphabet of the DNA sequences is Σ = {A, C, G, T}, whose size is 4, that is, σ = 4.

[0030] As shown in Figure 1, the BWT index parallel acceleration algorithm proposed by the present invention comprises 8 main steps.

[0031] Step 1: Determine the length l of the delimiting prefix used when the suffixes are divided into blocks, according to the scale of the sequence set and the processor memory size; take l = [formula missing in source]. For this example, m = 10^9, k = 100, σ = 4, M = 64 × 2^30, and the calculation gives l = 4.

[0032] Step 2: Compute σ^l = 4^4 = 256 and provision a cluster system containing 256 processors (CPUs), numbered p_1, p_2, ..., p_256.
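The arithmetic behind Steps 1 and 2 can be checked with a short sketch. It reads m as the number of sequences, k as the sequence length, and M as the per-node memory in bytes, and it takes l = 4 as implied by σ^l = 4^4; these readings are assumptions, since the source does not reproduce the formula for l. The per-bucket figure also assumes that prefixes are uniformly distributed and ignores terminator suffixes, neither of which holds exactly for real reads.

```python
# Parameters of the worked example (interpretations are assumptions).
m = 10**9          # number of sequences
k = 100            # length of each sequence
sigma = 4          # alphabet size for {A, C, G, T}
l = 4              # prefix length, as implied by sigma**l == 4**4
M = 64 * 2**30     # memory per node in bytes (64 GiB)

num_buckets = sigma ** l                        # one bucket per l-character prefix
total_suffixes = m * k                          # one suffix per position per sequence
avg_per_bucket = total_suffixes // num_buckets  # assumes a uniform prefix distribution

print(num_buckets, total_suffixes, avg_per_bucket, M)
```

Under these assumptions each of the 256 processors would handle roughly 3.9 × 10^8 suffixes on average, instead of a single sort over 10^11 suffixes.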

[0033] Step 3: Allocate σ^l = 256 buckets in the cluster system memory, the l...
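One way to map an l-character prefix to one of the σ^l buckets is to read the prefix as a base-σ number. The sketch below is an illustration under that assumption and is not taken from the patent text; the names `ALPHABET`, `bucket_index`, and `all_prefixes` are hypothetical. It ignores suffixes shorter than l, which would need separate handling.

```python
from itertools import product

ALPHABET = "ACGT"                               # sigma = 4, in lexicographic order
RANK = {c: i for i, c in enumerate(ALPHABET)}   # A=0, C=1, G=2, T=3

def bucket_index(prefix):
    """Interpret an l-character prefix as a base-sigma integer (0-based)."""
    idx = 0
    for ch in prefix:
        idx = idx * len(ALPHABET) + RANK[ch]
    return idx

def all_prefixes(l):
    """Enumerate every l-character string over the alphabet in lexicographic order."""
    return ["".join(p) for p in product(ALPHABET, repeat=l)]

# With l = 4 there are 4**4 = 256 prefixes, and index order matches lexicographic order.
print(len(all_prefixes(4)), bucket_index("AAAA"), bucket_index("TTTT"))
```

Because the index order matches the lexicographic order of the prefixes, bucket i could be assigned to processor p_(i+1), and concatenating the buckets by index preserves the global suffix order.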

Abstract

The invention discloses a parallel acceleration method for BWT index construction for multiple sequences. It aims to solve the slow speed and low efficiency of existing BWT index construction for large-scale sequence sets, which partitions and sorts the sequence set, then repeatedly merges the blocks in pairs and re-sorts them recursively. The technical scheme is as follows: traverse all suffixes of each sequence in the sequence set R, inspect the first l characters of each suffix, and place suffixes with the same first l characters in the same memory sub-block; sort the suffixes within each sub-block independently and in parallel; concatenate the sorted sub-blocks to obtain the order of all suffixes in R; then take the BWT character of each suffix in ascending lexicographic order and concatenate these characters to obtain the BWT index of R. The method effectively improves BWT index construction for multiple sequences and reduces whole-genome assembly time by about 90%.
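The four stages named in the abstract (partition by prefix, sort each sub-block in parallel, concatenate, read off the BWT characters) can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each read is terminated with a `$` sentinel, identical suffixes are ordered by read index, and a thread pool stands in for the cluster processors. The function names are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def bwt_by_prefix_buckets(reads, l, workers=4):
    """Build a multi-sequence BWT by bucketing suffixes on their first l characters."""
    # Assumption: terminate each read with '$', which sorts before A, C, G, T.
    terminated = [r + "$" for r in reads]
    buckets = defaultdict(list)
    for seq_id, s in enumerate(terminated):
        for pos in range(len(s)):
            suffix = s[pos:]
            # BWT character: the one preceding the suffix; position 0 wraps to '$'.
            bwt_char = s[pos - 1] if pos > 0 else "$"
            buckets[suffix[:l]].append((suffix, seq_id, bwt_char))
    keys = sorted(buckets)  # bucket keys in lexicographic order
    # Each bucket is sorted independently; threads stand in for processors.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_buckets = list(pool.map(sorted, (buckets[key] for key in keys)))
    # Concatenate the buckets in key order and read off the BWT characters.
    return "".join(ch for bucket in sorted_buckets for (_, _, ch) in bucket)

def bwt_naive(reads):
    """Reference: sort all suffixes at once with the same tie-breaking rule."""
    rows = []
    for seq_id, r in enumerate(reads):
        s = r + "$"
        rows.extend((s[p:], seq_id, s[p - 1] if p > 0 else "$") for p in range(len(s)))
    return "".join(ch for (_, _, ch) in sorted(rows))

reads = ["ACGT", "ACCA", "GATT"]
print(bwt_by_prefix_buckets(reads, 2) == bwt_naive(reads))
```

Since `$` can only appear at the end of a suffix, no bucket key is a proper prefix of another, so sorting the keys and concatenating the sorted buckets yields the same order as one global sort.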

Description

[0001] Technical field: The present invention relates to whole-genome assembly methods in the field of bioinformatics, and in particular to a method for parallel acceleration of Burrows-Wheeler transform (hereinafter referred to as BWT) index construction for large-scale short-sequence sets (more than 100 million sequences) during whole-genome assembly.

[0002] Background: Whole-genome assembly is a core problem in bioinformatics and the basis and prerequisite for other research in genomics. The genome of a typical organism contains millions or even billions of bases, but current gene sequencing technology can only read sequence fragments of a few hundred bases at a time. The process of reconstructing the original genome from the short sequences obtained by sequencing, according to the overlap relationships between them, is called genome assembly. For N sequence fragments, directly calculating the overlapping relati...

Application Information

IPC(8): G06F19/22
Inventor 彭绍亮朱小谦王恒卢宇彤杨灿群吴诚堃崔英博刘欣王海强程乾夏徐伟
Owner NAT UNIV OF DEFENSE TECH