Parallel acceleration method for sorting big data genome alignment files

The method applies big data genome technology in the field of high-performance computing. It addresses problems such as the inability to make rational use of computing and hardware resources, and it improves file read/write efficiency, shortens processing time, and reduces the number of

Pending Publication Date: 2020-02-07
INST OF COMPUTING TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

However, the existing sorting methods and tools, such as SAMtools, cannot achieve reasona




Embodiment Construction

[0031] The present invention will be further described in detail below in conjunction with the examples and drawings, but the implementation of the present invention is not limited to this.

[0032] As shown in Figure 1, the parallel acceleration method of the present invention for sorting big data genome alignment files includes, in summary, the following steps:

[0033] Step 101: Obtain a target BAM file to be sorted, read and decompress the target BAM file, and store the data in a contiguous first buffer B;

[0034] Step 102: Each time the first buffer B is full, its data is distributed across multiple threads, each of which sorts its own share; the results of the sorting threads are merged by heap sorting, and the merged data is compressed to form an intermediate file;

[0035] Step 103: Read the intermediate files in sequence, associate a second buffer for each intermediate file to be read, read and decompress the intermediate file to be read, and put it into the associated first b...
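The flow of steps 101 to 103 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: records are hypothetical `(pos, name)` tuples rather than real BAM records, `zlib` stands in for BAM's block compression, a Python list stands in for buffer B, and every function name is invented for the example. Python threads also do not sort in true parallel because of the interpreter lock, so the sketch shows the structure only.

```python
import heapq
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def sort_chunk_parallel(buffer, n_threads):
    """Split one full buffer into slices, sort each slice in a thread,
    then merge the sorted slices with a heap (heapq.merge)."""
    size = max(1, -(-len(buffer) // n_threads))  # ceiling division
    slices = [buffer[i:i + size] for i in range(0, len(buffer), size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        sorted_slices = list(pool.map(sorted, slices))
    return list(heapq.merge(*sorted_slices))


def write_run(records, directory):
    """Compress one sorted run and write it as an intermediate file.
    Assumes names contain no tab or newline characters."""
    payload = "\n".join(f"{pos}\t{name}" for pos, name in records).encode()
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
    with os.fdopen(fd, "wb") as fh:
        fh.write(zlib.compress(payload))
    return path


def read_run(path):
    """Read and decompress one intermediate file, yielding records in order."""
    with open(path, "rb") as fh:
        text = zlib.decompress(fh.read()).decode()
    for line in text.splitlines():
        pos, name = line.split("\t")
        yield int(pos), name


def external_sort(records, buffer_capacity, n_threads, directory):
    """Phase 1: fill the buffer, sort it, spill a compressed run each time
    it is full. Phase 2: heap-merge all runs into one sorted list."""
    buffer, runs = [], []
    for rec in records:
        buffer.append(rec)
        if len(buffer) == buffer_capacity:
            runs.append(write_run(sort_chunk_parallel(buffer, n_threads), directory))
            buffer = []
    if buffer:  # spill the final, partially filled buffer
        runs.append(write_run(sort_chunk_parallel(buffer, n_threads), directory))
    merged = list(heapq.merge(*(read_run(p) for p in runs)))
    for p in runs:
        os.remove(p)
    return merged
```

Calling `external_sort(records, buffer_capacity=50, n_threads=4, directory=some_dir)` returns the records in sorted order. A larger `buffer_capacity` produces fewer intermediate files, which is the trade-off the method aims to exploit.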



Abstract

The invention discloses a parallel acceleration method for sorting big data genome alignment files. The method comprises: reading and decompressing a target BAM file and storing it in a contiguous first buffer B; performing multi-threaded sorting after the first buffer B is full and merging the results by heap sorting to form intermediate files; reading the intermediate files in sequence, putting each into an associated second buffer MB, and merging the data of the second buffers MB by heap sorting; and compressing the merged data with multiple threads and writing the compressed data to a result file. The advantages of the method are that threads are allocated independently for reading and decompression, separate thread pools are built for decompression and compression, fewer threads are created, multi-thread resources are fully used, file read/write efficiency is improved, the number of intermediate files and the number of memory copy operations are reduced, and processing time is shortened.
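The final stage described in the abstract, compressing the merged data with a pool of threads and writing it to a result file, might look like the sketch below. It is an assumption-laden illustration: the length-prefixed block layout, the `BLOCK_SIZE` value, and the function names are hypothetical and are not the BAM/BGZF format or the patent's own code. `pool.map` returns results in input order, so blocks are written in their original sequence even though they are compressed concurrently.

```python
import io
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # hypothetical block size, chosen for illustration


def compress_blocks(data, out, n_threads=4, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks, compress them in a thread pool,
    and write each as <4-byte little-endian length><compressed bytes>."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() preserves input order, so output blocks stay in sequence.
        for comp in pool.map(zlib.compress, blocks):
            out.write(struct.pack("<I", len(comp)))
            out.write(comp)
    return len(blocks)


def decompress_blocks(inp):
    """Read back the length-prefixed blocks and return the original bytes."""
    parts = []
    while True:
        header = inp.read(4)
        if not header:
            break
        (length,) = struct.unpack("<I", header)
        parts.append(zlib.decompress(inp.read(length)))
    return b"".join(parts)
```

Reusing one pool for all blocks, rather than starting a thread per block, reflects the abstract's point about reducing the number of threads created.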

Description

Technical field

[0001] The invention relates to the field of high-performance computing, and in particular to a parallel acceleration method for sorting big data genome alignment files.

Background technique

[0002] In recent years, advances in gene sequencing technology have driven rapid development in the field of genetic health. The rapid growth of genetic data poses greater challenges for genetic analysis technology, and how to process this big data quickly has become an active research direction in bioinformatics and high-performance computing.

[0003] In clinical and scientific research, the mainstream analysis pipeline for human genome data includes genome alignment, sorting, duplicate removal, indel realignment, base quality score recalibration, variant detection, and other steps. The intermediate files to be sorted range from about ten Gb to hundreds of Gb. Existing processing software, such as SAMtools, its sorting proce...


Application Information

IPC(8): G16B30/10, G16B50/50, G16B50/30
CPC: G16B30/10, G16B50/50, G16B50/30
Inventor: 张中海, 谭光明, 张春明, 姚二林
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI