Method for realizing parallel compression and parallel decompression on FASTQ file containing DNA (deoxyribonucleic acid) sequence read data

A data compression and decompression technology, applied in concurrent instruction execution, electrical digital data processing, special data processing applications, etc. It addresses the lack of published parallel algorithms of this kind for multi-node, multi-core CPUs.

Inactive Publication Date: 2014-02-05
INST OF SOFTWARE - CHINESE ACAD OF SCI +1
3 Cites, 43 Cited by

AI Technical Summary

Problems solved by technology

[0005] The G-SQZ and DSRC algorithms mentioned above are both serial algorithms, and no research articles or patents describe parallel versions of this type of algorithm for multi-node, multi-core CPUs.

Embodiment Construction

[0062] The present invention provides a method for parallel compression and decompression of FASTQ files containing DNA sequence read data. To make the purpose, technical solution and effect of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

[0063] The raw data reading thread of the parallel compression method for FASTQ files is described in detail below. Its implementation steps are as follows:

[0064] (1) Open the FASTQ file containing the raw DNA sequence read data to be compressed.

[0065] (2) Obtain the memory page size of the machine on which the program is running.

[0066] (3) Set the size of the memory-mapped region according to the memory page size.

[0067] (4) According to the rang...
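Steps (1) to (3) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function name `iter_mapped_windows` and the `pages_per_window` parameter are hypothetical, and the truncated step (4) is not reproduced. The sketch only shows a file being opened, the mapping granularity being queried, and the file being read through memory-mapped windows whose size is a multiple of that granularity.

```python
import mmap
import os


def iter_mapped_windows(path, pages_per_window=4):
    """Yield (offset, data) pairs, reading the file through page-aligned mmap windows.

    pages_per_window is a hypothetical tuning knob, not a value from the patent.
    """
    # Step (2): query the granularity that mmap offsets must respect.
    # On most Unix systems this equals the memory page size (mmap.PAGESIZE).
    granularity = mmap.ALLOCATIONGRANULARITY
    # Step (3): size the mapped window as a whole number of pages.
    window = granularity * pages_per_window
    size = os.path.getsize(path)
    # Step (1): open the FASTQ file that is to be compressed.
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as m:
                # Copy the window out so the mapping can be released.
                yield offset, m[:]
            offset += length
```

Because every offset is a multiple of the window size, each mapping starts on an aligned boundary, which is what `mmap` requires. A window boundary will usually fall in the middle of a FASTQ record, so a real reader would still need to realign on record boundaries before handing data to a compressor.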

Abstract

The invention discloses a method for parallel compression and parallel decompression of FASTQ files containing DNA (deoxyribonucleic acid) sequence read data. The method uses circular double-buffer queues, circular double memory mapping and memory mapping, combined with data segmentation, multithreaded pipelined parallel compression and decompression, read/write-order two-dimensional arrays and similar techniques, to compress and decompress a FASTQ file in parallel both across multiple processes and across multiple threads within each process. The method can be implemented with MPI and OpenMP, or with MPI and Pthreads (POSIX threads). By making full use of all compute nodes and of the computing power of the multi-core CPUs (central processing units) within each node, the method removes the processor, memory and other resource constraints that limit a serial compression and decompression program.
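The idea of a double-buffer queue feeding a pipelined worker can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patented method: it runs in a single process with one reader thread and one compressor thread, the MPI layer is omitted, `zlib` stands in for the actual FASTQ compressor, and the name `pipeline_compress` is hypothetical.

```python
import queue
import threading
import zlib


def pipeline_compress(blocks, n_buffers=2):
    """Compress blocks with a reader thread and a compressor thread
    that hand a fixed set of buffers back and forth."""
    free = queue.Queue()    # buffers the reader may fill
    filled = queue.Queue()  # buffers waiting to be compressed
    for _ in range(n_buffers):
        free.put(bytearray())
    out = []

    def reader():
        for block in blocks:
            buf = free.get()   # blocks until the compressor returns a buffer
            buf[:] = block     # refill the recycled buffer in place
            filled.put(buf)
        filled.put(None)       # sentinel: no more input

    def compressor():
        while True:
            buf = filled.get()
            if buf is None:
                break
            # zlib is a stand-in for the real compressor (assumption).
            out.append(zlib.compress(bytes(buf)))
            free.put(buf)      # recycle the buffer

    threads = [threading.Thread(target=reader),
               threading.Thread(target=compressor)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

With two buffers, the reader can fill one while the compressor works on the other, and the reader stalls only when both are in use. Output order matches input order here because there is a single compressor thread; with several workers, some explicit record of block order would be needed.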

Description

Technical field

[0001] The invention relates to the fields of bioinformatics, data compression and high-performance computing, and in particular to a method for parallel compression and parallel decompression of FASTQ files containing DNA sequence read data.

Background

[0002] One of the main tasks of bioinformatics is to collect and analyze large amounts of genetic data. These data are critical for genetic research, helping to identify genetic components that prevent or cause disease and to develop targeted therapies. High-throughput sequencing methods and equipment generate massive amounts of short-read sequence data. The common way to store, manage and transmit DNA sequence read data is the FASTQ file format, which mainly contains the DNA sequence read data and annotation information corresponding to each DNA base, such as quality scores, which represent the uncertainty of the sequencing labeling process. Read sequence markers and other description...
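To make the format concrete, the following Python sketch parses FASTQ records. It assumes the common four-line layout (a title line starting with `@`, the base sequence, a separator line starting with `+`, and one quality character per base) with no line wrapping, and it assumes the Phred+33 quality encoding; other encodings exist. The names `FastqRecord`, `parse_fastq` and `phred_scores` are illustrative and do not come from the patent.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    title: str     # text after the leading '@'
    sequence: str  # bases, e.g. "ACGT"
    quality: str   # one quality character per base


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Parse four-line FASTQ records from an iterable of text lines."""
    it = iter(lines)
    while True:
        title = next(it, None)
        if title is None:
            return  # clean end of input
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError("truncated FASTQ record")
        title, seq, plus, qual = (s.rstrip("\r\n")
                                  for s in (title, seq, plus, qual))
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(title[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode quality characters to integer scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality]
```

Each record thus carries three streams with different statistics: titles, bases and quality characters.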

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/38, G06F17/30, H03M7/30
Inventor: 郑晶晶, 王婷, 张常有, 詹科
Owner INST OF SOFTWARE - CHINESE ACAD OF SCI