A Parallel Compression Method for Gene Sequencing Data Quality Score

A quality score compression technology for gene sequencing data, applied in bioinformatics, instrumentation, biostatistics, and related fields. It addresses the low processing speed, and the resulting lack of practicality, of existing tools, and improves processing speed, practicality, and applicability.

Active Publication Date: 2021-06-11
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0006] The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a parallel compression method for gene sequencing data quality scores. The method effectively improves the processing speed of quality score compression and scales well. It addresses the lack of practicality of specialized quality score compression tools, which is caused by their low processing speed, and meets the need to compress gene sequencing data efficiently in a big-data context.

Examples

Embodiment Construction

[0030] The present invention will be further described below in conjunction with specific examples.

[0031] As shown in Figure 1, the parallel compression method for gene sequencing data quality scores provided in this embodiment includes the following steps:

[0032] 1) In the input FASTQ file, every 4 lines represent one gene sequencing record, as shown in Figure 2. The fourth line of each record holds the quality scores. It has the same length as the base sequence in the second line, and each quality score represents the sequencing accuracy of the base at the same position in the second line. Only the quality score line is kept for the subsequent compression process.
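The extraction in step 1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `extract_quality_lines` and the sample records are hypothetical, and the input is assumed to be a well-formed FASTQ text with exactly 4 lines per record.

```python
def extract_quality_lines(fastq_text):
    """Return the quality-score line (4th line) of each 4-line FASTQ record."""
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    qualities = []
    for i in range(0, len(lines), 4):
        bases, quality = lines[i + 1], lines[i + 3]
        # The quality line must be as long as the base sequence line.
        if len(bases) != len(quality):
            raise ValueError(f"record {i // 4}: length mismatch")
        qualities.append(quality)
    return qualities


# Hypothetical two-record sample used only for illustration.
SAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#I\n"
quality_lines = extract_quality_lines(SAMPLE)
```

The length check mirrors the property stated above: each quality score corresponds to the base at the same position in the second line.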

[0033] 2) For the extracted quality scores, the main thread calculates a score for each line; the higher the score, the more high-frequency substrings the line contains. Quality scores are assigned to Category 1 or Category 2, line by line, according to t...
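One way to picture the line scoring in step 2 is the sketch below. The source text is truncated before it gives the scoring formula or the classification rule, so everything specific here is an assumption made for illustration: the substring length `k`, the choice of the `top_n` most common substrings as "high-frequency", the `threshold`, and the decision to put high-scoring lines in Category 1.

```python
from collections import Counter


def frequent_substrings(lines, k=3, top_n=4):
    """Collect the top_n most common length-k substrings (assumed definition)."""
    counts = Counter(
        line[i:i + k] for line in lines for i in range(len(line) - k + 1)
    )
    return {sub for sub, _ in counts.most_common(top_n)}


def line_score(line, frequent, k=3):
    """Count positions in the line where a high-frequency substring starts."""
    return sum(1 for i in range(len(line) - k + 1) if line[i:i + k] in frequent)


def classify(lines, frequent, threshold, k=3):
    """Split lines into two categories by score (assumed rule: score >= threshold)."""
    category_1, category_2 = [], []
    for line in lines:
        target = category_1 if line_score(line, frequent, k) >= threshold else category_2
        target.append(line)
    return category_1, category_2


# Hypothetical quality lines used only for illustration.
lines = ["IIIIII", "IIIIII", "!#$%&("]
frequent = frequent_substrings(lines, k=3, top_n=1)
cat1, cat2 = classify(lines, frequent, threshold=2)
```

Grouping lines with similar substring statistics into the same category is intended to give each compressed block more homogeneous content; the exact criterion used by the invention is not recoverable from the truncated text.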

Abstract

The invention discloses a method for parallel compression of gene sequencing data quality scores, comprising the steps of: 1) splitting the FASTQ format file data to obtain the quality score part; 2) calculating a score for each line of quality scores, and classifying the line according to that score; 3) when the number of quality scores in a category reaches the threshold, or when there are no more quality scores in that category, putting the quality scores of that category into the calculation buffer queue as a data block; 4) having an idle computing unit take a data block from the calculation buffer queue, transform it, encode it with vectorization-optimized ZPAQ, and put the result into the output buffer queue; 5) having the output processing unit write out the compressed data until all compressed data has been output, and then appending the maintenance information. The technical scheme of the invention offers high performance and strong scalability.
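The buffer-queue structure of steps 3 to 5 can be sketched with standard-library threads and queues. This is a simplified illustration under stated assumptions, not the patented implementation: `zlib` stands in for the vectorization-optimized ZPAQ encoder, the transformation step is omitted, only one category is shown, and the names `BLOCK_LINES`, `NUM_WORKERS`, and `compress_parallel` are hypothetical.

```python
import queue
import threading
import zlib

BLOCK_LINES = 2   # hypothetical block-size threshold (lines per data block)
NUM_WORKERS = 2   # hypothetical number of computing units
STOP = None       # sentinel telling a worker to exit


def make_blocks(lines, block_lines=BLOCK_LINES):
    """Yield (block_id, payload); the last block may be shorter than the threshold."""
    for start in range(0, len(lines), block_lines):
        chunk = lines[start:start + block_lines]
        yield start // block_lines, "\n".join(chunk).encode("ascii")


def worker(compute_q, output_q):
    """Computing unit: take a block, encode it, pass it to the output queue."""
    while True:
        item = compute_q.get()
        if item is STOP:
            break
        block_id, payload = item
        # zlib is a stand-in here; the source uses vectorization-optimized ZPAQ.
        output_q.put((block_id, zlib.compress(payload)))


def compress_parallel(lines):
    compute_q, output_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(compute_q, output_q))
        for _ in range(NUM_WORKERS)
    ]
    for t in workers:
        t.start()
    n_blocks = 0
    for block in make_blocks(lines):
        compute_q.put(block)
        n_blocks += 1
    for _ in workers:
        compute_q.put(STOP)
    for t in workers:
        t.join()
    # Output unit: collect results and restore the original block order.
    results = dict(output_q.get() for _ in range(n_blocks))
    return [results[i] for i in range(n_blocks)]


compressed = compress_parallel(["IIII", "!!#I", "ABCD"])
```

Tagging each block with an identifier lets the output stage restore the input order even though workers may finish in any order; a real tool would record such bookkeeping in the information appended at the end of the output.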

Description

Technical field

[0001] The invention relates to the technical field of biological gene sequencing data compression, and in particular to a parallel compression method for gene sequencing data quality scores.

Background art

[0002] With the development of second-generation sequencing technology, the cost of gene sequencing has dropped rapidly. In contrast, the cost of storing and transmitting gene sequencing data has risen sharply as a share of total expenses. It is therefore of great significance to reduce the storage and transmission costs of gene sequencing data. Data compression can effectively reduce the size of gene sequencing data and is a key technology for reducing these costs. Although research on genetic data compression tools has produced many results, many compression schemes are too slow, which severely undermines their practicality.

[0003] At present, the FASTQ format has ...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B50/50, G16B40/20
CPC: G16B40/20, G16B50/50
Inventor: 董守斌, 柯璧新, 付佳兵, 胡金龙
Owner: SOUTH CHINA UNIV OF TECH