Compression method for next generation sequencing data

A compression method for next-generation (second-generation) sequencing data, applicable to electrical digital data processing, special data processing applications, instruments, and related fields. It addresses the low compression ratio of existing tools, reducing storage space and improving processing speed.

Active Publication Date: 2016-07-13
SHENZHEN HUADA GENE INST

AI Technical Summary

Problems solved by technology

The compression ratio of existing algorithms and tools for sequencing data is not high.



Examples


Embodiment 1

[0034] This embodiment uses data from the 1000 Genomes Project as an example. Sample NA12345 is one of the more than one thousand samples in that project and is used here for convenience of description. Its second-generation sequencing data is stored in fastq format in a file named example.fastq. Steps S11 to S16 below compress this second-generation sequencing data.

[0035] In this embodiment, step S11 generates a BSSL initial file. The details are as follows.

[0036] In step S11, the split command is first used to split example.fastq into multiple small files of 80000000 lines each (this is the aforementioned first preset length; other values may also be used). The system can name the resulting small files automatically; for example, the first file is named exampleaa.fastq. The split comman...
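The splitting in step S11 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the function name split_by_lines and the output directory are hypothetical, and the two-letter suffixes (aa, ab, ...) simply mirror the exampleaa.fastq name given in the text. Because a fastq record conventionally spans four lines and 80000000 is a multiple of four, a chunk boundary at that length should not cut a record in half.

```python
import itertools
from pathlib import Path
from string import ascii_lowercase


def split_by_lines(src, lines_per_chunk, out_dir):
    """Split a text file into consecutive chunks of at most lines_per_chunk lines.

    Hypothetical helper: chunk names are <stem><aa..zz><extension>,
    for example exampleaa.fastq, exampleab.fastq, ...
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Two-letter suffixes give room for 26 * 26 = 676 chunks.
    suffixes = ("".join(pair) for pair in itertools.product(ascii_lowercase, repeat=2))
    chunks = []
    writer = None
    with src.open("r") as reader:
        for index, line in enumerate(reader):
            # Start a new chunk file every lines_per_chunk lines.
            if index % lines_per_chunk == 0:
                if writer is not None:
                    writer.close()
                suffix = next(suffixes, None)
                if suffix is None:
                    raise ValueError("more than 676 chunks; use a larger chunk size")
                path = out_dir / f"{src.stem}{suffix}{src.suffix}"
                writer = path.open("w")
                chunks.append(path)
            writer.write(line)
    if writer is not None:
        writer.close()
    return chunks
```

Calling split_by_lines("example.fastq", 80000000, "chunks") would return the list of chunk paths in order. The file is streamed line by line, so memory use stays small even for very large inputs.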


Abstract

The invention discloses a compression method for next-generation sequencing data. The method comprises: dividing the next-generation sequencing data of each sample according to a first preset length to generate a BSSL original file; establishing a cutting tag file according to a second preset length; processing the BSSL original file according to the cutting tag file to obtain BSSL intermediate files; combining the BSSL intermediate files to obtain a BSSL final file; counting the frequency distribution of seed sequences in the BSSL final file and deriving a seed file from the result; and determining compression rules from the format characteristics of the sequencing data and, based on the seed file, compressing the next-generation sequencing data of each sample. Dividing the data and processing the parts in parallel improves processing speed. Compressing the data according to its format characteristics and the seed file obtained through seed sequence selection greatly reduces the storage space the next-generation sequencing data requires.
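The seed-counting step can be pictured with a minimal Python sketch. Everything specific here is an assumption made for illustration: the visible text does not define the BSSL file layout, how a seed sequence is delimited, or how seeds are chosen from the frequency distribution. The sketch treats a seed candidate as any fixed-length substring of a read and keeps the most frequent candidates; the names count_seed_candidates and select_seeds are hypothetical.

```python
from collections import Counter


def count_seed_candidates(sequences, seed_length):
    """Count every substring of length seed_length across all sequences.

    Assumption: a seed candidate is a fixed-length substring of a read.
    """
    counts = Counter()
    for seq in sequences:
        # Sequences shorter than seed_length contribute no candidates.
        for start in range(len(seq) - seed_length + 1):
            counts[seq[start:start + seed_length]] += 1
    return counts


def select_seeds(counts, how_many):
    """Return the how_many most frequent candidates, most frequent first.

    Assumption: seeds are picked by raw frequency. Ties are broken
    alphabetically so the result is deterministic.
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [seed for seed, _ in ranked[:how_many]]
```

For the two reads "ACGTACGT" and "ACGTTT" with a seed length of 4, the candidate "ACGT" occurs three times and ranks first. A real seed file would presumably be built from the BSSL final file rather than from an in-memory list.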

Description

Technical field

[0001] The invention relates to the technical field of biological information and data compression, in particular to a compression method for second-generation sequencing data.

Background technique

[0002] DNA (deoxyribonucleic acid) is a double-helical long-chain polymer used for long-term storage of biological genetic instruction information in cells. It is a base pair sequence composed of four bases: adenine (A), thymine (T), guanine (G) and cytosine (C).

[0003] With the implementation of large-scale international cooperative research projects such as the Human Genome Project, studies in genomics, transcriptomics, RNA (ribonucleic acid) omics, and proteomics have generated massive amounts of data, which poses growing challenges for storage and transmission. Data compression saves storage space and improves the efficiency of data exchange and network transmission, which is equally important for massive biological information data. ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F19/10
Inventor: 严志祥, 杨洁, 操利超, 游丽金, 张勇, 周欣
Owner: SHENZHEN HUADA GENE INST