A method for compressing next-generation sequencing data

A method for compressing next-generation (second-generation) sequencing data, applied in fields such as electrical digital data processing, special data processing applications and instruments. It addresses the problem of low compression ratio, reducing storage space and improving processing speed.

Active Publication Date: 2018-05-29
SHENZHEN HUADA GENE INST

AI Technical Summary

Problems solved by technology

The compression ratio of existing compression algorithms and tools is not high.


Examples


Embodiment 1

[0034] This embodiment uses data from the 1000 Genomes Project as an example. Sample NA12345 is one of the more than 1,000 samples in that project and is used here for convenience of description. Its next-generation sequencing data is stored in FASTQ format in a file named example.fastq. The data is compressed using the aforementioned steps S11-S16.

[0035] In this embodiment, step S11 generates the BSSL initial file. The details are as follows.

[0036] In step S11, the split command is first used to split example.fastq into multiple small files of 80000000 lines each (this is the aforementioned first preset length; other values can of course be used). The system names the resulting small files automatically; for example, the first file is named exampleaa.fastq. The split command is a comman...
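The splitting in step S11 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `split_fastq` and its parameters are hypothetical, and the naming scheme (file stem, a two-letter suffix starting at `aa`, then the original extension) is an assumption chosen to reproduce the `exampleaa.fastq` name mentioned above. The sketch also assumes the conventional four-line FASTQ record layout, so it rejects chunk sizes that are not a multiple of 4 (80000000 is, giving 20,000,000 records per chunk).

```python
import itertools
import string
from pathlib import Path


def two_letter_suffixes():
    """Yield aa, ab, ..., zz (676 names), like split's default suffixes."""
    for a, b in itertools.product(string.ascii_lowercase, repeat=2):
        yield a + b


def split_fastq(path, lines_per_chunk=80_000_000):
    """Split a FASTQ file into files of at most `lines_per_chunk` lines.

    Hypothetical helper: output files are written next to the input and
    named <stem><suffix><ext>, e.g. exampleaa.fastq, exampleab.fastq.
    """
    if lines_per_chunk <= 0 or lines_per_chunk % 4 != 0:
        # Assumption: 4-line FASTQ records, so chunks must not cut a record.
        raise ValueError("lines_per_chunk must be a positive multiple of 4")

    path = Path(path)
    names = two_letter_suffixes()
    out_paths = []
    out = None
    try:
        with path.open("r") as src:
            # Stream line by line so large inputs are never held in memory.
            for i, line in enumerate(src):
                if i % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    out_path = path.with_name(
                        f"{path.stem}{next(names)}{path.suffix}"
                    )
                    out = out_path.open("w")
                    out_paths.append(out_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return out_paths
```

Calling `split_fastq("example.fastq")` would return the list of chunk paths, which could then be handed to separate worker processes. The two-letter suffix scheme in this sketch supports at most 676 chunks.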

Abstract

The invention discloses a compression method for next-generation sequencing data. The method comprises: dividing the next-generation sequencing data of each sample according to a first preset length to generate a BSSL original file; establishing a cutting tag file according to a second preset length; processing the BSSL original file according to the cutting tag file to obtain BSSL intermediate files; combining the BSSL intermediate files to obtain a BSSL final file; counting the frequency distribution of seed sequences in the BSSL final file and obtaining a seed file from the result; and determining compression rules from the format characteristics of the sequencing data and, based on the seed file, compressing the next-generation sequencing data of each sample. Dividing the data and processing it in parallel improves processing speed. Selecting seed sequences to obtain the seed file, and compressing according to the format characteristics and the seed file, greatly reduces the storage space required for the next-generation sequencing data.
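The seed-counting step can be pictured as a frequency count over short subsequences. The sketch below is only an illustration under stated assumptions: the abstract does not define the BSSL file layout or how seed sequences are delimited, so the code assumes seeds are fixed-length substrings of the read sequences and that the most frequent ones are kept. The names `count_seeds`, `select_seeds`, `seed_len` and `top_n` are hypothetical.

```python
from collections import Counter


def count_seeds(reads, seed_len):
    """Count every substring of length `seed_len` across all reads.

    Assumption: a "seed sequence" is a fixed-length substring of a read.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - seed_len + 1):
            counts[read[i:i + seed_len]] += 1
    return counts


def select_seeds(counts, top_n):
    """Return the `top_n` most frequent seeds.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [seed for seed, _ in ranked[:top_n]]


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTTT"]
    counts = count_seeds(reads, seed_len=4)
    print(select_seeds(counts, top_n=2))
```

In a scheme of this kind, frequent seeds would be the natural candidates for short codes in the compressed output; how the patent actually maps seeds to compression rules is not stated in the abstract.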

Description

Technical field

[0001] The invention relates to the technical field of biological information and data compression, in particular to a method for compressing next-generation sequencing data.

Background technique

[0002] DNA (deoxyribonucleic acid) is a double-helix long-chain polymer used in cells for the long-term storage of biological genetic instructions. It is a base pair sequence formed by pairing four bases: adenine (A), thymine (T), guanine (G) and cytosine (C).

[0003] With the implementation of large-scale international cooperative research projects such as the Human Genome Project, a series of studies including genomics, transcriptomics, RNA (ribonucleic acid) omics and proteomics have generated massive amounts of data, and the subsequent analysis, storage and transmission of that data present additional challenges. Data compression helps save storage space and improves the efficiency of data exchange and network transmission, which is equally important for massiv...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F19/10
Inventor: 严志祥, 杨洁, 操利超, 游丽金, 张勇, 周欣
Owner: SHENZHEN HUADA GENE INST