Genetic variation data-based GDS-Huffman compression method

A compression method for gene variation data, applied in the field of life omics analysis. It addresses the problem that existing approaches do not account for the excessive size of intermediate files, and it achieves efficient compression, smaller files, and easier processing of large files.

Status: Inactive. Publication Date: 2019-01-11
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0014] The present invention provides a GDS-Huffman compression method for gene variation data in order to solve the technical defect th



Examples


Embodiment 1

[0027] In this embodiment, GVCF files are compressed on the basis of the GDS compression method. The method exploits genotype frequency characteristics by encoding genotypes with Huffman codes, so that the GVCF file is compressed more efficiently.

[0028] The Huffman coding tree constructed from the genotype frequencies is shown in Figure 3, and the resulting codes are listed in Table 2. Genotypes are encoded according to this code table. Integer fields in the GVCF file are encoded with a variable-length integer encoding.
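The tree construction described in [0028] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the genotype counts are hypothetical (the source does not list the frequencies behind Figure 3), and the helper name `huffman_code_lengths` is introduced here only for the example. With these assumed counts, the standard Huffman procedure yields code lengths of 1, 2, 3, 4 and 4 bits, the same lengths as the codes in Table 2. The exact bit patterns depend on how the two branches at each merge are labeled.

```python
import heapq
from itertools import count


def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code built from freqs."""
    tiebreak = count()  # unique counter so heap entries never compare the dicts
    heap = [(f, next(tiebreak), {sym: 0}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol inside
        # them moves one level deeper in the tree.
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical genotype counts (assumption, not taken from the source).
counts = {"0/0": 900, "1/1": 60, "0/1": 30, "./.": 6, "1/0": 4}
lengths = huffman_code_lengths(counts)
for genotype in counts:
    print(genotype, lengths[genotype])
```

The most frequent genotype receives the shortest code, which is where the size reduction comes from when most sites are homozygous reference.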

[0029] Table 2 Genotype Huffman code

[0030]
Genotype    Code
0/0         1
1/1         01
0/1         001
./.         0000
1/0         0001
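Because no code word in Table 2 is a prefix of another, a bit string can be decoded unambiguously from left to right. The sketch below encodes and decodes genotypes with the Table 2 codes. The function names and the three-sample input are assumptions made for illustration, and bits are kept as a text string of "0" and "1" for readability rather than packed into bytes.

```python
# Code table copied from Table 2.
GT_CODE = {"0/0": "1", "1/1": "01", "0/1": "001", "./.": "0000", "1/0": "0001"}
GT_DECODE = {bits: gt for gt, bits in GT_CODE.items()}


def encode_genotypes(genotypes):
    """Concatenate the Table 2 code words for a list of genotypes."""
    return "".join(GT_CODE[g] for g in genotypes)


def decode_genotypes(bits):
    """Decode a bit string by matching code words from left to right."""
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in GT_DECODE:  # prefix-free code: first match is the symbol
            decoded.append(GT_DECODE[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code word")
    return decoded


sample = ["0/0", "0/1", "1/1"]  # hypothetical genotypes for three samples
encoded = encode_genotypes(sample)
print(encoded)                    # 100101
print(decode_genotypes(encoded))  # ['0/0', '0/1', '1/1']
```

Three genotypes take 6 bits here, compared with 9 bytes for the same three genotype strings in text form.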

[0031] Encoding the variant information in Figure 4 with the above method gives:

[0032] GT: 100101

[0033] GQ: 00010100 00010100 00010100

[0034] DP: 01000000 01100000 10000000 00000001
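The source does not specify which variable-length integer scheme is used. The sketch below assumes an LEB128-style layout: 7 payload bits per byte, least significant group first, with the high bit set on every byte except the last. That assumption is consistent with the byte strings in [0033] and [0034], which under it decode to 20, 20, 20 and 64, 96, 128. The function names are introduced here for illustration only.

```python
def encode_varint(n):
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_varints(data):
    """Decode a byte string holding consecutive varints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("input ends in the middle of a varint")
    return values


def to_bits(data):
    """Render bytes as space-separated 8-bit groups."""
    return " ".join(f"{b:08b}" for b in data)


dp = b"".join(encode_varint(v) for v in [64, 96, 128])
print(to_bits(dp))         # 01000000 01100000 10000000 00000001
print(decode_varints(dp))  # [64, 96, 128]
```

Values below 128 occupy a single byte, so small fields such as typical GQ and DP values cost less than a fixed-width 32-bit integer.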

[0035] In this embodiment, the GDS-Huffman compression method is applied in the w...

Embodiment 2

[0038] Table 3 reports a test of the compression performance of the GDS-Huffman compression method on GVCF files. The original data are whole exome sequencing data from the 1000 Genomes Project; detailed sample information can be looked up on the official 1000 Genomes website. These data had already been aligned to the reference genome and stored as CRAM-compressed files. First, each CRAM file was preprocessed with Samtools to recover the original sequences in FastQ format. The samples were then run through the whole exome analysis pipeline to obtain a GVCF file for each sample. Each GVCF file was taken as input and compressed with the GDS-Huffman compression method. The average compression ratio reaches 5.1%, and the compression speed is 4.1 M/s.


Abstract

The invention relates to a GDS-Huffman compression method for genetic variation data. On the basis of the GDS compression method, genotypes in a GVCF file are encoded with Huffman codes according to genotype frequency, and integer fields in the GVCF file are encoded with a variable-length integer encoding, yielding a compressed GDS file.

Description

Technical field

[0001] The present invention relates to the technical field of life omics analysis, and more specifically to a GDS-Huffman compression method for gene variation data.

Background art

[0002] As the number of samples in life omics analyses grows, the VCF files of gene variation data produced by whole genome and whole exome analysis keep getting larger. For example, a precision medicine program may study tens of thousands of samples, and the VCF output of whole exome analysis for that many samples can reach the terabyte scale. Such large files are slow to read and write and therefore difficult to process, which seriously reduces analysis speed and becomes a computing bottleneck. Developing new ways of organizing VCF files of gene variation data that reduce file size is an effective way to address this problem.

[0003] A typical VCF format is shown in Figure 1. As can be seen from th...


Application Information

IPC(8): G16B30/00, H03M7/40
CPC: H03M7/40
Inventor: 邓元帅, 李伟忠
Owner: SUN YAT SEN UNIV