
Method for compressing genomic data

A genomic data compression technology, applied in the fields of electrical digital data processing, special data processing applications, and instruments, that addresses the IT cost barrier created by large volumes of sequencing data.

Inactive Publication Date: 2018-03-27
GOTTFRIED WILHELM LEIBNIZ UNIV HANNOVER
Cites: 3 · Cited by: 4

AI Technical Summary

Problems solved by technology

Due to the large volume of sequencing data, IT costs can become a major barrier compared to sequencing costs.



Examples


Embodiment Construction

[0048] Figure 1 shows a possible encoder structure for the proposed sequence compression algorithm. The current read i, with i > 1, is to be compressed. Read i carries a CIGAR string CIGAR_i, a mapping position pos_i, and a nucleotide sequence seq_i. These three data parameters of read i are passed to extension module 1.

[0049] The extension module expands the nucleotide sequence seq_i of the current read i by using the mapping position pos_i and the CIGAR string CIGAR_i. The result of the extension module is the expanded sequence exp_i.
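The expansion step can be illustrated with a minimal Python sketch. It is not the patented procedure: the source does not spell out how each CIGAR operation is handled, so this sketch assumes SAM-style operation codes, drops inserted and soft-clipped bases, and fills deleted reference positions with a gap symbol. The names `expand_read` and `gap` are hypothetical.

```python
import re

# Assumption: CIGAR strings use SAM-style operation codes.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def expand_read(pos, cigar, seq, gap="-"):
    """Lay the read's bases out on reference coordinates (illustrative only).

    Returns (pos, expanded_seq). Inserted and soft-clipped bases are dropped
    here, so a real codec would have to store them separately.
    """
    out = []
    cursor = 0  # index of the next unread base in seq
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":      # aligned bases: copy them
            out.append(seq[cursor:cursor + n])
            cursor += n
        elif op in "IS":     # bases present in the read but not on the reference
            cursor += n
        elif op in "DN":     # reference positions with no read base
            out.append(gap * n)
        # H and P consume neither read nor reference, so nothing to do
    return pos, "".join(out)

print(expand_read(100, "3M1I2M2D2M", "ACGTACGT"))
```

Under these assumptions every expanded sequence is aligned to the reference, so two expanded reads can be compared position by position once their mapping positions are known.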

[0050] The codeword exp_i is passed to ring buffer 2. Ring buffer 2 is a first-in, first-out container, in particular a modifiable, variable-size container, that remembers the N previous expanded reads. Difference module 3 compares exp_i with exp_j, 1 ≤ j < i, to calculate the difference between the expanded nucleotide sequence exp_i of the current read and the expanded nucleotide sequence exp_j of a previous read. The calculated difference, along with the minimum differenc...
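A rough Python sketch of the buffer and difference step follows. The source text is truncated before it defines the difference, so the representation used here (a position offset plus a list of bases that the previous read does not supply) is an assumption, as are the names `read_difference` and `best_difference`. A `collections.deque` with `maxlen` stands in for the ring buffer.

```python
from collections import deque

def read_difference(cur, prev):
    """Describe cur relative to prev; both are (pos, expanded_seq) tuples.

    Returns (offset, novel), where novel lists (index, base) for every base
    of cur that prev does not cover or that differs from prev.
    """
    cur_pos, cur_seq = cur
    prev_pos, prev_seq = prev
    offset = cur_pos - prev_pos
    novel = []
    for k, base in enumerate(cur_seq):
        j = k + offset  # index of the same reference position in prev_seq
        if not (0 <= j < len(prev_seq)) or prev_seq[j] != base:
            novel.append((k, base))
    return offset, novel

def best_difference(cur, buffer):
    """Return (buffer_index, difference) for the buffered read needing the fewest novel bases."""
    candidates = [(idx, read_difference(cur, prev)) for idx, prev in enumerate(buffer)]
    return min(candidates, key=lambda c: len(c[1][1]))

# The buffer keeps only the N most recent expanded reads (N = 2 here).
buffer = deque(maxlen=2)
for read in [(98, "TTACGTAC"), (100, "ACGTACGT"), (102, "GTACGTTT")]:
    buffer.append(read)

print(best_difference((103, "TACGTTTA"), buffer))
```

Appending to a full `deque` discards the oldest entry, which gives the first-in, first-out behaviour described for ring buffer 2.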


Abstract

The present invention relates to a method for compressing genomic data, whereby the genomic data are stored in at least one data file containing at least a plurality of reads built by a genome sequencing method, whereby each read includes a mapping position, a CIGAR string and an actual sequenced nucleotide sequence as a local part of the donor genome, comprising the steps:
  • unwind a nucleotide sequence of a current read of one of said data files by using the mapping position and the CIGAR string of said current read, whereby said current read has at least one previous read,
  • compute a difference between the unwound nucleotide sequence of said current read and an unwound nucleotide sequence of at least one of said previous reads, whereby said difference contains the differences of the mapping positions and the nucleotide sequences,
  • pass said computed difference to an entropy coder to compress said difference,
  • encode said current read by the compressed difference, and
  • repeat the foregoing steps with said current read as one of said previous reads and a following read as a new current read until no more following reads are available.
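The loop described in the abstract can be sketched end to end in Python under several simplifying assumptions. The reads are taken as already unwound `(pos, seq)` tuples, each read is compared only with the immediately preceding one, the text record format (`R` and `D` lines) is hypothetical, and `zlib` (DEFLATE) stands in for the entropy coder, which the source does not specify.

```python
import zlib

def diff(cur, prev):
    """Return (offset, length, novel) describing cur relative to prev."""
    (cur_pos, cur_seq), (prev_pos, prev_seq) = cur, prev
    off = cur_pos - prev_pos
    novel = [(k, b) for k, b in enumerate(cur_seq)
             if not (0 <= k + off < len(prev_seq)) or prev_seq[k + off] != b]
    return off, len(cur_seq), novel

def encode_reads(reads):
    """Encode a list of (pos, unwound_seq) reads as one compressed byte string."""
    records, prev = [], None
    for cur in reads:
        if prev is None:
            records.append(f"R {cur[0]} {cur[1]}")      # first read stored as is
        else:
            off, length, novel = diff(cur, prev)
            edits = ",".join(f"{k}{b}" for k, b in novel)
            records.append(f"D {off} {length} {edits}")
        prev = cur                                       # current read becomes the previous read
    return zlib.compress("\n".join(records).encode("ascii"), 9)

def decode_reads(blob):
    """Invert encode_reads."""
    reads = []
    for line in zlib.decompress(blob).decode("ascii").split("\n"):
        if not line:
            continue
        fields = line.split(" ")
        if fields[0] == "R":
            reads.append((int(fields[1]), fields[2]))
            continue
        off, length, edits = int(fields[1]), int(fields[2]), fields[3]
        prev_pos, prev_seq = reads[-1]
        # Copy what the previous read supplies; "?" marks bases filled by the edits.
        seq = [prev_seq[k + off] if 0 <= k + off < len(prev_seq) else "?"
               for k in range(length)]
        for e in filter(None, edits.split(",")):
            seq[int(e[:-1])] = e[-1]
        reads.append((prev_pos + off, "".join(seq)))
    return reads

reads = [(100, "ACGTACGT"), (102, "GTACGTTT"), (103, "TACGTTTA")]
assert decode_reads(encode_reads(reads)) == reads
```

Because overlapping reads share most of their bases, the difference records are short and repetitive, which is what makes them a good input for the final compression stage.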

Description

Technical field

[0001] The present invention relates to a method for compressing genomic data stored in at least one data file comprising at least a plurality of mapped and/or aligned reads constructed by a genome sequencing method, where each read includes the mapping position, the CIGAR string, and the actual sequenced nucleotide sequence as a local part of the donor genome.

Background

[0002] Sequencing of large amounts of genetic information has become affordable thanks to novel high-throughput sequencing (HTS) and/or next-generation sequencing (NGS) technologies. Due to the resulting large volume of data, IT costs can become a major hurdle compared to sequencing costs. High-performance compression of genomic data is required to reduce storage size and transmission costs.

[0003] In such data files, among other data, nucleotide sequences, mapping positions, alignment information (CIGAR strings) and quality scores are stored. Such a structure is described, for example,...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/22, G16B50/50, G16B30/10, G16B40/00
CPC: G06F16/1744, G16B30/00, G16B40/00, G16B50/00, G16B30/10, G16B50/50, H03M7/3059
Inventor: M. Munderloh (M·曼德龙), J. Voges (J·福格斯), J. Ostermann (J·奥斯特曼)
Owner: GOTTFRIED WILHELM LEIBNIZ UNIV HANNOVER