Method, system, and medium for compressing, decompressing, and restoring the CIGAR field of SAM and BAM files

A compression and decompression technique for SAM and BAM files, applied in the field of bioinformatics. It addresses the large, previously neglected room for optimization in compressing the CIGAR field, and achieves a good compression effect, a high compression ratio, and a wide range of application.

Active Publication Date: 2020-01-17
GENETALKS BIO TECH CHANGSHA CO LTD

AI Technical Summary

Problems solved by technology

However, compression of the sixth field, the CIGAR field, still receives little attention: it is often either ignored or handled with a general-purpose compression method, which leaves considerable room for optimization.



Embodiment Construction

[0034] SAM and BAM files store the results produced by analysis software when aligning short read sequences against reference sequences. To describe an alignment, SAM and BAM files define the CIGAR field. It is the sixth field of a SAM or BAM record; it records the complete alignment information between the short read and the reference sequence, written as a series of number-plus-operator pairs. For example, in "100M", 100 is the length associated with the operator M, and the operator M denotes an alignment match. A CIGAR field of "100M" therefore means that the 100 bases of the short read, starting from position 1, align to the 100 positions of the reference sequence starting at position POS. The POS value of the reference sequence is recorded in the fourth field of BAM, and which reference sequence (or chromosome) it corresponds to is indexed by the name o...
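The number-plus-operator layout described in [0034] can be illustrated with a short Python sketch. This is not code from the patent: the names `parse_cigar` and `reference_span` are hypothetical, and the set of operators that advance along the reference (M, D, N, =, X) is taken from the SAM specification rather than from this text.

```python
import re

# One CIGAR element is a run length followed by a single operator letter.
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that consume reference positions (per the SAM specification).
REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '100M' into (length, operator) pairs."""
    return [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]


def reference_span(pos, cigar):
    """Return the 1-based, inclusive reference interval covered by the alignment."""
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos, pos + ref_len - 1


print(parse_cigar("100M"))
print(reference_span(1000, "100M"))
```

For "100M" with a hypothetical POS of 1000, the alignment covers 100 reference positions beginning at POS, which matches the reading of the example in the paragraph above.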


Abstract

The invention discloses a method, system, and medium for compressing, decompressing, and restoring the CIGAR field of SAM and BAM files. The method comprises the following preprocessing before compression: CIGAR field data with a single operator is encoded into a first specified content, after which execution skips to step A6); CIGAR field data with two operators is encoded into a second specified content that holds only one operator and its numeric part; for CIGAR field data with three or more operators, the encoding omits the first operator M together with its numeric part "\d+M", as well as the operator M of the last operator, to obtain a third specified content. Because the preprocessing exploits the characteristics of the CIGAR field, the field's content can be transformed by fixed rules to achieve high-ratio compression. The method offers fast and efficient preprocessing, a high compression ratio, and a good compression effect, and it applies to both SAM and BAM files, giving it a wide range of application.
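The abstract gives only the outline of the rule for three or more operators, so the following Python sketch is one possible reading rather than the patented encoding. It assumes that the omitted leading "\d+M" can be recovered from the read length (for example, the length of the SEQ field), that the rule is applied only when both the first and last operators are M, and that a trailing digit marks an encoded string. The function names are hypothetical, and the one-operator and two-operator cases are not modeled because the abstract does not specify their encodings.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")

# Operators that consume bases of the read (per the SAM specification).
QUERY_CONSUMING = set("MIS=X")


def parse(cigar):
    return [(int(n), op) for n, op in CIGAR_TOKEN.findall(cigar)]


def encode_multi_op(cigar):
    """Drop the leading '<n>M' and the final 'M' letter (assumed reading of the rule)."""
    ops = parse(cigar)
    if len(ops) < 3 or ops[0][1] != "M" or ops[-1][1] != "M":
        return cigar  # rule not applicable in this sketch; leave unchanged
    middle = "".join(f"{n}{op}" for n, op in ops[1:-1])
    return f"{middle}{ops[-1][0]}"


def decode_multi_op(encoded, read_length):
    """Restore the original CIGAR; read_length is an assumed side input."""
    if not encoded[-1].isdigit():
        return encoded  # a valid CIGAR ends in a letter, so nothing was removed
    body = encoded + "M"  # put back the final operator
    consumed = sum(n for n, op in parse(body) if op in QUERY_CONSUMING)
    first = read_length - consumed  # length of the omitted leading match
    return f"{first}M{body}"


original = "30M2I68M"
packed = encode_multi_op(original)
print(packed)
print(decode_multi_op(packed, read_length=100))
```

The design choice here is that the encoded form needs no extra flag: a string ending in a digit can only come from the transform, because every valid CIGAR string ends with an operator letter.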

Description

Technical field

[0001] The invention relates to the compression and restoration of SAM and BAM data in the field of bioinformatics, and in particular to a method, system, and medium for compressing, decompressing, and restoring the CIGAR field of SAM and BAM files.

Background

[0002] In bioinformatics, especially in the analysis of high-throughput sequencing data, most operations involve aligning (mapping) short read sequences to reference sequences with tools such as bwa and bowtie, so a unified format is needed to represent the mapping results. The SAM (Sequence Alignment/Map) file format addresses this need. It is a file format for storing the alignment of sequencing reads against reference sequences; it uses TAB as the field separator and supports short reads and long reads (up to 128 Mbp) from different platforms. However, because the size of the SAM file is usually large, it will be conve...


Application Information

IPC(8): H03M7/30
CPC: H03M7/3059, H03M7/3068
Inventor 徐霞丽, 李根, 冯博伦, 黄能超, 赵丽霞, 马丑贤, 王振国, 杨耀
Owner GENETALKS BIO TECH CHANGSHA CO LTD