Alignment gene sequencing data compression method, system and computer readable medium

A gene sequencing data compression technology, applicable to computing, electrical digital data processing, and special-purpose data processing applications. It addresses three problems of existing methods: compression rates that could still be lowered, long compression/decompression times, and compression performance that varies with the choice of reference genome. It achieves short compression time, stable compression performance, and a low compression ratio.

Active Publication Date: 2020-07-17
GENETALKS BIO TECH CHANGSHA CO LTD

AI Technical Summary

Problems solved by technology

[0009] According to researchers' comparative studies of existing gene sequencing data compression methods, general-purpose compression algorithms, reference-free compression algorithms, and reference-based compression algorithms all share the following problems: 1. There is still room to lower the compression rate; 2. When a relatively good compression rate is obtained, the compression/decompression time of the algorithm is relatively long, and the time cost becomes a new problem.
In addition, for a reference-based compression algorithm, the selection of the reference genome affects the stability of the algorithm's performance. That is, when the same target sample data is processed with different reference genomes, the performance of the compression algorithm may differ significantly; and when the same reference genome selection strategy is used on gene sequencing sample data of the same kind, the performance of the compression algorithm may also differ significantly.

Examples

Embodiment Construction

[0035] Referring to Figure 1, in this embodiment the implementation steps of the alignment-based gene sequencing data compression method include:

[0036] 1) Traverse the gene sequencing sample data and obtain each read sequence R of read length Lr;

[0037] 2) For each read sequence R, select k original gene letters as the original gene string CS0. Starting from CS0, slide a window of length k along the read to generate fixed-length k-character short strings (K-mers) in sequence, and compare each K-mer in turn with the reference genome to obtain the predicted character c at the position adjacent to its match in the positive or negative strand of the reference genome. The predicted character set PS, composed of all predicted characters c, is thereby obtained; the Lr-k original gene letters in the read sequence R that do not belong to the k original gene letters and the predicted character set PS are encoded and passed through th...
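The prediction part of step 2) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the dictionary index, the function names `build_next_char_index` and `predict_characters`, the use of the first occurrence of each K-mer, the forward-strand-only lookup, and the fallback character `"A"` for K-mers missing from the reference are all assumptions made for the example.

```python
def build_next_char_index(reference: str, k: int) -> dict:
    """Map each k-mer of the reference to the character that follows it.

    Assumption: only the first occurrence of a k-mer is recorded, and only
    the forward (positive) strand is indexed.
    """
    index = {}
    for i in range(len(reference) - k):
        index.setdefault(reference[i:i + k], reference[i + k])
    return index


def predict_characters(read: str, index: dict, k: int, fallback: str = "A") -> str:
    """Return the predicted character set PS for one read.

    The window starts at CS0 = read[:k] and slides one position at a time;
    each k-mer yields one predicted character c for the next position.
    The fallback for unknown k-mers is a hypothetical choice.
    """
    predictions = []
    for i in range(len(read) - k):
        kmer = read[i:i + k]
        predictions.append(index.get(kmer, fallback))
    return "".join(predictions)


reference = "ACGTACGGTCA"
read = "ACGTACTT"
k = 3
index = build_next_char_index(reference, k)
ps = predict_characters(read, index, k)
print(ps)  # one predicted character per position after CS0, i.e. Lr - k of them
```

Because each window is taken from the read itself, a decompressor that holds CS0 can rebuild the same K-mers one position at a time as it recovers each letter.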

Abstract

The invention discloses an alignment-based gene sequencing data compression method, system and computer-readable medium. For each read sequence R in the gene sequencing data sample, the compression method selects the original gene string CS0, generates short strings (K-mers) of length k in sequence, and compares each K-mer in turn with the reference genome to obtain the predicted character c at the adjacent position in the positive or negative strand of the reference genome, yielding the predicted character set PS composed of all predicted characters c. The remaining Lr-k letters of the read sequence R and the predicted character set PS are encoded and then combined by a reversible operation through a reversible function. The strand type d (positive or negative) of the read sequence R, CS0, and the result of the reversible operation are compressed and output as three data streams. The present invention has the advantages of low compression rate, short compression time, and stable compression performance; it does not require precise alignment of the genetic data and has high computational efficiency. The higher the alignment accuracy, the lower the compression rate.
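To make the "reversible operation" and the three output streams more concrete, here is a small sketch. The source does not say which reversible function is used; bitwise XOR over a 2-bit letter code is swapped in purely as an illustrative assumption, and the names `CODE`, `xor_residual`, `restore_tail`, and `to_streams`, as well as the 0/1 strand flag, are hypothetical.

```python
# Hypothetical 2-bit encoding of the four gene letters.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
LETTER = "ACGT"


def xor_residual(tail: str, ps: str) -> list:
    """XOR each actual letter with its predicted letter (assumed reversible function).

    A correct prediction gives 0, so accurate predictions produce long runs
    of zeros that a downstream entropy coder can compress well.
    """
    assert len(tail) == len(ps)
    return [CODE[a] ^ CODE[p] for a, p in zip(tail, ps)]


def restore_tail(residual: list, ps: str) -> str:
    """Invert xor_residual: XOR the residual with the prediction again."""
    return "".join(LETTER[r ^ CODE[p]] for r, p in zip(residual, ps))


def to_streams(read: str, ps: str, k: int, strand: int):
    """Split one read into the three streams named in the abstract: d, CS0, residual."""
    cs0 = read[:k]
    residual = xor_residual(read[k:], ps)
    return strand, cs0, residual


d, cs0, residual = to_streams("ACGTACTT", "TACGA", k=3, strand=0)
print(d, cs0, residual)
print(cs0 + restore_tail(residual, "TACGA"))
```

With XOR, the residual is all zeros wherever the prediction matches the read, which is consistent with the statement that higher alignment accuracy gives a lower compression rate. In this sketch `restore_tail` receives PS in full for brevity; a real decompressor would have to regenerate each predicted character incrementally from the letters already decoded.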

Description

Technical field

[0001] The present invention relates to gene sequencing and data compression technology, in particular to an alignment-based gene sequencing data compression method, system and computer readable medium.

Background

[0002] In recent years, with the continuous advancement of Next Generation Sequencing (NGS), gene sequencing has become faster and cheaper, and gene sequencing technology has been promoted and applied in a widening range of fields, including biology, medicine, health, criminal investigation, and agriculture. As a result, the amount of raw data generated by gene sequencing has grown explosively, by a factor of 3 to 5 or more every year. Moreover, the data of each gene sequencing sample is very large; for example, the 55x whole genome sequencing data of one person is about 400GB. Therefore, the storage, management, retrieval and transmission of massive gene sequencing data face technical and cost challenges.

[0003] Data compr...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B50/50, G16B30/00, H03M7/40, H03M7/30
CPC: H03M7/40, H03M7/3086, G16B30/00, G16B50/00
Inventor: 李根, 宋卓, 刘蓬侠, 王振国, 冯博伦, 马丑贤
Owner: GENETALKS BIO TECH CHANGSHA CO LTD