Comparison-based gene sequencing data compression method, system, and computer-readable medium

A gene sequencing and data compression technology, applied in sequence analysis, instruments, electrical components, etc. It addresses the long compression/decompression times of existing algorithms, compression rates that still leave room for reduction, and inconsistent performance across compression algorithms, and it achieves a low compression rate, short compression time, and stable compression performance.

Active Publication Date: 2019-07-16
GENETALKS BIO TECH CHANGSHA CO LTD

AI Technical Summary

Problems solved by technology

[0009] According to researchers' comparative studies of existing gene sequencing data compression methods, general-purpose compression algorithms, compression algorithms without a reference genome, and compression algorithms with a reference genome all share the following problems: 1. The compression rate still has room for further reduction; 2. When a relatively good compression rate is obtained, the compression/decompression time of the algorithm is relatively long, and the time cost becomes a new problem.



Examples


Embodiment Construction

[0035] As shown in Figure 1, the implementation steps of the comparison-based gene sequencing data compression method in this embodiment include:

[0036] 1) Traverse the gene sequencing data sample and obtain each read sequence R of read length Lr;

[0037] 2) For each read sequence R, select k original gene letters as the original gene string CS0. Starting from CS0, use a sliding window of length k to sequentially generate fixed-length strings of k letters as short K-mers, and compare each short K-mer with the reference genome in order to obtain the predicted character c at the adjacent position in the positive or negative strand of the reference genome, yielding the predicted character set PS composed of all predicted characters c; the Lr-k original gene letters of the read sequence R, excluding the k original gene letters of CS0, and the predicted character set PS are encoded and passed through the re...
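The sliding-window prediction in step 2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `build_kmer_index` and `predict_read`, the choice of the most frequent following base as the predicted character, the fallback base for K-mers absent from the reference, and the restriction to the positive strand are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map each k-mer of the reference to the base that most often follows it.

    Assumption: only the positive strand is indexed; the source also allows
    the negative strand, which this sketch omits.
    """
    followers = defaultdict(Counter)
    for i in range(len(reference) - k):
        followers[reference[i:i + k]][reference[i + k]] += 1
    return {kmer: counts.most_common(1)[0][0] for kmer, counts in followers.items()}


def predict_read(read: str, index: dict, k: int, fallback: str = "A"):
    """Return (CS0, PS) for one read.

    CS0 is the first k letters of the read. PS holds one predicted character
    for each of the remaining Lr - k positions, obtained by sliding a window
    of length k over the read and looking up the following base.
    The fallback base for unknown k-mers is a hypothetical choice.
    """
    cs0 = read[:k]
    predicted = []
    for i in range(len(read) - k):
        kmer = read[i:i + k]
        predicted.append(index.get(kmer, fallback))
    return cs0, "".join(predicted)


if __name__ == "__main__":
    index = build_kmer_index("ACGTTGCA", 3)
    cs0, ps = predict_read("ACGTTC", index, 3)
    print(cs0, ps)  # ACG TTG
```

In this toy run the read "ACGTTC" differs from the reference only in its last letter, so the prediction "TTG" matches the actual letters "TTC" in two of three positions. The closer the predictions are to the actual letters, the more regular the data left for the later encoding stage.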



Abstract

The invention discloses a comparison-based gene sequencing data compression method and system and a computer-readable medium. The compression method comprises the following steps: an initial gene character string CS0 is selected for each read R in a gene sequencing data sample; short K-mer strings of length k are generated in sequence, and each short K-mer is compared in turn with the reference genome to obtain its adjacent predicted character c in the positive or negative strand of the reference genome, yielding a predicted character set PS composed of all predicted characters c; the Lr-k letters of the read R and the predicted character set PS are encoded and then combined through an invertible function; and the positive/negative strand type d of the read R, CS0, and the result of the invertible computation are compressed and output as three data streams. The method has the advantages of a low compression rate, short compression time, and stable compression performance; it does not require precise alignment of the gene data and has high computational efficiency; and the higher the accuracy, the lower the compression rate.
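The encoding and invertible-computation stage described in the abstract can be sketched as follows. The source specifies only "an invertible function"; the 2-bit letter codes, the use of XOR as that function, the use of `zlib` as the back-end compressor, and the function names below are all assumptions chosen to keep the example self-contained.

```python
import zlib

# Hypothetical 2-bit encoding of the four gene letters.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode_residual(actual: str, predicted: str) -> bytes:
    """XOR the encoded actual letters with the encoded predicted letters.

    A correct prediction yields 0, so accurate predictions produce long
    runs of zeros that a general-purpose compressor handles well.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return bytes(CODE[a] ^ CODE[p] for a, p in zip(actual, predicted))


def decode_residual(residual: bytes, predicted: str) -> str:
    """Invert encode_residual: XOR with the same predictions recovers the letters."""
    return "".join(BASE[r ^ CODE[p]] for r, p in zip(residual, predicted))


def pack_streams(strand: str, cs0: str, residual: bytes):
    """Compress strand type d, CS0, and the residual as three separate streams."""
    return (
        zlib.compress(strand.encode()),
        zlib.compress(cs0.encode()),
        zlib.compress(residual),
    )


if __name__ == "__main__":
    residual = encode_residual("TTC", "TTG")
    print(list(residual))                    # [0, 0, 3]
    print(decode_residual(residual, "TTG"))  # TTC
```

XOR is its own inverse, which is why the decoder can recover the original letters from the residual and the same predictions. In a full decoder the predictions would have to be regenerated letter by letter from CS0 and the already-decoded letters; this sketch takes them as given.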

Description

Technical field

[0001] The present invention relates to gene sequencing and data compression technology, in particular to a comparison-based gene sequencing data compression method, system, and computer-readable medium.

Background technique

[0002] In recent years, with the continuous advancement of Next Generation Sequencing (NGS), gene sequencing has become faster and cheaper, and gene sequencing technology has been promoted and applied in a wide range of fields, including biology, medicine, health, criminal investigation, and agriculture. As a result, the amount of raw data generated by gene sequencing is growing explosively, at a rate of 3 to 5 times or even faster every year. Moreover, the data of each gene sequencing sample is very large; for example, the 55x whole genome sequencing data of one person is about 400 GB. Therefore, the storage, management, retrieval, and transmission of massive genetic test data face technical and cost challenges.

[0003] Data compr...


Application Information

IPC(8): G16B50/50, G16B30/00, H03M7/40, H03M7/30
CPC: H03M7/40, H03M7/3086, G16B30/00, G16B50/00
Inventor: 李根, 宋卓, 刘蓬侠, 王振国, 冯博伦, 马丑贤
Owner: GENETALKS BIO TECH CHANGSHA CO LTD