Gene sequencing data compression method, system and computer readable medium

A gene sequencing data compression technology, applied in computing, electrical digital data processing, and special data processing applications. It addresses three problems of existing methods: the compression rate still has room to fall, compression/decompression time is long, and compression performance varies with the choice of reference genome. It achieves a low compression rate, a short compression time, and stable compression performance.

Active Publication Date: 2020-07-24
GENETALKS BIO TECH CHANGSHA CO LTD

AI Technical Summary

Problems solved by technology

[0009] According to researchers' comparative studies of existing gene sequencing data compression methods, general-purpose compression algorithms, compression algorithms without a reference genome, and compression algorithms with a reference genome all share the following problems: 1. The compression rate still has room to fall further; 2. When a relatively good compression rate is obtained, the compression/decompression time of the algorithm is relatively long, and the time cost becomes a new problem.
In addition, for a compression algorithm with a reference genome, the choice of reference genome affects the stability of the algorithm's performance: when processing the same target sample data, selecting different reference genomes may lead to significantly different compression performance; and even with the same reference genome selection strategy, compression performance may differ significantly across gene sequencing sample data of the same kind.



Examples


Embodiment 1

[0044] As shown in Figure 1, the method for compressing gene sequencing data in this embodiment includes:

[0045] 1) Traverse the gene sequencing sample data and obtain each read sequence R with read length Lr;

[0046] 2) For each read sequence R, select k original gene letters as the original gene string CS0. Starting from CS0, slide a window of length k along the read to generate fixed-length strings of k characters as short strings (K-mers). Determine the positive/negative strand type d of the read sequence R from the K-mers, and use the preset prediction data model P1 to obtain the predicted character c for the position adjacent to each K-mer, yielding a predicted character set PS of length Lr-k. The prediction data model P1 contains every K-mer in the positive and negative strands of the reference genome together with the predicted character c for its adjacent position; the Lr-k origi...
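Step 2) can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text above does not fix: P1 is modelled as a dictionary keyed by (K-mer, d); the strand type d is chosen as the strand whose entries match more of the read's K-mers; the reversible function is XOR on 2-bit base codes; unknown K-mers fall back to a default prediction; and reads contain only A, C, G and T. The names `choose_strand`, `encode_read` and `decode_read` are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def choose_strand(read, k, model):
    """Assumed rule: pick the strand whose model entries match more K-mers."""
    windows = [read[i:i + k] for i in range(len(read) - k)]
    hits = [sum((w, d) in model for w in windows) for d in (0, 1)]
    return 0 if hits[0] >= hits[1] else 1

def encode_read(read, k, model, default="A"):
    """Return (d, CS0, residual); residual has Lr - k entries."""
    d = choose_strand(read, k, model)
    residual = bytearray()
    for i in range(len(read) - k):
        # Predict the base that follows the K-mer starting at position i.
        predicted = model.get((read[i:i + k], d), default)
        # XOR is 0 whenever the prediction is correct.
        residual.append(BASE_CODE[read[i + k]] ^ BASE_CODE[predicted])
    return d, read[:k], bytes(residual)

def decode_read(d, cs0, residual, k, model, default="A"):
    """Rebuild the read base by base from CS0, d and the residual."""
    read = cs0
    for r in residual:
        predicted = model.get((read[-k:], d), default)
        read += CODE_BASE[r ^ BASE_CODE[predicted]]
    return read

# Toy model: four positive-strand K-mers with k = 3.
model = {("ACG", 0): "T", ("CGT", 0): "A", ("GTA", 0): "C", ("TAC", 0): "G"}
d, cs0, residual = encode_read("ACGTACGA", 3, model)
restored = decode_read(d, cs0, residual, 3, model)
```

In this toy run only the last base is mispredicted, so the residual is zero everywhere except its final entry, and the read is recovered from d, CS0 and the residual alone.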

Embodiment 2

[0091] This embodiment is basically the same as Embodiment 1; the main difference is the prediction data model P1 in step 1).

[0092] In this embodiment, the prediction data model P1 is a neural network model trained in advance on the K-mers in the reference genome and the base letters c0 at their adjacent positions. In step 2.2.2), obtaining the predicted character c for each tuple (k-mer, 0) in the positive-strand prediction sequence KP1 through the mapping function of the prediction data model P1 specifically means inputting each tuple (k-mer, 0) in KP1 into the neural network model to obtain the predicted character c for that tuple. In step 2.2.4), each tuple (k-mer, 1) in the negative-strand prediction sequence KP2 is mapped through the mapping function of the prediction data model P1 to obtain the predicted character c for its adjacent position. Spec...
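Whichever form P1 takes, it is derived from K-mers of the reference genome paired with their adjacent bases. The sketch below shows one way such pairs might be extracted for both strands. It assumes the negative strand is the reverse complement of the positive strand, and it uses a simple majority-vote table as a stand-in predictor; it does not implement the neural network of this embodiment. The names `training_pairs` and `majority_model` are hypothetical.

```python
from collections import Counter, defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Assumed derivation of the negative strand from the positive strand."""
    return seq.translate(COMPLEMENT)[::-1]

def training_pairs(reference, k):
    """Yield ((k-mer, d), next_base) with d = 0 (positive) or 1 (negative)."""
    for d, strand in enumerate((reference, reverse_complement(reference))):
        for i in range(len(strand) - k):
            yield (strand[i:i + k], d), strand[i + k]

def majority_model(pairs):
    """Stand-in predictor: most frequent next base seen for each (k-mer, d)."""
    counts = defaultdict(Counter)
    for key, base in pairs:
        counts[key][base] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

pairs = list(training_pairs("ACGTAC", 3))
model = majority_model(pairs)
```

The same `pairs` could serve as (input, label) examples for a trained classifier; any model exposing a (k-mer, d) to base mapping can then be swapped in without changing the rest of the pipeline.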


Abstract

The invention discloses a gene sequencing data compression method, system and computer-readable medium. The compression method includes: traversing the data to obtain read sequences R with read length Lr; for each read sequence, selecting the original gene string CS0, generating short strings (K-mers), and determining the positive/negative strand type d; obtaining the predicted character c for each K-mer through the prediction data model P1 to form the predicted character set PS; encoding the remaining Lr-k characters of the read sequence R and the predicted character set PS, then combining them through a reversible function; and compressing and outputting the strand type d, CS0, and the result of the reversible operation for the read sequence R. The invention has the advantages of a low compression rate, short compression time, and stable compression performance. It does not require exact alignment of the gene data and has high computational efficiency. The more repeated strings the gene sequencing data contains, the lower the compression rate.
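The link between prediction accuracy and compression rate can be illustrated with a small experiment. The sketch below assumes XOR on 2-bit base codes as the reversible function (XOR is its own inverse), zlib as a stand-in for the final compressor, and a compression rate defined as compressed size divided by uncompressed size at one byte per base. None of these choices is stated in the abstract, and the function names are hypothetical.

```python
import random
import zlib

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"

def xor_residual(actual, predicted):
    """Reversible operation: 0 wherever the predicted base is correct."""
    return bytes(BASE_CODE[a] ^ BASE_CODE[p] for a, p in zip(actual, predicted))

def compression_rate(actual, predicted):
    """Compressed residual size divided by the uncompressed size."""
    return len(zlib.compress(xor_residual(actual, predicted), 9)) / len(actual)

rng = random.Random(0)  # fixed seed so the toy data is reproducible
actual = "".join(rng.choice("ACGT") for _ in range(2000))

# Accurate predictor: wrong only on every 50th base.
good = "".join(
    CODE_BASE[(BASE_CODE[b] + 1) % 4] if i % 50 == 0 else b
    for i, b in enumerate(actual)
)
# Uninformative predictor: always guesses "A", so the residual is the data itself.
poor = "A" * len(actual)

rate_good = compression_rate(actual, good)
rate_poor = compression_rate(actual, poor)
```

With the accurate predictor the residual is almost entirely zeros and compresses far better than with the uninformative one. This is one plausible reading of why samples that repeat more strings found in the reference reach a lower compression rate.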

Description

Technical field

[0001] The invention relates to gene sequencing and data compression technology, in particular to a gene sequencing data compression method, system and computer readable medium.

Background technique

[0002] In recent years, with the continuous advancement of Next Generation Sequencing (NGS), gene sequencing has become faster and cheaper, and gene sequencing technology has been promoted and applied in a wide range of fields such as biology, medicine, health, criminal investigation, and agriculture. As a result, the amount of raw data generated by gene sequencing is growing explosively, at a rate of 3 to 5 times or even faster every year. Moreover, the data of each gene sequencing sample is very large; for example, the 55x whole genome sequencing data of one person is about 400GB. Therefore, the storage, management, retrieval and transmission of massive genetic testing data face technical and cost challenges.

[0003] Data compression is one of the...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G16B50/50
CPC: G16B25/00, G16B50/00
Inventor: 李根, 宋卓, 刘蓬侠, 王振国, 冯博伦
Owner: GENETALKS BIO TECH CHANGSHA CO LTD