Referential DNA sequence compressing method and system

A referential DNA sequence compression method, applied in sequence analysis and special data processing applications, addresses the inability of earlier methods to make full use of the repetition between DNA sequences, and achieves efficient sequence compression with good practicability and scalability.

Active Publication Date: 2017-08-18
SHANGHAI JIAO TONG UNIV

AI Technical Summary

Problems solved by technology

However, this method fixes the length of the subsequence when searching for repeats, so it cannot make full use of the repetitive nature of DNA sequences.
The high redu...

Examples

Embodiment Construction

[0028] The present invention will be described in detail below in conjunction with specific embodiments. The following examples will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention. These all belong to the protection scope of the present invention.

[0029] As shown in Figure 1, the structural block diagram of a preferred embodiment of the referential DNA sequence compression system of the present invention includes a repeat pattern matching module, a compression encoding module, a non-repetitive symbol predictive encoding module and a decompression module. The repeat pattern matching module uses the input reference sequence to build an inverted full-text index structure and searches for the longest matching subsequen...
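The repeat pattern matching step can be illustrated with a minimal sketch. It is not the patented implementation: a plain k-mer hash table stands in for the inverted full-text index, matches shorter than the seed length `k` are treated as unmatched symbols, and the names `build_kmer_index`, `longest_match`, `factorize` and `reconstruct` are hypothetical. It only shows how a sequence to be compressed might be split into (position, length) matches against the reference plus leftover literal symbols.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts.
    Assumption: a simple stand-in for the inverted full-text index."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def longest_match(reference, index, target, start, k):
    """Return (ref_pos, length) of the longest match for target[start:], or None."""
    seed = target[start:start + k]
    if len(seed) < k:
        return None
    best = None
    for pos in index.get(seed, ()):
        length = k
        # Extend the seed while both sequences agree.
        while (pos + length < len(reference)
               and start + length < len(target)
               and reference[pos + length] == target[start + length]):
            length += 1
        if best is None or length > best[1]:
            best = (pos, length)
    return best

def factorize(reference, target, k=4):
    """Greedily split target into reference matches and unmatched literals."""
    index = build_kmer_index(reference, k)
    tokens, i = [], 0
    while i < len(target):
        match = longest_match(reference, index, target, i, k)
        if match is None:
            tokens.append(("lit", target[i]))   # goes to the predictive coder
            i += 1
        else:
            tokens.append(("match", match[0], match[1]))  # position and length
            i += match[1]
    return tokens

def reconstruct(reference, tokens):
    """Decompression side: rebuild the target from the tokens."""
    parts = []
    for token in tokens:
        if token[0] == "lit":
            parts.append(token[1])
        else:
            parts.append(reference[token[1]:token[1] + token[2]])
    return "".join(parts)
```

In this sketch the match tokens correspond to the information handled by the compression encoding module, and the literal tokens to the symbols passed on to the non-repetitive symbol predictive encoding module.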

Abstract

The invention provides a referential DNA sequence compression method and system. The method comprises three steps. In the repeat pattern matching step, an inverted full-text index structure is built from the input reference sequence and used to search for the longest matching subsequences of the input sequence to be compressed; the matching information is passed to the compression encoding step, and the unmatched symbols are passed to the non-repeated symbol predictive encoding step. In the compression encoding step, the length and position of each matching subsequence are compression-encoded, and the encoded information is later used for decompression. In the non-repeated symbol predictive encoding step, the unmatched symbols from the repeat pattern matching step are received, and the occurrence probability of each symbol is predicted by a mixed context model and then encoded. The method combines the efficient search provided by the index structure with the efficient single-symbol compression provided by the mixed context model, so that, compared with other referential DNA sequence compression methods, it achieves a high compression rate within an acceptable compression time and is highly practical.
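The predictive encoding of unmatched symbols can be sketched as follows. The source only states that a mixed context model predicts symbol probabilities; the context orders, the Laplace-smoothed counts, the reweighting rule and the names `OrderKModel` and `estimate_code_length` below are all assumptions made for illustration. The sketch reports the ideal code length in bits rather than running an actual entropy coder.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

class OrderKModel:
    """Counts which symbol follows each context of length k (hypothetical helper)."""
    def __init__(self, k):
        self.k = k
        # Start every context at one count per symbol so no probability is zero.
        self.counts = defaultdict(lambda: [1, 1, 1, 1])

    def _context(self, history):
        return history[-self.k:] if self.k > 0 else ""

    def predict(self, history):
        counts = self.counts[self._context(history)]
        total = sum(counts)
        return [c / total for c in counts]

    def update(self, history, symbol):
        self.counts[self._context(history)][ALPHABET.index(symbol)] += 1

def estimate_code_length(symbols, orders=(1, 2, 4), decay=0.9):
    """Ideal code length in bits under a weighted mixture of context models.
    Assumption: weights follow each model's recent success at prediction."""
    models = [OrderKModel(k) for k in orders]
    weights = [1.0 / len(models)] * len(models)
    keep = max(max(orders), 1)
    history, bits = "", 0.0
    for symbol in symbols:
        idx = ALPHABET.index(symbol)
        preds = [m.predict(history) for m in models]
        # Mixed probability of each symbol: convex combination of the models.
        mixed = [sum(w * p[j] for w, p in zip(weights, preds)) for j in range(4)]
        bits += -math.log2(mixed[idx])
        # Reward models that gave the observed symbol a high probability.
        scores = [(w ** decay) * p[idx] for w, p in zip(weights, preds)]
        total = sum(scores)
        weights = [s / total for s in scores]
        for m in models:
            m.update(history, symbol)
        history = (history + symbol)[-keep:]
    return bits
```

A real coder would feed the mixed probabilities to an arithmetic coder; the bit count here is only a lower-bound estimate of what such a coder could achieve. A highly regular input such as `"ACGT" * 50` should come out well below the 2 bits per symbol of a fixed-length code.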

Description

Technical field

[0001] The invention relates to a DNA sequence compression system, in particular to a referential DNA sequence compression method and system based on a full-text index structure and a mixed context prediction model.

Background art

[0002] DNA molecules are composed of four deoxyribonucleotides, distinguished by their bases: adenine (A), guanine (G), cytosine (C), and thymine (T). A DNA sequence contains the important genetic information of a living organism and is of great significance to the fields of biology, medicine and information science. With the development of DNA sequencing technology, more and more DNA data is being stored and used, and the number of DNA sequences has grown exponentially. However, data storage capacity grows far more slowly than the data itself, and insufficient storage space has become an unavoidable practical challenge in scientific development. Therefore, how to efficiently store DNA data has become a concern of many researc...

Application Information

IPC(8): G06F19/22
CPC: G16B30/00
Inventors: 熊红凯, 范雯敬
Owner: SHANGHAI JIAO TONG UNIV