Leon-RC compression method for genome sequencing data

A compression method for genome sequencing data, applied in the field of biological information. It addresses the problems of low compression rate, long anchor search times, and the failure to consider mirror repeats and inverted repeats, with the aim of optimizing the anchor dictionary construction process and reducing size.

Active Publication Date: 2019-01-22
SUN YAT SEN UNIV
6 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

[0016] Most compression algorithms, whether or not they are based on a reference genome, have a very low compression rate (throughput), because comparing and mapping the original data against the reference genome takes too long.
Leon uses a Bloom filter and a De Bruijn graph assembly algorithm, which can compare and map the original file against the provisional reference genome with a time complexity of O(1). In its implementation, however, the similarity features of second-generation sequencing data are not fully used: only direct repeats and complementary palindromes are considered.
If the matching of Kmers against the De Bruijn graph considers only direct repeats and complementary palindromes, and ignores possible mirror repeats and inverted repeats, some short reads may have no anchor found and therefore cannot be mapped to the provisional reference genome. In other cases an anchor exists, but finding it takes too long.
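The effect of considering only two of the four forms can be illustrated with a small sketch. The source does not define the four forms precisely, so the mapping used here is an assumption: direct repeat is the Kmer itself, mirror repeat is the reversed Kmer, inverted repeat is the complemented Kmer, and complementary palindrome is the reverse complement. The function names and the tiny `solid` set are hypothetical.

```python
# ASSUMPTION: the source does not define the four forms; this mapping is illustrative.
#   direct repeat            -> the Kmer itself
#   mirror repeat            -> reversed Kmer
#   inverted repeat          -> complemented Kmer
#   complementary palindrome -> reverse complement
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_forms(kmer):
    """Return the four assumed forms of a Kmer, keyed by name."""
    return {
        "direct": kmer,
        "mirror": kmer[::-1],
        "inverted": kmer.translate(COMPLEMENT),
        "complementary_palindrome": kmer[::-1].translate(COMPLEMENT),
    }


def matches(kmer, solid, forms):
    """True if any of the selected forms of `kmer` is in the solid set."""
    variants = kmer_forms(kmer)
    return any(variants[name] in solid for name in forms)


solid = {"AACG"}   # hypothetical set of solid Kmers
query = "GCAA"     # the mirror (reversed) form of "AACG"

# Two-form matching, as the text attributes to Leon.
two_forms = matches(query, solid, ("direct", "complementary_palindrome"))
# Four-form matching, as proposed for Leon-RC.
four_forms = matches(query, solid, tuple(kmer_forms(query)))
print(two_forms, four_forms)
```

In this toy case the two-form check misses the Kmer while the four-form check finds it, which is the situation the paragraph above describes.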


Examples

Embodiment 1

[0038] The present invention provides a Leon-RC compression method for genome sequencing data, which mainly improves the step in which the LEON algorithm constructs the anchor dictionary. It includes the following steps:

[0039] (1) Divide each short read into multiple Kmers;

[0040] (2) Select a Kmer, calculate the Kmer values of its direct repeat, mirror repeat, inverted repeat, and complementary palindrome, compare these four values, and take the smallest Kmer value;

[0041] (3) Look up the smallest Kmer value in the Bloom filter, which stores the Solid Kmers, to determine whether it is among the Solid Kmers. If it is, add the smallest Kmer value to the anchor dictionary and end the search; if it is not, take the next Kmer and repeat steps (2) and (3);

[0042] (4) If the smallest Kmer value of every Kmer is absent from the Solid Kmers, the short read has no anchor;

[0043] (5) Con...
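The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the mapping of the four forms to string operations, the 2-bit base encoding, the toy `BloomFilter` class, and the anchor dictionary layout (anchor value mapped to an index) are all hypothetical choices made for the example.

```python
import hashlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # ASSUMPTION: 2-bit encoding, not from the source


def kmer_value(kmer):
    """Pack a Kmer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def min_kmer_value(kmer):
    """Step (2): smallest value among the four forms (assumed mapping)."""
    forms = (
        kmer,                               # direct repeat
        kmer[::-1],                         # mirror repeat (assumed: reverse)
        kmer.translate(COMPLEMENT),         # inverted repeat (assumed: complement)
        kmer[::-1].translate(COMPLEMENT),   # complementary palindrome (reverse complement)
    )
    return min(kmer_value(form) for form in forms)


class BloomFilter:
    """Toy Bloom filter standing in for the Solid Kmer store."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, value):
        for seed in range(self.n_hashes):
            data = f"{seed}:{value}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, value):
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(value))


def find_anchor(read, k, solid):
    """Steps (1)-(4): first minimal Kmer value present in `solid`, or None."""
    for i in range(len(read) - k + 1):        # step (1): slide over the read
        candidate = min_kmer_value(read[i:i + k])
        if candidate in solid:                # step (3): membership lookup
            return candidate
    return None                               # step (4): no anchor for this read


def build_anchor_dictionary(reads, k, solid):
    """Collect the anchors found across all reads (hypothetical layout)."""
    anchors = {}
    for read in reads:
        anchor = find_anchor(read, k, solid)
        if anchor is not None:
            anchors.setdefault(anchor, len(anchors))
    return anchors


solid = BloomFilter()
solid.add(min_kmer_value("AACG"))
print(build_anchor_dictionary(["TTGCAAT", "GGGGGG"], 4, solid))
```

Because every form of a Kmer maps to the same minimal value, a single Bloom filter lookup per Kmer covers all four cases. A Bloom filter can return false positives, so a real implementation would need to tolerate or verify them.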

Embodiment 2

[0058] In this embodiment, compression tests are performed on next-generation sequencing data sets of different sizes, and the results are shown in Figure 3. Compared with Leon, Leon-RC significantly improves the compression speed while keeping the compression ratio unchanged. The SRR934718_1 file shows the largest gain in compression speed, from 56.16 Mb/s to 64.95 Mb/s, an increase of 15.6%.
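The reported gain for SRR934718_1 can be checked from the two throughput figures. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
leon_speed = 56.16      # Mb/s, Leon on SRR934718_1 (from the text)
leon_rc_speed = 64.95   # Mb/s, Leon-RC on SRR934718_1 (from the text)

# Relative gain in percent: (new - old) / old * 100
relative_gain = (leon_rc_speed - leon_speed) / leon_speed * 100
print(f"{relative_gain:.2f}%")
```

The result falls between 15.6% and 15.7%, consistent with the stated figure.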

Abstract

The invention relates to a Leon-RC compression method for genome sequencing data. The method mainly improves the step in which the LEON algorithm constructs an anchor dictionary. It includes the following steps: (1) a short read is divided into a plurality of Kmers; (2) one Kmer is selected, the Kmer values of its direct repeat, mirror repeat, inverted repeat, and complementary palindrome are calculated, and the four values are compared to obtain the minimum Kmer value; (3) the minimum Kmer value is looked up in a Bloom filter, which stores the Solid Kmers, to determine whether it is among the Solid Kmers; if it is, the minimum Kmer value is added to the anchor dictionary and the search ends; if it is not, the next Kmer is taken and steps (2) and (3) are repeated; (4) if the minimum Kmer value of every Kmer is absent from the Solid Kmers, the short read has no anchor; and (5) the anchor dictionary is constructed through steps (1) to (4).

Description

Technical field

[0001] The invention relates to the field of biological information, and more specifically to a Leon-RC compression method for genome sequencing data.

Background

[0002] There are two main kinds of existing NGS data compression methods. One is the compression algorithm based on a reference genome, such as QUIP, CRAM, PATHENC, and FASTQZ, in which the compressed file stores the mapping information between the short reads and the reference genome. The genomes of homologous species are highly similar: taking humans as an example, the genomes of any two people are as much as 99% identical, so a target genome can be stored in terms of its mapping to a reference.

[0003] The next-generation sequencing data compression process based on a reference genome is as follows:

[0004] (1) Select an appropriate reference genome; sequences from homologous species are advantageous as reference genomes because of their high similarity;

[0005] (2) Map the origi...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G16B20/30
Inventor: 雷志强, 李伟忠
Owner: SUN YAT SEN UNIV