Leon-rc compression method for genome sequencing data

A compression method for genome sequencing data, applied in the field of bioinformatics. It addresses the problems of low compression rate, long anchor-point search time, and the failure to consider mirror repeats and inverted repeats, with the stated effect of reducing size.

Active Publication Date: 2022-03-29
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0016] Most compression algorithms, whether based on a reference genome or reference-free, have a very low compression rate, because comparing/mapping the original data against the reference genome takes too long.
Leon uses a Bloom filter and a De Bruijn graph assembly algorithm, which can compare and map the original file against the provisional reference genome efficiently with O(1) time complexity. In its implementation, however, the similarity features of second-generation sequencing data are not fully exploited: only direct repeats and complementary palindromes are considered.
If the possible mirror repeats and inverted repeats are ignored when Kmers are matched against the De Bruijn graph, some short reads may find no anchor and therefore cannot be mapped to the provisional reference genome. In other cases an anchor exists, but finding it takes too long.
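
As a rough illustration of the constant-time membership test mentioned above, here is a minimal Bloom filter sketch. It is not Leon's implementation: the class name `BloomFilter`, the bit-array size, the number of hash functions, and the use of SHA-256 as a hash source are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; sizes and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


solid_kmers = BloomFilter()
solid_kmers.add("AATC")
print("AATC" in solid_kmers)
```

Each lookup touches a fixed number of bits regardless of how many Kmers are stored, which is where the O(1) cost comes from. The price is a small false-positive probability; a stored Kmer is never reported as absent.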

Examples

Embodiment 1

[0038] The present invention provides a Leon-RC compression method for genome sequencing data, which mainly improves the step in which the Leon algorithm constructs the anchor dictionary. It includes the following steps:

[0039] (1) Divide the short read into multiple Kmers;

[0040] (2) Select one Kmer and calculate the Kmer values of its direct repeat, mirror repeat, inverted repeat, and complementary palindrome; compare these four values and take the smallest Kmer value;

[0041] (3) Look up the smallest Kmer value in the Bloom filter, which stores the Solid Kmers, to determine whether it is among the Solid Kmers; if it is, add the smallest Kmer value to the anchor dictionary and end the lookup; if it is not, take the next Kmer and repeat steps (2) and (3);

[0042] (4) If the smallest Kmer value of every Kmer is absent from the Solid Kmers, the short read has no anchor;

[0043] (5) Construct an anchor dictionary by st...
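
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The text does not define the four transformations precisely, so the mapping used here is an assumption: direct repeat is the Kmer itself, mirror repeat is the reversed Kmer, inverted repeat is the complemented Kmer, and complementary palindrome is the reverse complement. Lexicographic string order stands in for the numeric "Kmer value", and a plain Python `set` stands in for the Bloom filter of Solid Kmers. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def split_into_kmers(read, k):
    """Step (1): every overlapping substring of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def smallest_of_four(kmer):
    """Step (2): smallest of the four variants (mapping is an assumption)."""
    direct = kmer
    mirror = kmer[::-1]                        # assumed: reversed sequence
    inverted = kmer.translate(COMPLEMENT)      # assumed: complemented sequence
    comp_palindrome = inverted[::-1]           # assumed: reverse complement
    return min(direct, mirror, inverted, comp_palindrome)


def find_anchor(read, k, solid_kmers):
    """Steps (3)-(4): first Kmer whose smallest variant is solid, else None."""
    for kmer in split_into_kmers(read, k):
        candidate = smallest_of_four(kmer)
        if candidate in solid_kmers:           # a set stands in for the Bloom filter
            return candidate
    return None                                # the read has no anchor


def build_anchor_dictionary(reads, k, solid_kmers):
    """Collect one anchor per read, assigning each new anchor an index."""
    anchors = {}
    for read in reads:
        anchor = find_anchor(read, k, solid_kmers)
        if anchor is not None and anchor not in anchors:
            anchors[anchor] = len(anchors)
    return anchors


print(build_anchor_dictionary(["GATTACA", "CCCC"], 4, {"AATC"}))
```

With `k = 4`, the first Kmer of `GATTACA` is `GATT`, whose four variants are `GATT`, `TTAG`, `CTAA`, and `AATC`; the smallest, `AATC`, is in the solid set and becomes the anchor. The read `CCCC` has no solid variant, so it contributes nothing.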

Embodiment 2

[0058] This embodiment compresses second-generation sequencing data sets of different sizes; the results of the compression test are shown in Figure 3. Compared with Leon, Leon-RC significantly improves the compression rate while the compression ratio remains unchanged. The largest gain is for the SRR934718_1 file, whose compression rate rises from 56.16 MB/s to 64.95 MB/s, an increase of as much as 15.6%.
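
The quoted gain can be checked with a short calculation. The helper name `relative_increase` is hypothetical; the two throughput figures are the ones given above.

```python
def relative_increase(before, after):
    """Percentage increase from `before` to `after`."""
    return (after - before) / before * 100.0


gain = relative_increase(56.16, 64.95)
print(f"{gain:.2f}%")  # prints 15.65%
```

The result, roughly 15.65%, matches the 15.6% quoted in the text if the figure is truncated rather than rounded.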

Abstract

The invention relates to a Leon-RC compression method for genome sequencing data, which mainly improves the step in which the Leon algorithm constructs the anchor dictionary. It includes the following steps: (1) divide the short read into multiple Kmers; (2) select one Kmer, calculate the Kmer values of its direct repeat, mirror repeat, inverted repeat, and complementary palindrome, compare these four values, and obtain the smallest Kmer value; (3) look up the smallest Kmer value in the Bloom filter, which stores the Solid Kmers, and determine whether it is among the Solid Kmers; if it is, add the smallest Kmer value to the anchor dictionary and end the search; if it is not, take the next Kmer and repeat steps (2) and (3); (4) if the smallest Kmer value of every Kmer is absent from the Solid Kmers, the short read has no anchor point; (5) construct the anchor dictionary through steps (1) to (4).

Description

Technical field

[0001] The present invention relates to the field of bioinformatics, and more particularly to a Leon-RC compression method for genome sequencing data.

Background technique

[0002] There are two main kinds of second-generation sequencing data compression methods. One is compression based on a reference genome, such as QuIP, CRAM, PATHENC, and FASTQZ, in which the compressed file stores the mapping information between the short reads and the reference genome. The genomes of homologous species are highly similar: taking humans as an example, the genomes of any two people are as much as 99% similar, so once the reference genome is available, storing the remaining 1% of differing information is enough to store the target genome.

[0003] The second-generation sequencing data compression process based on the reference genome is as follows:

[0004] (1) Select an appropriate reference genome; sequences of homologous species have advantages due ...

Application Information

Patent Type & Authority Patents(China)
IPC IPC(8): G16B20/30
Inventor 雷志强, 李伟忠
Owner SUN YAT SEN UNIV