Parallel rapid matching method and system for stored DNA sequence

A parallel fast matching method and system for DNA sequences, applied in the field of data compression. It addresses the low efficiency of DNA sequence matching and speeds up the matching operation.

Status: Inactive; Publication Date: 2016-11-09
SHENZHEN UNIV

AI Technical Summary

Problems solved by technology

[0005] In view of this, the purpose of the present invention is to provide a parallel fast matching method and syste



Examples


Detailed Description of the Embodiments

[0058] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0059] The specific embodiment of the present invention provides a parallel fast matching method for stored DNA sequences, which is applied to the compressed storage of DNA sequences, wherein the method mainly includes the following steps:

[0060] S11. Hash index construction step: build a prefix-based hash index over the reference genome in FASTA format. Find all k-mers that begin with the specified prefix and use them as the keys of a hash index table; each entry stores the positions at which its k-mer occurs;

[0061] S12. File splitting step: input the DNA sequence file in F...
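
The hash index construction of step S11 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `read_fasta` and `build_kmer_index`, the k-mer length, the prefix and the toy reference are all assumptions chosen for illustration, and the sketch handles only a single-record FASTA string held in memory.

```python
from collections import defaultdict


def read_fasta(text):
    """Join the sequence lines of a single-record FASTA string (header lines start with '>')."""
    return "".join(
        line.strip()
        for line in text.splitlines()
        if line and not line.startswith(">")
    ).upper()


def build_kmer_index(reference, k, prefix):
    """Map every k-mer that starts with `prefix` to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        if kmer.startswith(prefix):  # keep only k-mers with the specified prefix
            index[kmer].append(pos)
    return dict(index)


# Hypothetical toy reference; a real reference genome is far larger.
fasta = ">ref\nACGTACGTAC\nGTTACG\n"
reference = read_fasta(fasta)
index = build_kmer_index(reference, k=4, prefix="AC")
print(index)
```

Restricting the keys to k-mers with a given prefix keeps the table smaller than an index over every k-mer, at the cost of only being able to seed a match at positions where that prefix occurs.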



Abstract

The invention provides a parallel rapid matching method and system for stored DNA sequences, applied to the compressed storage of DNA sequences. The method comprises three steps. In the hash index construction step, a prefix-based hash index is built over the reference genome in FASTA format: all k-mers with the designated prefix are found and used as keys of a hash index table, and each entry stores the positions at which the corresponding k-mer occurs. In the file splitting step, the DNA sequence file in FASTQ format is input and split into sub-blocks. In the multithreaded processing step, multiple threads are started, with the number of tasks determined by the number of threads; the sub-blocks concurrently call a matching function that uses the k-mer hash index for fast positioning, so that they are matched in parallel against the target reference genome in FASTA format. Compressed storage is achieved by storing the matching result in place of the original DNA sequence.
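
The splitting and multithreaded matching steps summarized in the abstract can be sketched as follows. This is an illustrative sketch only: the helper names, the exact-match rule and the `(position, length)` result format are assumptions, since the excerpt does not specify the matching function. For brevity the sketch indexes every k-mer rather than only those with a designated prefix. In CPython, threads share the interpreter lock, so this shows the task structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fastq(text):
    """Return the sequence line of each 4-line FASTQ record."""
    lines = text.strip().splitlines()
    return [lines[i + 1] for i in range(0, len(lines), 4)]


def split_blocks(reads, n_blocks):
    """Split the reads into at most n_blocks contiguous sub-blocks."""
    size = max(1, -(-len(reads) // n_blocks))  # ceiling division
    return [reads[i:i + size] for i in range(0, len(reads), size)]


def build_index(reference, k):
    """Map every k-mer of the reference to its positions (no prefix filter here)."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index


def match_block(block, reference, index, k):
    """Seed each read with its first k-mer, then verify an exact match at each candidate position."""
    results = []
    for read in block:
        hit = None
        for pos in index.get(read[:k], []):
            if reference[pos:pos + len(read)] == read:
                hit = pos
                break
        # Store (position, length) in place of a matched read; keep unmatched reads verbatim.
        results.append((hit, len(read)) if hit is not None else read)
    return results


def match_parallel(reads, reference, k, n_threads):
    """Match one sub-block per task on a thread pool and return results in read order."""
    index = build_index(reference, k)
    blocks = split_blocks(reads, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(lambda b: match_block(b, reference, index, k), blocks))
    return [r for part in parts for r in part]


# Hypothetical toy data.
reference = "ACGTACGTACGTTACG"
fastq = "@r1\nGTTACG\n+\nIIIIII\n@r2\nACGTAC\n+\nIIIIII\n@r3\nTTTTTT\n+\nIIIIII\n"
results = match_parallel(parse_fastq(fastq), reference, k=4, n_threads=2)
print(results)
```

A matched read is replaced by a short `(position, length)` pair, which is where the space saving comes from; reads with no exact match are kept unchanged in this sketch, whereas a practical tool would also need to encode mismatches.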

Description

Technical field

[0001] The invention relates to the field of data compression, and in particular to a parallel fast matching method and system for stored DNA sequences.

Background

[0002] The development of next-generation sequencing technology has driven the generation of high-throughput DNA sequencing data. This data is growing exponentially, faster than the capacity of computer microprocessors and storage devices. Compressing high-throughput DNA sequencing data is an effective way to address the storage and transmission of DNA sequences. Before compressed storage, a common practice is to match the high-throughput sequencing data in the FASTQ sequence file to an existing genome, that is, the reference genome, whose file is in the FASTA format. The result of matching the target sequence to the reference genome is stored in place of the original sequence to achieve compressed storage, ...


Application Information

IPC(8): G06F19/20, G06F17/30, G06F9/48
CPC: G06F9/485, G06F16/2255, G16B25/00
Inventors: 朱泽轩, 邓清津, 储颖, 孙怡雯
Owner: SHENZHEN UNIV