Search algorithm based on DNA k-mer index problem four-node list trie tree

A search algorithm for DNA sequences, applicable to computing, digital data processing, special data processing applications, etc., that addresses problems such as slow computation and large storage requirements and achieves the effect of saving node space.

Inactive Publication Date: 2017-03-08
HARBIN ENG UNIV
5 Cites, 8 Cited by

AI Technical Summary

Problems solved by technology

However, this type of method is only practical for small k. When k is large, the storage requirement becomes too large and computation slows down because of the size of the values involved.
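To make the storage concern concrete, here is a rough sketch of how a table-based index grows with k. The source does not say which method it is criticizing, so the dense table with one slot per possible k-mer, the function names, and the 4-byte slot size are all assumptions for illustration. The only fixed fact is that an alphabet of four bases yields 4^k distinct k-mers.

```python
# Hypothetical model: a dense table with one slot for every possible k-mer.
# The source does not name the method it criticizes; this is an assumption.

def dense_table_slots(k: int) -> int:
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


def dense_table_bytes(k: int, bytes_per_slot: int = 4) -> int:
    """Storage for the table, assuming a fixed (illustrative) slot size."""
    return dense_table_slots(k) * bytes_per_slot


if __name__ == "__main__":
    for k in (8, 12, 16, 20):
        print(k, dense_table_slots(k), dense_table_bytes(k))
```

Under these assumptions the table size quadruples each time k increases by one, which is why such a layout is only workable for small k.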



Examples


Embodiment Construction

[0025] The present invention is described in more detail below with reference to the accompanying drawings and examples:

[0026] The method optimizes how the traditional dictionary tree (trie) handles the original data and thereby saves storage space. In addition, the leaf node serves as the end mark of a k-mer, which makes it easy to return query results and reduces the complexity of word search.

[0027] A four-word linked-list dictionary tree (trie) retrieval algorithm based on the DNA k-mer index problem comprises two steps: establishing a four-word retrieval dictionary tree model, and word search. It is characterized by further improving the dictionary tree model: the original data are preprocessed, and the leaf nodes of the dictionary tree are used as word end marks. This processing does not affect query speed, while saving storage space and reducing space complexity.
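The two steps in [0027] can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `BASE_TO_SLOT`, `Node`, and `KmerTrie` are hypothetical, and the sketch assumes every stored word has the same length k and contains only A, C, G, and T. Under that assumption a node at depth k is always a leaf, so reaching it marks the end of the word and no separate end-of-word flag is stored.

```python
# Hypothetical sketch: a trie with four child slots per node, where the leaf
# itself acts as the end-of-word mark (all stored k-mers share one length k).

BASE_TO_SLOT = {"A": 0, "C": 1, "G": 2, "T": 3}  # preprocessing: base -> slot


class Node:
    __slots__ = ("children",)

    def __init__(self):
        # Four child slots instead of the 26 of an English-letter trie.
        self.children = [None, None, None, None]


class KmerTrie:
    def __init__(self, k):
        self.k = k
        self.root = Node()

    def insert(self, kmer):
        if len(kmer) != self.k:
            raise ValueError("expected a word of length k")
        node = self.root
        for base in kmer:
            slot = BASE_TO_SLOT[base]  # raises KeyError for non-ACGT input
            if node.children[slot] is None:
                node.children[slot] = Node()
            node = node.children[slot]

    def contains(self, kmer):
        if len(kmer) != self.k:
            return False
        node = self.root
        for base in kmer:
            slot = BASE_TO_SLOT.get(base)
            if slot is None:
                return False
            node = node.children[slot]
            if node is None:
                return False
        # Depth k reached: this node is a leaf, which is the end mark.
        return True
```

A lookup follows at most k child pointers, so its cost depends on k rather than on how many k-mers are stored.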

[0028] A four-word linked list dictionary tree retrieval algorithm based on t...


Abstract

The invention relates to the field of data structures and big data processing, and in particular to a novel fast search algorithm based on a trie, comprising: establishing a four-node trie model that takes the four bases of a DNA sequence as system inputs; establishing a lookup list at the trie terminals, determining the terminal end mark, not distinguishing between base sequences, and establishing a model that derives sequence numbers and base-pair numbers in reverse at query time; establishing a DNA sequence index and analyzing its complexity; acquiring the positions of substrings, attaching a lookup list to each leaf node, and storing the position data; and querying short k-mer strings and analyzing the query complexity. The longer the common prefix of the words, the faster the trie query; the complexity varies with k but is essentially constant and is almost unaffected by the amount of data. Letter mapping is applied to the original data so that the 26 child nodes of the trie are reduced to 4, saving node space.
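The indexing and query steps in the abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the patent's method: `build_index` and `query` are hypothetical names, input sequences are assumed to contain only A, C, G, and T, and each leaf stores `(sequence number, offset)` pairs directly instead of the reverse-derivation model the abstract mentions.

```python
# Hypothetical sketch: a four-slot trie whose leaves carry a list of
# (sequence number, offset) pairs for every occurrence of that k-mer.

BASE_TO_SLOT = {"A": 0, "C": 1, "G": 2, "T": 3}  # letter mapping


class Node:
    __slots__ = ("children", "positions")

    def __init__(self):
        self.children = [None, None, None, None]
        self.positions = None  # only leaves (depth k) receive a list


def build_index(sequences, k):
    """Insert every length-k substring of every sequence into the trie."""
    root = Node()
    for seq_id, seq in enumerate(sequences):
        for offset in range(len(seq) - k + 1):
            node = root
            for base in seq[offset:offset + k]:
                slot = BASE_TO_SLOT[base]  # assumes only A/C/G/T appear
                if node.children[slot] is None:
                    node.children[slot] = Node()
                node = node.children[slot]
            if node.positions is None:
                node.positions = []
            node.positions.append((seq_id, offset))
    return root


def query(root, kmer):
    """Return all (sequence number, offset) pairs where kmer occurs."""
    node = root
    for base in kmer:
        slot = BASE_TO_SLOT.get(base)
        if slot is None:
            return []
        node = node.children[slot]
        if node is None:
            return []
    # Internal nodes have no list, so words shorter than k return nothing.
    return list(node.positions or [])


if __name__ == "__main__":
    index = build_index(["ACGTACG", "TACG"], 3)
    print(query(index, "ACG"))
```

A query walks one child pointer per base, so it takes k steps however many sequences were indexed. This matches the abstract's remark that the cost depends on k and is nearly unaffected by data volume.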

Description

Technical field

[0001] The invention belongs to the field of data structures and big data processing, and in particular relates to a four-word linked-list dictionary tree (trie) retrieval algorithm based on the DNA k-mer index problem.

Background

[0002] Ongoing projects such as the 1000 Genomes Project, the International HapMap Project, and the Mendelian Inherited Disease Project have used next-generation sequencing technology to generate massive amounts of DNA sequencing data, also known as high-throughput sequencing data, causing explosive growth in bioinformatics data. In life science research, people have gradually realized that it is necessary not only to use physical, chemical, and biological methods to study the material basis of life, energy conversion, metabolic processes, and so on, but also to use information science methods to study life information, especially the organization, reproduction, transmission, and expression of genetic information and their fun...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/2458, G06F16/2246
Inventor: 王辉张旭魏智红童丽峰张一毕文鹏贲浩然车超
Owner: HARBIN ENG UNIV