Search algorithm based on DNA k-mer index problem four-node list trie tree

A search algorithm and DNA sequence technology, applied in computing, digital data processing, and special data processing applications, that addresses slow computing speed and large storage requirements and saves node space.

Inactive Publication Date: 2017-03-08

HARBIN ENG UNIV

Problems solved by technology

However, this type of method is applicable only to small k. When k is large, the storage requirement becomes excessive and computation slows down because of the large values involved.

Embodiment Construction

[0025] The present invention is described in more detail below with reference to the accompanying drawings and examples:

[0026] The method optimizes the original data of the traditional dictionary tree (trie) and saves storage space. At the same time, a leaf node serves as the end mark of a k-mer, which makes it easier to return query results and reduces the complexity of word search.

[0027] A four-word linked list dictionary tree retrieval algorithm based on the DNA k-mer index problem comprises two steps: establishing a four-word retrieval dictionary tree model, and word search. It is characterized by further improving the dictionary tree model: the original data are preprocessed, and the leaf nodes of the dictionary tree serve as end-of-word marks. This processing does not affect query speed, yet it saves storage space and reduces space complexity.
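As a rough illustration of the model described in [0027], the following Python sketch maps each base to one of four child slots and treats a leaf node as the end-of-word mark, so no separate terminal flag is stored. The names `BASE_TO_INDEX`, `Node`, `insert`, and `contains` are hypothetical, and the sketch assumes that all stored k-mers have the same length k and contain only A, C, G, and T. It is one possible reading of the text, not the patent's implementation.

```python
# Hypothetical sketch: a trie with four children per node (one per DNA base).
# Assumption: every stored k-mer has the same length k, so a node at depth k
# is always a leaf and can double as the end-of-word mark.

BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # preprocessing: letter mapping


class Node:
    __slots__ = ("children",)

    def __init__(self):
        # Four child slots instead of the 26 of an alphabetic trie.
        self.children = [None, None, None, None]

    def is_leaf(self):
        return all(child is None for child in self.children)


def insert(root, kmer):
    """Add one k-mer to the trie, creating nodes along its path as needed."""
    node = root
    for base in kmer:
        i = BASE_TO_INDEX[base]
        if node.children[i] is None:
            node.children[i] = Node()
        node = node.children[i]


def contains(root, kmer):
    """Return True if kmer was inserted; the path must end on a leaf."""
    node = root
    for base in kmer:
        i = BASE_TO_INDEX.get(base)
        if i is None:
            return False  # character outside A, C, G, T
        node = node.children[i]
        if node is None:
            return False
    return node.is_leaf()
```

A lookup follows at most k child links, so under these assumptions its cost depends on k rather than on how many k-mers are stored.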

[0028] A four-word linked list dictionary tree retrieval algorithm based on t...

Abstract

The invention relates to the field of data structures and big data processing, and in particular to a novel fast search algorithm based on a trie. The method comprises: establishing a four-node trie model that takes the four bases of a DNA sequence as system input; establishing a search list at the trie's terminal nodes, determining the terminal end mark, not distinguishing base sequences, and establishing a model that derives sequence numbers and base-pair numbers in reverse at query time; building a DNA sequence index and analyzing its complexity; obtaining the positions of substrings, hooking a search list onto each leaf child node, and storing the position data there; and querying short k-mer strings and analyzing the query complexity. The longer the common prefix of the words, the faster the trie query; the complexity varies with k, is essentially constant, and is almost unaffected by data volume. Letter mapping is applied to the original data so that the 26 child nodes of a trie node are reduced to 4, which saves node space.
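The indexing and query steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented method: the names `TrieNode`, `build_kmer_index`, and `query` are hypothetical, the input is assumed to contain only A, C, G, and T, and the "search list" hooked onto a leaf is modeled as a plain Python list of start positions.

```python
# Hypothetical sketch: index every k-mer of a DNA sequence in a four-child
# trie and attach a list of start positions to each leaf.

BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}  # letter mapping: 4 slots, not 26


class TrieNode:
    __slots__ = ("children", "positions")

    def __init__(self):
        self.children = [None, None, None, None]
        self.positions = None  # becomes a list only on leaves (depth k)


def build_kmer_index(sequence, k):
    """Insert each length-k substring and record where it starts."""
    root = TrieNode()
    for start in range(len(sequence) - k + 1):
        node = root
        for base in sequence[start:start + k]:
            i = BASE_TO_INDEX[base]  # assumes only A, C, G, T occur
            if node.children[i] is None:
                node.children[i] = TrieNode()
            node = node.children[i]
        if node.positions is None:
            node.positions = []
        node.positions.append(start)
    return root


def query(root, kmer):
    """Return the start positions of kmer, or an empty list if absent."""
    node = root
    for base in kmer:
        i = BASE_TO_INDEX.get(base)
        if i is None:
            return []
        node = node.children[i]
        if node is None:
            return []
    return list(node.positions) if node.positions is not None else []


index = build_kmer_index("ACGTACGA", 3)
print(query(index, "ACG"))  # [0, 4]
```

A query walks exactly k links regardless of the sequence length, which is consistent with the abstract's statement that the complexity depends on k and is almost unaffected by data volume.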

Description

Technical field

[0001] The invention belongs to the field of data structures and big data processing, and in particular relates to a four-word linked list dictionary tree retrieval algorithm based on the DNA k-mer index problem.

Background

[0002] Current projects such as the 1000 Genomes Project, the International HapMap Project, and the Mendelian Inherited Disease Project have used next-generation sequencing technology to generate massive amounts of DNA sequencing data, also known as high-throughput sequencing data, causing bioinformatics data to grow explosively. In life sciences research, people have gradually realized that it is necessary not only to use physical, chemical, and biological methods to study the material basis of life, energy conversion, metabolic processes, and so on, but also to use information science methods to study life information, especially genetic information: its organization, reproduction, transmission, expression and their fun...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/2458, G06F16/2246

Inventor: 王辉, 张旭, 魏智红, 童丽峰, 张一, 毕文鹏, 贲浩然, 车超

Owner: HARBIN ENG UNIV
