DNA sequence clustering using locality-sensitive hashing based on standard entropy

A locality-sensitive hashing technique for DNA sequences, applicable to computer components, instruments, and computation, that addresses problems such as unstable clustering results, high time complexity, and the difficulty of computation on sequence data.

Inactive Publication Date: 2017-08-29
FUJIAN NORMAL UNIV

AI Technical Summary

Problems solved by technology

Traditional clustering algorithms such as the partition-based K-medoid algorithm and the hierarchical complete-link algorithm require pairwise comparison of sequences, which gives them a high time complexity. Because the number of DNA sequences is growing extremely fast, these algorithms cannot be applied to massive data sets.
The K-means algorithm requires the number of clusters to be fixed in advance, and the centroid of sequence data is not easy to calculate. Randomly chosen initial cluster centers make the clustering result unstable, and the algorithm performs poorly on biological sequence data.
The clustering algorithm based on the BAG graph gives effective results, but it needs to be guided by the clustering unit when dividing classes, and the number of sequences in a gene pool is so large that representing them all in an undirected graph is extremely difficult.




Embodiment Construction

[0029] As shown in Figure 1, the DNA sequence clustering method of the present invention, which uses locality-sensitive hashing based on standard entropy, comprises the following steps:

[0030] (1) The entire sequence to be tested is sequenced using second-generation sequencing technology to obtain a batch of short DNA fragments; each short fragment is called a DNA fragment sequence;

[0031] (2) The alphabet of a DNA fragment sequence is $\Sigma = \{A, C, G, T\}$, and $|\Sigma|$ denotes the number of letters in it. Initialize the word length $L$ of the words to be processed, and use a sliding window of fixed length to obtain the set $Y$ of words to be processed; the number of words in $Y$ is $|\Sigma|^L$. For each word to be processed, calculate its entropy value $h$ from its position information $X_t$;

[0032] The position information $X_t$ of a word to be processed refers to the reciprocal of the distance between the two correspondin...
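The sliding-window step in [0031] can be illustrated with a minimal Python sketch. The function names `all_words` and `word_positions` are hypothetical and chosen only for illustration. The sketch enumerates the $|\Sigma|^L$ possible words and records where each word starts in a fragment; it stops short of the entropy value $h$, because the definition of $X_t$ in [0032] is cut off in this text.

```python
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"  # the letter set Σ named in [0031]


def all_words(L):
    """Enumerate every possible word of length L over the alphabet.

    There are len(ALPHABET) ** L such words, i.e. |Σ|^L.
    """
    return ["".join(letters) for letters in product(ALPHABET, repeat=L)]


def word_positions(sequence, L):
    """Slide a window of length L over the sequence, one base at a time,
    and record the start index of every occurrence of each word.

    Hypothetical helper: the patent text only says that position
    information is used; how it feeds into the entropy is not shown here.
    """
    positions = defaultdict(list)
    for start in range(len(sequence) - L + 1):
        positions[sequence[start:start + L]].append(start)
    return dict(positions)


if __name__ == "__main__":
    print(len(all_words(2)))                  # number of 2-letter words
    print(word_positions("ACGTACGA", 2))      # start indices per word
```

Storing start indices rather than bare counts keeps the distances between repeated occurrences of a word available, which is the kind of information the position-based entropy in [0031] appears to rely on.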


Abstract

The invention discloses a DNA sequence clustering method using locality-sensitive hashing based on standard entropy. The method comprises: mapping the original DNA sequences according to an L-gram model; computing the standard entropy of the matrix formed by the LF entropy values of the N sequences; applying locality-sensitive hashing to the standard entropy to obtain a candidate set of DNA fragment sequences; and, within the candidate set, finding the DNA fragment sequences whose edit distance is less than d to obtain the clustering result. The method ensures that the transformed feature space retains enough of the original DNA information, so that information is not lost. Each DNA fragment sequence is mapped into the new space and a candidate set is computed for it, which increases the speed of the computation and improves its accuracy.
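The last two stages of the abstract (hashing into candidate sets, then filtering by edit distance below d) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented procedure: the text does not say which LSH family is used, so sign-of-random-projection hashing is swapped in here, and the per-sequence feature vectors (standing in for the standard-entropy values) are taken as given. The names `lsh_buckets` and `cluster_candidates` are hypothetical.

```python
import random
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete from a
                            curr[j - 1] + 1,       # insert into a
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def lsh_buckets(vectors, num_planes=8, seed=0):
    """Group vector indices by the sign pattern of random projections.

    ASSUMPTION: random-hyperplane LSH is used only as a stand-in; the
    source does not name the hash family.
    """
    if not vectors:
        return {}
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    buckets = defaultdict(list)
    for idx, vec in enumerate(vectors):
        key = tuple(sum(p * v for p, v in zip(plane, vec)) >= 0
                    for plane in planes)
        buckets[key].append(idx)
    return dict(buckets)


def cluster_candidates(sequences, vectors, d):
    """Compare only sequences that share a bucket; keep pairs with
    edit distance strictly less than d."""
    pairs = []
    for members in lsh_buckets(vectors).values():
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                i, j = members[x], members[y]
                if edit_distance(sequences[i], sequences[j]) < d:
                    pairs.append((i, j))
    return pairs
```

The point of the bucketing step is that the quadratic edit-distance comparison runs only inside each candidate set rather than across all N sequences.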

Description

Technical field

[0001] The invention relates to the field of biological information processing, in particular to DNA sequence clustering using locality-sensitive hashing based on standard entropy.

Background

[0002] With the advent of the Internet age and the development of information technology, gene sequencing technology has matured steadily. As various genome projects have progressed, the amount of biological data has grown exponentially, and traditional methods can no longer meet the demands of processing and analyzing data at this scale. Bioinformatics combines biology with computer technology and mathematics to acquire, process, extract, analyze, and store biological information, and to mine the location information of genetic material. Data mining is a technology that extracts useful and potentially valuable information from large amounts of data. ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/20, G06K9/62
CPC: G16B25/00, G06F18/23
Inventors: 江育娥, 徐彭娜, 林劼
Owner: FUJIAN NORMAL UNIV