Gene sequence similarity calculating method based on multiple k values

A similarity calculation technique for gene sequences, applied in sequence analysis, bioinformatics and related instruments. It addresses the loss of important information when all "k-mers" are treated alike and the unavoidable, open-ended selection of the parameter k, and it achieves high comparison accuracy without guessing an optimal k.

Active Publication Date: 2019-06-18
EAST CHINA NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

Some segments of a sequence may be more meaningful for the current sequence comparison and carry more important information, while other segments may matter less. Treating every "k-mer" feature in the same way can therefore lose important information.
Related methods assign different weights to different "k-mers" to improve comparison accuracy, but they still cannot avoid selecting the parameter k.

Examples

Embodiment 1

[0026] Referring to Figure 2, the present invention carries out the gene sequence similarity calculation method based on multiple k values according to the following steps:

[0027] (1) Calculate the "Markov" background probability of the gene sequence

[0028] Step 1: Extract the "k-mer" sets of the sequences. Set a value range for k, S = [kmin, kmax].

[0029] For each value of k in S, a "k-mer" set $d_k$ is extracted from the two sequences by a sliding window of size k; $d_k$ is of size …, and the sets for all the different values of k together form the set d.
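
As a minimal sketch of Step 1, the following Python code slides a window of size k over two sequences for every k in an inclusive range. The function names, the example sequences, and the choice to take the union of the k-mers of both sequences as $d_k$ are assumptions made for illustration; the text does not spell out these details.

```python
def extract_kmers(seq, k):
    """Return every overlapping length-k window of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_sets(seq_x, seq_y, k_min, k_max):
    """Map each k in [k_min, k_max] to the distinct k-mers found in either sequence.

    Assumption: d_k is taken as the union over both sequences.
    """
    d = {}
    for k in range(k_min, k_max + 1):  # range of k treated as inclusive
        d[k] = set(extract_kmers(seq_x, k)) | set(extract_kmers(seq_y, k))
    return d


# Hypothetical toy sequences, not taken from the patent.
d = kmer_sets("ACGTAC", "CGTTAC", 2, 3)
```

Here `d[2]` and `d[3]` hold the distinct 2-mers and 3-mers of the toy sequences, and the dictionary `d` as a whole plays the role of the combined set over all k.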

[0030] Step 2: Estimate the "Markov" transition probabilities by maximum likelihood. Assuming that each "k-mer" obeys the "Markov" model, use the maximum likelihood method on the subsequence distribution of the two sequences x and y to obtain the transition probability $T(S_i, S_j)$ of the "Markov" model.
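
For illustration, the maximum likelihood estimate of a Markov transition probability is the observed count of a transition divided by the total count of transitions leaving the same state. The sketch below assumes a first-order chain whose states are single nucleotides and pools the counts of both sequences; the order of the model and the function name are assumptions, since the text does not state them.

```python
from collections import Counter


def transition_probabilities(sequences):
    """Maximum likelihood estimate of first-order transition probabilities.

    Assumption: states are single characters and counts are pooled over
    all given sequences (no transition is counted across sequences).
    """
    pair_counts = Counter()
    for seq in sequences:
        pair_counts.update(zip(seq, seq[1:]))  # adjacent (a, b) pairs

    # Total number of transitions leaving each state.
    row_totals = Counter()
    for (a, _b), n in pair_counts.items():
        row_totals[a] += n

    return {(a, b): n / row_totals[a] for (a, b), n in pair_counts.items()}


# Hypothetical toy sequences x and y.
T = transition_probabilities(["ACGTAC", "CGTTAC"])
```

Only transitions that actually occur receive an entry, and the estimated probabilities leaving any one state sum to 1.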

[0031] Step 3: According to the "Markov" model, the probability of occurrence of each "k-mer" in the transition i...
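
Step 3 is cut off here, so its exact formula is not available. As a generic illustration only, a first-order Markov chain assigns a word the probability of its first symbol multiplied by the transition probabilities along the word. The function name, the uniform toy parameters, and the first-order assumption below are hypothetical and are not taken from the patent.

```python
def kmer_probability(kmer, base_freq, trans):
    """Probability of kmer under a first-order Markov chain.

    base_freq: probability of each starting symbol.
    trans:     dict mapping (a, b) to the probability of moving from a to b.
    Unseen symbols or transitions are given probability 0.
    """
    prob = base_freq.get(kmer[0], 0.0)
    for a, b in zip(kmer, kmer[1:]):
        prob *= trans.get((a, b), 0.0)
    return prob


# Hypothetical uniform background, chosen only to keep the arithmetic obvious.
base_freq = {s: 0.25 for s in "ACGT"}
trans = {(a, b): 0.25 for a in "ACGT" for b in "ACGT"}
p = kmer_probability("ACG", base_freq, trans)  # 0.25 * 0.25 * 0.25
```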

Abstract

The invention discloses a gene sequence similarity calculation method based on multiple k values. The "k-mers" extracted for the several values of k are combined into an integrated vector of counts and frequencies, the importance of each "k-mer" is identified automatically, and three weighted D2-type similarities are obtained: MKWD2, MKWDs2 and MKWD*2. Compared with the prior art, the method extracts features accurately, keeps the analysis simple, simplifies biological sequence processing, achieves high comparison accuracy, shortens sequence comparison time, and improves working efficiency. It removes the need to guess an optimal parameter k and so largely resolves the problem of the open-ended, arbitrary selection of k.

Description

Technical field

[0001] The invention relates to the technical field of gene sequence comparison, in particular to a gene sequence similarity calculation method based on multiple k values.

Background

[0002] The calculation of gene sequence similarity is of great significance for analyzing the functional and structural relationships of gene sequences and for constructing biological evolutionary trees. At present, alignment-based methods are generally used to compare gene sequences. These methods are time-consuming and laborious, and have low accuracy and low efficiency. With the development of science and technology, and of next-generation gene sequencing in particular, biological data has grown exponentially, and alignment-based comparison falls far short of the time requirements for processing large-scale gene sequence data. In order to speed up the proces...

Application Information

IPC(8): G16B30/00
Inventor: 钱莹, 章炯民, 张雨
Owner: EAST CHINA NORMAL UNIVERSITY