
Similarity analysis method of negative sequence mode based on biological sequence, implementation system and medium

A similarity analysis technique for biological sequences, applied in the field of high-utility negative sequence rules, that addresses problems such as the lack of a similarity measurement method for negative sequence patterns.

Active Publication Date: 2021-01-05
山东元竞信息科技有限公司
7 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0004] Existing similarity analysis methods mainly target positive sequential patterns (PSP). For the negative sequential patterns (NSP) mined in earlier work, a unified similarity measurement method is still lacking.
In addition, sequence alignment has some disadvantages, which has prompted researchers to look for other methods of comparing DNA sequence similarity.



Examples


Embodiment 1

[0083] A similarity analysis method based on negative sequence patterns of biological sequences, as shown in Figure 1, includes the following steps:

[0084] (1) Data preprocessing

[0085] Each sequence or genome to be processed is preprocessed before frequent pattern mining. The letters in the DNA sequence are represented by numbers. Because DNA sequences are very long, the numerically encoded sequence is divided into several blocks, each containing the same number of bases, and the resulting blocks are used as the data set for frequent pattern mining;

[0086] In the present invention, each sequence is first divided into several blocks, and each block is composed of the same number of continuous bases. These blocks are independent of each other, and the block size can vary in practice. Note that if the size of the last block is smaller than the specified block size, then this block will be discarded. To mak...
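The preprocessing described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the source says letters are represented by numbers but does not give the mapping, so the `BASE_TO_INT` table is hypothetical, as are the function names and the block size used in the example.

```python
# Hypothetical letter-to-number mapping; the source does not specify one.
BASE_TO_INT = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(sequence):
    """Represent each base letter of a DNA sequence by a number."""
    return [BASE_TO_INT[base] for base in sequence.upper()]


def split_into_blocks(numbers, block_size):
    """Split an encoded sequence into consecutive, equally sized blocks.

    A trailing block shorter than block_size is discarded, as the text states.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    n_full = len(numbers) // block_size
    return [numbers[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Ten bases with block size 4: two full blocks, the last two bases are dropped.
blocks = split_into_blocks(encode("ACGTTGCAAC"), 4)
print(blocks)
```

Each resulting block then acts as one record of the data set passed to frequent pattern mining.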

Embodiment 2

[0099] This embodiment follows the similarity analysis method based on negative sequence patterns of biological sequences described in Embodiment 1, with the following difference:

[0100] In step (2), the f-NSP algorithm is used to mine the data set D, as follows:

[0101] A. Use the GSP algorithm to obtain all frequent positive sequences, and store the bitmap corresponding to each frequent positive sequence in a hash table. This includes:

[0102] a. Scan the data set to obtain all sequence patterns of length 1 and put them into the original seed set P_1;

[0103] b. From the original seed set P_1, take the sequence patterns of length 1 and join them to generate the candidate sequence set C_2 of length 2; prune the candidate sequence set C_2 using the Apriori property, then scan the candidate sequence set C_2 to determine the support of the remaining sequences, and save the sequence patterns whose support is hig...
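Steps a and b can be illustrated with a minimal Python sketch of the first two GSP levels. It is not the patent's implementation: the bitmap and hash-table storage from step A is omitted, support is assumed to be the fraction of blocks that contain a pattern as an ordered (not necessarily contiguous) subsequence, and the threshold comparison `>= min_sup` is an assumption because the source text is cut off at that point. All function names are hypothetical.

```python
from itertools import product


def contains(block, pattern):
    """True if pattern occurs in block as an ordered subsequence."""
    remaining = iter(block)
    # Each membership test consumes the iterator up to the matched element.
    return all(item in remaining for item in pattern)


def support(dataset, pattern):
    """Fraction of blocks in the data set that contain the pattern."""
    return sum(contains(block, pattern) for block in dataset) / len(dataset)


def frequent_up_to_length_2(dataset, min_sup):
    """Return (P1, P2): frequent patterns of length 1 and of length 2."""
    items = sorted({x for block in dataset for x in block})
    # Step a: the seed set P1 holds the frequent patterns of length 1.
    p1 = [(x,) for x in items if support(dataset, (x,)) >= min_sup]
    # Step b: join P1 with itself to build the candidate set C2. Every
    # length-1 subsequence of a candidate is already in P1, so Apriori
    # pruning removes nothing at this level.
    c2 = [a + b for a, b in product(p1, repeat=2)]
    p2 = [c for c in c2 if support(dataset, c) >= min_sup]
    return p1, p2


dataset = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
p1, p2 = frequent_up_to_length_2(dataset, min_sup=0.6)
print(p1)
print(p2)
```

Longer patterns would be generated the same way, by joining the frequent patterns of length k-1 and pruning every candidate that has an infrequent subsequence.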

Embodiment 3

[0116] This embodiment follows the similarity analysis method based on negative sequence patterns of biological sequences described in Embodiment 1, with the difference that in step (3) the maximum frequent positive and negative sequence patterns are represented graphically. This includes constructing a purine-pyrimidine diagram in the complex plane: the first and second quadrants hold the purines, including A, G and their negative counterparts, and the third and fourth quadrants hold the pyrimidines, including T, C and their negative counterparts:

[0117] (b+di)→A(Ⅰ)

[0118] (d+bi)→G(Ⅱ)

[0119] (b-di)→T(Ⅲ)

[0120] (d-bi)→C(Ⅳ)

[0121]

[0122]

[0123]

[0124]

[0125] The unit vectors of the four nucleotides A, G, T, C and of their corresponding negative sequences are given by formula (Ⅰ) to formula (Ⅷ).

[0126] In formula (Ⅰ) to formula (Ⅷ), b and d are non-zero real numbers; A and T are conjugate, as are G and C, i.e., A, T, C, G represent actual base pairs, ...
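The mapping in formulas (Ⅰ) to (Ⅳ) can be checked numerically with a short Python sketch. The values b = 0.6 and d = 0.8 are hypothetical, chosen only so that each vector has unit length; the source requires only that b and d be non-zero real numbers. Formulas (Ⅴ) to (Ⅷ) for the negative sequences are not legible in this text, so they are not modeled here.

```python
def base_vectors(b, d):
    """Map the four bases to complex numbers as in formulas (I) to (IV)."""
    if b == 0 or d == 0:
        raise ValueError("b and d must be non-zero real numbers")
    return {
        "A": complex(b, d),   # (b + di) -> A
        "G": complex(d, b),   # (d + bi) -> G
        "T": complex(b, -d),  # (b - di) -> T
        "C": complex(d, -b),  # (d - bi) -> C
    }


def to_numeric_sequence(pattern, vectors):
    """Convert a pattern of base letters into a sequence of complex numbers."""
    return [vectors[base] for base in pattern]


# Hypothetical parameter values with b**2 + d**2 == 1.
vectors = base_vectors(0.6, 0.8)
print(vectors["A"].conjugate() == vectors["T"])  # A and T are conjugate
print(vectors["G"].conjugate() == vectors["C"])  # G and C are conjugate
print(to_numeric_sequence("AGTC", vectors))
```

The conjugate pairs A/T and G/C correspond to the actual base pairs mentioned in the text.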


Abstract

The invention relates to a similarity analysis method based on negative sequence patterns of biological sequences, an implementation system and a medium. The similarity analysis method comprises the following steps: (1) data preprocessing: representing the letters in a DNA sequence with numbers, dividing the data into a plurality of blocks, and taking the obtained blocks as the data set for frequent pattern mining; (2) frequent pattern mining: using the f-NSP algorithm to mine the data set; (3) representing the maximum frequent positive and negative sequence patterns graphically, and converting the maximum frequent positive and negative sequence patterns into a numeric sequence; (4) similarity analysis of the DNA sequences: computing the similarity of different DNA sequences, and selecting the DNA sequence corresponding to the minimum similarity as the DNA sequence to be studied. With this method, negative sequences can be effectively represented and analyzed, and different analysis results can be obtained by selecting different combinations of maximum frequent patterns, so that the memory and time consumption of the computer are greatly reduced.
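Step (4) could look roughly like the Python sketch below. The abstract does not state which similarity measure is used, so Euclidean distance between equal-length numeric vectors is an assumption made purely for illustration; the function names and the feature vectors in the example are likewise hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors.

    Assumed measure; the source does not name the one it uses.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def closest(query, candidates):
    """Name of the candidate whose vector has the smallest distance to query."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))


# Hypothetical feature vectors derived from the numeric sequences of step (3).
candidates = {"seq1": [1.0, 2.5], "seq2": [4.0, 6.0]}
print(closest([1.0, 2.0], candidates))
```

Any other distance or similarity function could be substituted in `closest` without changing the selection logic.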

Description

Technical field

[0001] The invention relates to a similarity analysis method based on negative sequence patterns of biological sequences, an implementation system and a medium, and belongs to the technical field of applying high-utility negative sequence rules for decision-making.

Background

[0002] In recent years, a large amount of biological sequence data has been obtained. With the advancement of DNA and protein sequencing technology, interpreting the information contained in biological sequence data has become very important, especially the genetic and regulatory information in DNA sequences and the relationship between protein sequence, structure and function. The demand for data analysis tools has increased accordingly, and sequence similarity analysis is widely used. Whenever a new DNA sequence is obtained, one hopes to show through similarity analysis that it is similar to some known sequences. If it is homologous to known sequences, this greatly reduces the effort of re-determining the function of the new...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/16, G16B30/10, G16B45/00, G16B50/00
CPC: G16B30/10, G16B45/00, G16B50/00, G06F17/16, G16B40/30
Inventors: 董祥军, 芦月
Owner: 山东元竞信息科技有限公司