Remote protein homology detection and fold recognition method based on Top-n-gram

A Top-n-gram-based recognition method applied in the field of protein remote homology detection and fold recognition. It addresses two shortcomings of existing methods: the binary spectrum offers no optimal threshold, and it cannot distinguish differences in the frequency of amino acid occurrence. The method improves prediction performance and removes noise.

Inactive Publication Date: 2011-05-04
HARBIN INST OF TECH

AI Technical Summary

Problems solved by technology

[0004] The present invention aims to solve two problems of existing protein remote homology detection and fold recognition methods: the binary spectrum offers no optimal threshold, and it cannot distinguish differences in the frequency of amino acid occurrence. To this end, the invention provides a Top-n-gram-based protein remote homology detection and fold recognition method.

Examples

Specific Embodiment 1

[0044] Specific Embodiment 1: This embodiment is described with reference to Figure 1. The Top-n-gram-based protein remote homology detection and fold recognition method comprises the following steps:

[0045] Step 1: Run PSI-BLAST with the test protein sequence as input to obtain a multiple sequence alignment, and calculate the pseudocount $g_i$ of amino acid i:

[0046] $g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$

[0047] where $f_j$ is the observed frequency of amino acid j, $p_j$ is the background frequency of amino acid j, and $q_{ij}$ is the substitution-matrix entry for the pair of amino acids i and j;
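
As a minimal sketch of the pseudocount formula in Step 1, the Python below evaluates g_i for a single alignment column. The function name `pseudocounts`, the three-letter toy alphabet, and every number are illustrative assumptions rather than values from the patent; a real run would use the 20 standard amino acids with frequencies and substitution-matrix values obtained from PSI-BLAST.

```python
# Illustrative sketch only: toy alphabet and made-up numbers (assumptions).

def pseudocounts(observed, background, q):
    """Return g[i] = sum over j of observed[j] * q[i][j] / background[j]."""
    residues = list(background)
    return {
        i: sum(observed[j] * q[i][j] / background[j] for j in residues)
        for i in residues
    }

# Toy 3-letter alphabet standing in for the 20 standard amino acids.
background = {"A": 0.5, "C": 0.25, "D": 0.25}   # p_j
observed = {"A": 0.5, "C": 0.5, "D": 0.0}       # f_j for one alignment column
q = {                                           # q_ij (hypothetical values)
    "A": {"A": 0.40, "C": 0.05, "D": 0.05},
    "C": {"A": 0.05, "C": 0.15, "D": 0.05},
    "D": {"A": 0.05, "C": 0.05, "D": 0.15},
}

g = pseudocounts(observed, background, q)
for residue, value in g.items():
    print(residue, round(value, 4))
```

In this toy setup each row of `q` sums to the background frequency of its residue, so the resulting pseudocounts sum to 1 over the alphabet.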

[0048] Step 2: Generate a frequency spectrum according to the pseudocount of amino acid...

Specific Embodiment 2

[0059] Specific Embodiment 2: This embodiment further describes Step 2 of the Top-n-gram-based protein remote homology detection and fold recognition method of Specific Embodiment 1. The frequency spectrum is generated as follows:

[0060] Calculate the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position in the test protein sequence:

[0061] $Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$

[0062] where β is a free parameter, set to 10 by default in PSI-BLAST, and α is the number of distinct amino acid types appearing in a column of the multiple sequence alignment, minus 1;

[0063] Express the frequency spectrum as a matrix M of dimension L×N, where L is the length of the protein sequence and N is the constant 20, the number of standard amino acids. The elements of M are the target frequencies $Q_i$.

[0064] The target frequency $Q_i$ indicates the occurrence frequency of a certain amino acid at a specific position in a pro...
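
The target-frequency formula and the matrix M described above can be sketched as follows. This is a hedged illustration, not the patent's implementation: the function names, the three-letter toy alphabet, and the input numbers are assumptions, and α is derived here by counting the residues with non-zero observed frequency in a column, which is one plausible reading of "amino acid types appearing in a column".

```python
# Illustrative sketch only: names and numbers are assumptions.

def target_frequencies(f, g, alpha, beta=10.0):
    """Q_i = (alpha * f_i + beta * g_i) / (alpha + beta) for one column."""
    return [(alpha * fi + beta * gi) / (alpha + beta) for fi, gi in zip(f, g)]

def frequency_matrix(columns, beta=10.0):
    """Build M with one row per sequence position.

    `columns` is a list of (f, g) pairs: observed frequencies and
    pseudocounts for each position. alpha is taken as the number of
    residues with non-zero observed frequency, minus 1 (assumption).
    """
    rows = []
    for f, g in columns:
        alpha = sum(1 for fi in f if fi > 0) - 1
        rows.append(target_frequencies(f, g, alpha, beta))
    return rows

# Two positions over a toy 3-letter alphabet (a real M would be L x 20).
columns = [
    ([0.5, 0.5, 0.0], [0.5, 0.35, 0.15]),  # two residue types seen: alpha = 1
    ([1.0, 0.0, 0.0], [0.8, 0.10, 0.10]),  # one residue type seen: alpha = 0
]
M = frequency_matrix(columns)
for row in M:
    print([round(x, 4) for x in row])
```

When only one residue type is observed in a column, α is 0 and the row reduces to the pseudocounts alone.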

Specific Embodiment 3

[0065] Specific Embodiment 3: This embodiment further describes Step 3 of the Top-n-gram-based protein remote homology detection and fold recognition method of Specific Embodiment 1. The frequency spectrum is converted into Top-n-grams as follows:

[0066] Sort the 20 standard amino acids in each row of the frequency spectrum in descending order of target frequency, then combine the n amino acids with the highest target frequencies, in that order, into a Top-n-gram. The position of an amino acid within a Top-n-gram thus reflects its relative frequency. A total of L Top-n-grams are obtained, where n is an integer from 1 to 5 inclusive.

[0067] In principle, n can be any integer from 1 to 20 inclusive, but in practice, an integer of greater than or equal to 1 and less than or eq...
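
The conversion described in this embodiment can be sketched in Python as below. The column order of the matrix (one-letter codes in alphabetical order), the helper `toy_row`, and the tie-breaking rule (ties keep alphabet order because Python's sort is stable) are assumptions made for illustration; the source does not specify them.

```python
# Illustrative sketch only: column order and tie-breaking are assumptions.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order of the matrix M

def top_n_grams(M, n=2, alphabet=AMINO_ACIDS):
    """Return one Top-n-gram per row of M (one per sequence position)."""
    if not 1 <= n <= len(alphabet):
        raise ValueError("n must be between 1 and the alphabet size")
    grams = []
    for row in M:
        # Indices sorted by descending target frequency; sort is stable.
        ranked = sorted(range(len(alphabet)), key=lambda k: -row[k])
        grams.append("".join(alphabet[k] for k in ranked[:n]))
    return grams

def toy_row(**freqs):
    """Hypothetical helper: a 20-element row with the given frequencies."""
    return [freqs.get(a, 0.0) for a in AMINO_ACIDS]

# Two made-up sequence positions.
M = [
    toy_row(L=0.5, V=0.3, I=0.2),
    toy_row(G=0.6, A=0.4),
]
grams = top_n_grams(M, n=2)
print(grams)  # ['LV', 'GA']
```

Because the letters are ordered by frequency, "LV" and "VL" are different Top-2-grams, which is how the representation keeps frequency rank information that a thresholded binary spectrum discards.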

Abstract

The invention discloses a remote protein homology detection and fold recognition method based on Top-n-grams. The method solves the problem that, in prior protein remote homology detection and fold recognition methods, the binary spectrum offers no optimal threshold and cannot distinguish differences in the frequency of amino acid occurrence. The method comprises the following steps: 1, running PSI-BLAST with the test protein sequence as input to obtain a multiple sequence alignment, and calculating the pseudocount of amino acid i; 2, generating a frequency spectrum; 3, converting the frequency spectrum into Top-n-grams; 4, obtaining the latent semantic representation vector corresponding to the test protein sequence; 5, inputting that latent semantic representation vector into an SVM classifier for classification and obtaining the prediction result. The Top-n-gram-based protein remote homology detection and fold recognition method is used in the field of protein homology detection and fold recognition.

Description

Technical field

[0001] The invention relates to a protein remote homology detection and fold recognition method.

Background art

[0002] At present, remote protein homology detection methods at home and abroad fall roughly into three types: dynamic programming algorithms, generative models, and discriminative models. Discriminative models give the best prediction performance in this field, and methods based on the Support Vector Machine (SVM) are currently the most commonly used. The most effective way to improve the prediction performance of an SVM-based method is to find an appropriate protein representation and then vectorize the protein sequence.

[0003] The multiple sequence alignment output by running PSI-BLAST (Position-Specific Iterated BLAST) contains a large amount of evolutionary information. Because frequency spectra contain more information than protein sequences, it is of great...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F19/16, G06F19/18, G06F19/22
Inventors: 林磊, 刘滨, 孙承杰, 王晓龙, 刘秉权, 刘远超
Owner: HARBIN INST OF TECH