Remote protein homology detection and fold recognition method based on Top-n-gram
A Top-n-gram-based identification method, applied in the field of remote protein homology detection and fold recognition, that addresses two problems of the binary spectrum: no optimal threshold value can be found for it, and it cannot distinguish differences in the frequency of amino acid occurrence.
Examples
Specific Embodiment 1
[0044] Specific Embodiment 1: This embodiment is described with reference to Figure 1. A Top-n-gram-based method for remote protein homology detection and fold recognition comprises the following steps:
[0045] Step 1: Run PSI-BLAST on the input test protein sequence to produce a multiple sequence alignment, and calculate the pseudocount $g_i$ of amino acid $i$:
[0046] $$g_i = \sum_{j=1}^{20} f_j \cdot \frac{q_{ij}}{p_j}$$
[0047] where $f_j$ is the observed frequency of amino acid $j$, $p_j$ is the background frequency of amino acid $j$, and $q_{ij}$ is the substitution-matrix entry corresponding to amino acids $i$ and $j$;
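The pseudocount formula in Step 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `pseudo_counts` is hypothetical, the inputs are plain lists indexed in the same amino acid order, and the toy example uses a three-letter alphabet with made-up values instead of the 20 standard amino acids.

```python
from typing import List, Sequence


def pseudo_counts(
    f: Sequence[float],
    p: Sequence[float],
    q: Sequence[Sequence[float]],
) -> List[float]:
    """Return g, where g[i] = sum over j of f[j] * q[i][j] / p[j].

    f: observed frequencies for one alignment column
    p: background frequencies (must be non-zero)
    q: square substitution-matrix values, q[i][j] for amino acids i and j
    (hypothetical helper; all three inputs share one amino acid ordering)
    """
    n = len(f)
    if len(p) != n or len(q) != n or any(len(row) != n for row in q):
        raise ValueError("f, p and q must describe the same alphabet size")
    return [
        sum(f[j] * q[i][j] / p[j] for j in range(n))
        for i in range(n)
    ]


# Toy example with a 3-letter alphabet (illustrative values only).
f_toy = [0.5, 0.5, 0.0]
p_toy = [0.25, 0.25, 0.5]
q_toy = [
    [0.10, 0.05, 0.05],
    [0.05, 0.10, 0.05],
    [0.05, 0.05, 0.20],
]
print(pseudo_counts(f_toy, p_toy, q_toy))
```

For a real profile the same function would be called once per alignment column with 20-element inputs.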
[0048] Step 2: generate a frequency spectrum accordin...
Specific Embodiment 2
[0059] Specific Embodiment 2: This embodiment further describes Step 2 of the Top-n-gram-based method for remote protein homology detection and fold recognition described in Specific Embodiment 1. The frequency spectrum is generated as follows:
[0060] Calculate the target frequency $Q_i$ of each of the 20 standard amino acids at each amino acid position in the test protein sequence:
[0061] $$Q_i = \frac{\alpha f_i + \beta g_i}{\alpha + \beta}$$
[0062] where $\beta$ is a free parameter set to 10, its default value in PSI-BLAST, and $\alpha$ is the number of distinct amino acid types appearing in a column of the multiple sequence alignment, minus 1;
[0063] Represent the frequency spectrum as a matrix $M_1$ of dimension $L_1 \times N_1$, where $L_1$ is the length of the protein sequence and $N_1$ is the constant 20, the number of standard amino acids. The elements of $M_1$ are the target frequencies $Q_i$.
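As a rough sketch of this step, the following Python computes alpha for one alignment column and blends observed frequencies with pseudocounts into target frequencies. The function names `alpha_for_column` and `target_frequencies` are hypothetical, and treating `-` as a gap character that is not counted as an amino acid type is an assumption not stated in the text.

```python
from typing import List, Sequence


def alpha_for_column(column: str, gap: str = "-") -> int:
    """Number of distinct amino acid types in an alignment column, minus 1.

    Assumption: gap characters are ignored when counting types.
    """
    return len(set(column) - {gap}) - 1


def target_frequencies(
    f: Sequence[float],
    g: Sequence[float],
    alpha: float,
    beta: float = 10.0,  # value given in the text as the PSI-BLAST default
) -> List[float]:
    """Return Q, where Q[i] = (alpha * f[i] + beta * g[i]) / (alpha + beta)."""
    if len(f) != len(g):
        raise ValueError("f and g must have the same length")
    return [(alpha * fi + beta * gi) / (alpha + beta) for fi, gi in zip(f, g)]


# Toy column over a 3-letter alphabet (illustrative values only).
alpha = alpha_for_column("AAGA-")          # two distinct types, so alpha is 1
row = target_frequencies([0.5, 0.5, 0.0], [0.3, 0.3, 0.2], alpha)
print(alpha, row)
```

Stacking one such row per sequence position, each with 20 entries, gives the L1 × 20 matrix M1 described above.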
[0064] The target frequency $Q_i$ indicates the occurrence frequency of a certain amino acid at a specific posi...
Specific Embodiment 3
[0065] Specific Embodiment 3: This embodiment further describes Step 3 of the Top-n-gram-based method for remote protein homology detection and fold recognition described in Specific Embodiment 1. The frequency spectrum is converted into Top-n-grams as follows:
[0066] Sort the 20 standard amino acids in each row of the frequency spectrum in descending order of target frequency, then combine the n amino acids with the highest target frequencies, in that order, into a Top-n-gram. Within each Top-n-gram, the position of an amino acid reflects its relative frequency. This yields a total of L Top-n-grams, one per row, where n is an integer greater than or equal to 1 and less than or equal to 5.
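The conversion described above can be sketched as follows. This is an illustrative sketch only: the function name `top_n_grams`, the alphabetical column order in `AMINO_ACIDS`, and the tie-breaking rule (equal frequencies keep column order, because Python's sort is stable) are assumptions that the text does not specify.

```python
from typing import List, Sequence

# Assumed column order of the frequency spectrum: one-letter codes, alphabetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_grams(
    profile: Sequence[Sequence[float]],
    n: int = 2,
    alphabet: str = AMINO_ACIDS,
) -> List[str]:
    """Convert each row of a frequency matrix into its Top-n-gram.

    Each row is sorted by descending frequency and the first n letters
    are joined, so a letter's position encodes its frequency rank.
    """
    if not 1 <= n <= len(alphabet):
        raise ValueError("n must be between 1 and the alphabet size")
    grams = []
    for row in profile:
        if len(row) != len(alphabet):
            raise ValueError("each row needs one frequency per letter")
        # Stable sort: ties keep the alphabet's column order (an assumption).
        ranked = sorted(range(len(alphabet)), key=lambda k: -row[k])
        grams.append("".join(alphabet[k] for k in ranked[:n]))
    return grams


# Toy example with a 3-letter alphabet and two sequence positions.
toy_profile = [
    [0.2, 0.5, 0.3],
    [0.6, 0.1, 0.3],
]
print(top_n_grams(toy_profile, n=2, alphabet="ACD"))
```

A profile with L rows produces L strings, one Top-n-gram per sequence position.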
[0067] The value of n can be an integer greater than or equal to 1 and less than or equal to 20, but in practice, an integer of greater than or equal to 1 and less than or eq...