Amino acid sequence design method taking given protein main chain structure as target
A sequence design technique for a given main chain (backbone) structure, applied in the field of protein design. It addresses the high number of all-atom degrees of freedom and the resulting difficulty of optimization and verification, and it achieves rapid design with a simplified optimization process.
Examples
Embodiment 1
[0086] Embodiment 1 Encoder-decoder network construction
[0087] The encoder part of the ABACUS-R encoder-decoder network is a Transformer. Its input comprises the side chain types and 3D backbone structure information of all other residues that are structurally close to the central residue. Note that neither the side chain type of the central residue nor the side chain conformations of the neighboring residues are used as input. The encoder output constitutes the desired vector representation, which is decoded into various attributes of the central residue. The encoder-decoder network is trained on a selected set of PDB structures. Figure 1 shows a self-consistent iteration for designing a full sequence for a given target backbone structure. The method can start from a randomly selected initial sequence, then apply the pretrained encoder-decoder to its local environment (which depends on the side chain types of the surrounding residues in the current sequence) to one ...
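The self-consistent iteration described above can be illustrated with a minimal Python sketch. It is not the ABACUS-R implementation: `toy_predict` is a hypothetical stand-in for the pretrained encoder-decoder, `neighbor_lists` is an assumed precomputed list of structurally close residues for each position, and updating only a random fraction of the inconsistent positions per pass (`update_fraction`) is an assumption made here to damp oscillation.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_predict(center, neighbors, sequence):
    """Hypothetical placeholder for the trained network.

    Returns a dict of probabilities over residue types for position `center`,
    computed only from the types of its neighbors. The central residue's own
    type is never read, mirroring the input restriction in the text.
    """
    scores = []
    for k in range(len(AMINO_ACIDS)):
        s = 0.0
        for j in neighbors:
            # Arbitrary deterministic "compatibility" term, for illustration only.
            s += math.cos(k + AMINO_ACIDS.index(sequence[j]) + center)
        scores.append(s)
    z = sum(math.exp(s) for s in scores)
    return {aa: math.exp(s) / z for aa, s in zip(AMINO_ACIDS, scores)}

def design_sequence(neighbor_lists, predict, max_iters=50,
                    update_fraction=0.5, seed=0):
    """Iterate until every residue is the top prediction given its neighbors."""
    rng = random.Random(seed)
    n = len(neighbor_lists)
    # Start from a randomly selected initial sequence.
    seq = [rng.choice(AMINO_ACIDS) for _ in range(n)]
    for _ in range(max_iters):
        # Most probable type at each position under the current sequence.
        best = []
        for i in range(n):
            probs = predict(i, neighbor_lists[i], seq)
            best.append(max(probs, key=probs.get))
        changed = [i for i in range(n) if best[i] != seq[i]]
        if not changed:
            return "".join(seq), True  # self-consistent sequence reached
        # Assumption: update only part of the inconsistent positions per pass.
        k = max(1, int(update_fraction * len(changed)))
        for i in rng.sample(changed, k):
            seq[i] = best[i]
    return "".join(seq), False  # iteration budget exhausted
```

A call such as `design_sequence(neighbor_lists, toy_predict)` returns the designed sequence together with a flag indicating whether self-consistency was reached within `max_iters` passes.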
Embodiment 2
[0088] Embodiment 2 Training of the encoder-decoder network
[0089] A set of non-redundant PDB structures was used to train the encoder-decoder networks. We split the selected PDB structures into training and test sets in two different ways, thereby learning two sets of network parameters. The first set of network parameters (Model_eval) was obtained by using about 95% of the protein structures for training and the remaining roughly 5% for testing, where the structures used for testing belong to single-domain topology types (CATH 4.2 classification). With this selection of test proteins, none of the test structures belongs to the same CATH topology as any training structure. Model_eval can therefore be used for unbiased computational evaluation. The second set of network parameters, Model_final, was learned by randomly splitting the protein structures into roughly 95% for training and 5% for testing, disregarding their CATH structural classification. Model_final Sequences that...
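The two ways of splitting the data can be sketched as follows. This is a simplified illustration under stated assumptions: each entry is a hypothetical `(structure_id, cath_topology)` pair, whole topologies are held out until roughly the requested test fraction is reached, and the single-domain restriction on test structures mentioned in the text is not modeled.

```python
import random

def split_by_topology(entries, test_fraction=0.05, seed=0):
    """Hold out whole CATH topologies so train and test share none.

    entries: list of (structure_id, cath_topology) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    topologies = sorted({topo for _, topo in entries})
    rng.shuffle(topologies)
    target = test_fraction * len(entries)
    test_topologies, n_test = set(), 0
    for topo in topologies:
        if n_test >= target:
            break
        test_topologies.add(topo)
        n_test += sum(1 for _, t in entries if t == topo)
    train = [e for e in entries if e[1] not in test_topologies]
    test = [e for e in entries if e[1] in test_topologies]
    return train, test

def split_randomly(entries, test_fraction=0.05, seed=0):
    """Random split that disregards the structural classification."""
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]
```

Because whole topologies are held out, the first split can overshoot the requested fraction when a topology has many members; the second split hits the fraction closely but lets train and test share topologies.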
Embodiment 3
[0093] Embodiment 3 Convergence of sequence design iterations
[0094] The present invention applies Model_eval and self-consistent iteration to design full sequences for 100 target structures taken from the Model_eval test set. These target structures cover three main CATH categories. For each target structure, 10 sequences were designed in 10 separate runs, each starting from a different random initial sequence. Since the iterative method is in effect a greedy algorithm that maximizes the (predicted) probability of the side chain types, we monitored the negative logarithm of the probability of the designed side chain types (the -logP value) over the course of each iterative run. The average per-residue -logP value decreases and converges to a plateau. At the same time, the side chain types of most residues converge toward the corresponding types in the final sequence. For all target structures, iterative runs can produce self-consistent sequenc...
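The convergence monitoring described above can be sketched in a few lines of Python. The function names, the window size, and the tolerance used to call a plateau are all assumptions for illustration; the source does not state a numerical convergence criterion.

```python
import math

def mean_neg_log_p(probabilities):
    """Average per-residue -logP for the designed side chain types.

    probabilities: predicted probability of the chosen type at each residue.
    """
    return -sum(math.log(p) for p in probabilities) / len(probabilities)

def has_plateaued(trace, window=3, tol=1e-3):
    """Assumed criterion: the last `window` + 1 values differ by less than `tol`."""
    if len(trace) < window + 1:
        return False
    recent = trace[-(window + 1):]
    return max(recent) - min(recent) < tol
```

In use, one would append `mean_neg_log_p(...)` to a list after each iteration and stop, or simply record the iteration count, once `has_plateaued` returns true.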


