Monte carlo optimization of protein thermostability screening method incorporating evolutionary information

By integrating evolutionary information and Monte Carlo optimization methods, combined with molecular dynamics simulations, the problems of limited information and uncertainty in combinatorial mutations in protein thermal stability modification were solved, achieving efficient and accurate protein thermal stability screening.

CN122050499BActive Publication Date: 2026-06-26青岛奔月生物技术有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
青岛奔月生物技术有限公司
Filing Date
2026-04-16
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies for modifying protein thermal stability suffer from problems such as limited information dimensions, uncertain combined mutation effects, and the inability of static calculations to reflect dynamic behavior, resulting in low design efficiency, high costs, and poor results.

Method used

The Monte Carlo optimization method, which integrates evolutionary information, screens candidate mutation sites through multiple sequence alignment, constructs a mutation combination space, optimizes the combination using the Monte Carlo search algorithm, and verifies structural stability by combining molecular dynamics simulations, thereby achieving automated screening.

Benefits of technology

It significantly improves the efficiency and accuracy of protein thermostability screening, reduces costs and time, and ensures that the final mutants have actual thermostability improvements.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122050499B_ABST
    Figure CN122050499B_ABST
Patent Text Reader

Abstract

The present application relates to the technical field of protein engineering and computational biology, and particularly relates to a fusion evolutionary information Monte Carlo optimization protein thermostability screening method, screening of candidate mutation sites with potential thermostability improvement effect, and determining the allowed mutation set of each candidate mutation site; based on the determined candidate mutation site, a constrained mutation combination space is constructed, and a mutation combination set is generated through a reproducible random sampling strategy; energy scores of each mutation combination are obtained, a Monte Carlo search algorithm is used to iteratively optimize the candidate mutation combination, the optimal candidate mutation combination is screened, and the optimal candidate mutant is determined; molecular dynamics simulation is performed on the optimal candidate mutant, and a protein mutant with actual thermostability improvement is verified and determined through a structure stability index, and the screening efficiency is significantly improved.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the fields of protein engineering and computational biology, specifically to a Monte Carlo method for optimizing protein thermal stability screening by incorporating evolutionary information. Background Technology

[0002] Protein thermal stability is a crucial performance indicator in protein engineering, industrial catalysis, and biomedicine. Efficiently and accurately obtaining mutants with enhanced thermal stability has been a key research focus in this field. Traditional and existing computationally assisted protein thermal stability design methods still have many shortcomings, as detailed below:

[0003] 1. Over-reliance on a single information source leads to low design efficiency;

[0004] Existing methods mostly rely solely on sequence evolution information (such as multiple sequence alignment (MSA) conservation analysis) or structural information (such as B-factor residue flexibility analysis), resulting in a limited information dimension. These methods can only locate potential key residues and cannot directly determine the optimal amino acid substitution type. They still require extensive saturation mutation experiments for verification, making them semi-rational designs with low overall efficiency and high degree of uncertainty.

[0005] 2. The improvement in stability due to single-point mutation is limited;

[0006] Single-point mutations based on ΔΔG or unfolding energies typically have limited effect on improving protein thermal stability, making it difficult to meet the high-temperature tolerance requirements of industrial applications. Therefore, combined mutation strategies are generally required in practical modifications, but efficient combined optimization methods are lacking.

[0007] 3. Combinatorial mutations exhibit epistatic effects, with significant uncertainty;

[0008] After combining multiple positive single-point mutations, an epistatic effect often occurs, causing the melting temperature (Tm) of the combined mutant to be lower than that of the wild type. To overcome this problem, current techniques have to revert to enumeration-based combination experiments, which results in an exponential increase in workload, is time-consuming, costly, and has a low success rate.

[0009] 4. Static calculations cannot reflect dynamic behavior, resulting in insufficient prediction accuracy;

[0010] Existing energy calculations such as ΔΔG are all based on static structures and cannot reflect the true dynamic conformational evolution of proteins in solution environments. However, the thermal stability of proteins is essentially the stability of their dynamic structures. Static predictions are insufficient to accurately assess the true effects of combinatorial mutations and are prone to the contradiction of "theoretically optimal energy, but actually dynamically unstable."

[0011] In summary, existing technologies lack an automated protein thermal stability screening method that integrates multi-source information, can efficiently optimize combined mutations, and takes into account both static energy and dynamic structure verification, making it difficult to achieve efficient, accurate, and low-cost thermal stability modification. Summary of the Invention

[0012] The technical problem to be solved by this invention is to overcome the shortcomings of the prior art and provide a Monte Carlo optimized protein thermal stability screening method that incorporates evolutionary information.

[0013] This invention is achieved through the following technical solution: a Monte Carlo optimization method for screening protein thermostability by incorporating evolutionary information, comprising the following steps:

[0014] S1. Screen candidate mutation sites with potential thermal stability enhancement effects, and determine the allowable mutation set for each candidate mutation site;

[0015] S2. Based on the candidate mutation sites determined in S1, construct a constrained mutation combination space, and generate a set of candidate mutation combinations through a reproducible random sampling strategy.

[0016] S3. Obtain the energy score of each candidate mutation combination, and use the Monte Carlo search algorithm to iteratively optimize the candidate mutation combinations and select the optimal candidate mutation combination set.

[0017] S4. Perform molecular dynamics simulations on the mutants of each candidate mutation combination in the optimal candidate mutation combination set, and verify and determine the protein mutants with improved actual thermal stability through structural stability indices.

[0018] S1 includes the following sub-steps:

[0019] S1-1. Perform multiple sequence alignment (MSA) on the target protein and its homologous sequences, calculate the amino acid distribution probability at each site and obtain the information entropy H(i) of the site, construct a mutation tendency scoring function S(i,a) for mutation tolerance and mutation evolution preference at fusion sites, and screen sites with information entropy H(i) greater than the preset information entropy threshold and scores greater than the preset score threshold as preliminary candidate mutation sites.

[0020] S1-2. Calculate the folding free energy difference of single-point mutations for the initial screening candidate mutation sites in S1-1, and select sites with a free energy difference of less than 0 as the final candidate mutation sites.

[0021] The formula for calculating the information entropy H(i) of the site in S1-1 is as follows:

[0022] ;

[0023] in, This represents the probability of amino acid a being observed at the i-th site in a protein sequence, where a represents the amino acid type.

[0024] The expression for the mutation propensity scoring function S(i,a) is as follows:

[0025] ;

[0026] in, The wild-type amino acid is represented by α and β, which are weighting parameters, 0≤α≤1, 0≤β≤1, and α+β=1.

[0027] S2 includes the following sub-steps:

[0028] S2-1. Let the set of candidate mutation sites be P = {p1, p2, ..., p...} i , ..., p N}, where N is the total number of candidate mutation sites, p i For the i-th candidate mutation site, any mutation combination C = {p} i1 p i2 , ..., p ik} represents a subset of k candidate mutation sites selected from set P;

[0029] S2-2. Introduce a deterministic random seed s and initialize the random number generator RNG=Init (s) to achieve the reproducibility of the sampling process;

[0030] S2-3. Select the number of candidate mutation sites k using a random number generator. Perform sampling without replacement in the candidate mutation site set P to generate a single candidate mutation combination C. Repeat the above sampling without replacement operation to generate a candidate mutation combination set containing M candidate mutation combinations. This set is denoted as {C1, C2, ..., C...}. M}

[0031] In S2-1, the value range of k is 2≤k≤6.

[0032] S3 includes the following sub-steps:

[0033] S3-1. Calculate the protein folding free energy corresponding to each candidate mutation combination in the candidate mutation combination set, and use it as the energy score of each candidate mutation combination.

[0034] S3-2. Randomly select a candidate mutation combination from the candidate mutation combination set as the initial candidate mutation combination, and calculate the total energy of the initial candidate mutation combination;

[0035] S3-3. Randomly select several candidate mutation sites in the current candidate mutation combination for adjustment, generate a new candidate mutation combination, and calculate the total energy of the new candidate mutation combination.

[0036] S3-4. Determine whether to accept a new candidate mutation combination as the current candidate mutation combination. The acceptance rules are as follows: If the total energy of the new candidate mutation combination is less than the total energy of the current candidate mutation combination, then accept it directly; otherwise, accept it with the preset acceptance probability.

[0037] S3-5. Repeat S3-3 to S3-4 for multiple iterations (forming a Monte Carlo cycle). After each iteration, record the candidate mutation combination with the lowest total energy and its total energy.

[0038] S3-6. When the iteration reaches the preset number of iterations, stop the iteration. Sort the candidate mutation combinations of each iteration in the record from low to high according to the total energy. Select the candidate mutation combinations with the lowest preset number or preset proportion of total energy as the optimal candidate mutation combination set.

[0039] The total energy of the initial candidate mutation combination and the total energy of the new candidate mutation combination are equal to the overall folding free energy of the protein corresponding to the candidate mutation combination.

[0040] In S3-4, the preset acceptance probability is adjusted by the temperature parameter kT, and the formula for calculating the preset acceptance probability is:

[0041] Preset acceptance probability = ;

[0042] Among them, E current This represents the total energy of the current candidate mutation combinations. This represents the total energy of the new candidate mutation combinations.

[0043] S4 includes the following sub-steps:

[0044] S4-1. Based on Newton's equation of motion, molecular dynamics simulations were performed on the mutants of each candidate mutation combination in the optimal candidate mutation combination set to obtain the conformational evolution trajectory of the protein.

[0045] S4-2. Calculate the root mean square offset (RMSD) of the mutant and wild-type protein for each candidate mutation combination in S4-1. Compare the RMSD values ​​of the mutant and wild-type. If the RMSD value of the mutant is less than the RMSD value of the wild-type protein, then the mutant of the candidate mutation combination is the final protein mutant with improved thermal stability.

[0046] The formula for calculating the root mean square offset (RMSD) is:

[0047] RMSD(t) = ;

[0048] in, Indicates the initial structure coordinates. Let N represent the spatial coordinates of the i-th atom at time t, where N represents the total number of atoms and t represents time.

[0049] Compared with the prior art, the beneficial effects of the present invention are:

[0050] This application integrates sequence evolution information and structure-energy information for a two-layer screening process, overcoming the limitations of a single information source, thereby obtaining highly reliable candidate mutation sites and optimal amino acid types. This eliminates the need for numerous saturation mutation experiments and significantly improves screening efficiency.

[0051] This application overcomes the bottleneck of limited stability improvement of single-point mutation by constructing a mutation combinatorial space and performing multi-mutation synergistic optimization. It can achieve a thermal stability improvement effect that is far superior to that of single-point mutation, and better meets the needs of practical applications.

[0052] This application employs the Monte Carlo algorithm to perform efficient iterative optimization of the combinatorial mutation space, avoiding the computational explosion caused by full enumeration. At the same time, it weakens the influence of the superposition effect through global energy optimal screening, reduces invalid combinatorial experiments, and significantly reduces costs and cycle time.

[0053] This application first obtains the statically optimal mutant combination through Monte Carlo optimization, then obtains the conformational evolution trajectory through molecular dynamics simulation, and verifies the dynamic structural stability using RMSD as an indicator, reflecting the dynamic behavior of the protein in solution, solving the problem of static optimization and dynamic instability, and ensuring that the final mutant has a practically implementable improvement in thermal stability.

[0054] This application achieves reproducible sampling through deterministic random seeds. The entire process, from site selection, combination generation, Monte Carlo optimization to dynamic verification, forms a fully automated closed loop, requiring no repeated manual intervention. It is highly versatile and can be widely applied to the rational modification of the thermal stability of various proteins. Attached Figure Description

[0055] Figure 1 This is a flowchart of the method used in this application;

[0056] Figure 2 This is a schematic diagram of the structural stability of the wild-type protein and the optimal candidate mutant in Example 2. Detailed Implementation

[0057] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of the present invention.

[0058] Example 1

[0059] This embodiment proposes a Monte Carlo optimization method for screening protein thermal stability that incorporates evolutionary information, comprising the following steps:

[0060] S1. Screening candidate mutation sites: By integrating protein sequence evolution information and energy prediction methods, candidate mutation sites with potential thermal stability enhancement effects are screened, and the allowable mutation set for each candidate mutation site is determined.

[0061] S1 includes the following sub-steps:

[0062] S1-1. Perform multiple sequence alignment (MSA) on the target protein and its homologous sequences, calculate the amino acid distribution probability at each site and obtain the information entropy H(i) of the site, construct a mutation tendency scoring function S(i,a) for mutation tolerance and mutation evolution preference at fusion sites, and screen sites with information entropy H(i) greater than the preset information entropy threshold and scores greater than the preset score threshold as preliminary candidate mutation sites.

[0063] By analyzing the variation patterns of protein sequences during evolution, potential mutagenic sites can be identified from the perspectives of information theory and statistical physics. The basic principle is that proteins are constrained by structural stability and function during evolution, leading to significant differences in the tolerance of different sites for amino acid substitutions.

[0064] Protein sequences can be viewed as a set of statistical samples formed under evolutionary pressure. During evolution, key structural or functional sites are strongly constrained and their distribution is relatively concentrated (low information entropy), while non-key sites can undergo diverse mutations and their distribution is relatively dispersed (high information entropy). Therefore, the information entropy H(i) of a site can be used to characterize its mutation tolerance, as expressed below:

[0065] ;

[0066] in, H(i) represents the probability of observing amino acid a at the i-th site in the protein sequence, where a represents the amino acid type. The larger H(i) is, the less conserved the site is, and the more suitable it is for mutation; the smaller H(i) is, the more conserved the site is, and mutation should be avoided.

[0067] By performing multiple sequence alignment on the homologous sequences of the target protein, the site-amino acid frequency matrix was obtained as follows:

[0068] ;

[0069] in, The expression represents the number of times amino acid 'a' appears at the i-th site, N represents the total number of sequences, and q(a) represents the background amino acid frequency. This represents a pseudo-count parameter.

[0070] λ is used to avoid probability estimation bias caused by insufficient samples or zero frequency in multiple sequence alignment statistics. In multiple sequence alignment, when certain amino acids are not observed at specific sites, directly using frequency statistics will result in pi(a)=0, thus affecting the stability of subsequent information entropy calculation and mutation scoring. Therefore, this application introduces λ to incorporate q(a) into probability estimation, thereby smoothing the observation frequency and improving statistical robustness. In one embodiment, λ is a preset constant, preferably ranging from 1 to 20, and more preferably λ=5. λ is used to adjust the observation frequency. The weighted relationship between λ and the background frequency q(a) is such that when λ is small, Closer to the observation frequency; when λ is large, It is closer to the background distribution. The value of λ does not change the basic form of the probability distribution, but it affects the smoothness of the frequency estimation: smaller λ values ​​rely more on the observed data, while larger λ values ​​enhance the influence of the background distribution, thereby improving the estimation stability in the case of sparse data.

[0071] This application does not only use the conservatism judgment, but also constructs a mutation tendency scoring function S(i,a), the expression of which is as follows:

[0072] ;

[0073] in, The value represents wild-type amino acids, with α and β as weighting parameters, where 0 ≤ α ≤ 1, 0 ≤ β ≤ 1, and α + β = 1. Preferably, α and β can each be set to 0.5. This score considers both whether the site allows mutation (information entropy) and whether the mutation direction is reasonable (evolutionary preference).

[0074] Therefore, sites whose information entropy H(i) > preset information entropy threshold H are selected. min And the score S(i,a) > the preset score threshold S threshold The site is used as a preliminary candidate mutation site, and the allowed mutation set for each preliminary candidate mutation site is determined. Multiple candidate amino acids can be retained at the same site to form a mutation search space.

[0075] In practical applications, the score values ​​of 20 standard amino acids are used to determine the allowable mutation set, and the initial screening results are output in a structured format as a standard data file. Simultaneously, the wild-type amino acid information and corresponding allowable mutation set for each site are automatically recorded, and the results are output in a structured format as a standard data file (such as CSV) for subsequent mutation screening and combination generation steps. Through this process, the extraction of candidate mutation information from MSA results is automated, eliminating the need for manual site-by-site analysis, thus significantly improving the efficiency and consistency of mutation site screening.

[0076] S1-2. Calculate the folding free energy difference of single-point mutations for the initial screening candidate mutation sites in S1-1, and select sites with a free energy difference of less than 0 as the final candidate mutation sites.

[0077] This step primarily uses energy prediction methods to screen sites that may improve thermal stability. Energy prediction methods mainly assess the impact of single-point mutations in the protein on thermal stability. For each candidate residue in the protein structure, the difference in free energy ΔG between the wild-type (WT) and mutant forms is calculated to determine the potential contribution of the mutation to thermal stability. The calculation formula is as follows:

[0078] ;

[0079] in, This represents the folding free energy of wild-type proteins. This represents the folding free energy after a single point mutation.

[0080] The fold free energy ΔG of any protein can be expressed as the sum of the energies of various interactions and conformations, and its general expression is:

[0081] ;

[0082] in, This indicates the interaction between hydrophobicity and van der Waals forces. Indicates the hydrogen bond energy. This represents the energy of salt bridges or ion interactions. This represents the conformational energy of the main chain and side chains. Represents the solvent effect energy. It represents electrostatic energy.

[0083] Therefore, the folding free energy of wild-type proteins and the folding free energy after a single-point mutation All of these can be calculated using the general expression for the folding free energy ΔG of proteins mentioned above.

[0084] This application selects residues with ΔΔG<0 (increased stability) as potential mutation sites to provide a basis for subsequent combinatorial mutation design.

[0085] S2. Mutation Combination Generation: Based on the candidate mutation sites determined in S1, a constrained mutation combination space is constructed, and a set of candidate mutation combinations is generated through a reproducible random sampling strategy.

[0086] To address the problem of the exponential growth of the multi-site combination mutation space, this application models the mutation combination construction process as a constrained combination sampling problem and designs a random subset generation strategy to achieve efficient search.

[0087] S2 includes the following sub-steps:

[0088] S2-1. Let the set of candidate mutation sites be P = {p1, p2, ..., p...} i , ..., p N}, where N is the total number of candidate mutation sites, p i For the i-th candidate mutation site, any mutation combination C = {p} i1 p i2 , ..., p ik} represents a subset of k candidate mutation sites selected from set P (the number of candidate mutation sites in the mutation combination). The preferred value of k is 2-6.

[0089] S2-2. Introduce a deterministic random seed s and initialize the random number generator RNG=Init (s) to achieve the reproducibility of the sampling process;

[0090] This invention does not employ full combinatorial enumeration, but instead achieves search space compression by constraining the mutation order. Compared to full combinatorial enumeration, this constraint avoids the conformational instability risk of high-order combinations (more than 5 combinatorial mutations); reduces computational complexity, allowing it to be completed with high-throughput computing resources (e.g., 500–1000 combinations); and, consistent with the epistasis principle, low-order combinations are more likely to retain positive synergistic effects.

[0091] Under given constraints, this invention uses a random sampling method to generate mutation combinations and introduces a deterministic random seed s to achieve the reproducibility of the sampling process. s is a preset constant or determined by input parameters, and the same s can ensure consistent generation results.

[0092] Furthermore, in the specific implementation, a random seed 's' of integer type is first set, and then initialized using the initialization function RNG=Init(s). For example, in a Python environment, this can be achieved by calling random.seed(s) or numpy.random.seed(s). After initialization, the random number generator generates uniformly distributed random numbers throughout the entire mutation combination sampling process to control the selection of mutation sites and the combination generation process. The introduction of the random seed 's' transforms the random sampling process from "completely random" to "deterministic pseudo-random," thus ensuring consistent mutation combination results can be obtained through multiple runs under the same 's' condition, guaranteeing the reproducibility of the calculation process and the verifiability of the results. The random seed 's' is an integer parameter that can be determined by a preset fixed value. For example, setting 's' to 42 means initializing with a fixed random seed 's'=42. The random seed 's' does not change the statistical distribution characteristics of the random numbers, but it does change the specific value order of the random number sequence, thereby affecting the specific sampling path of the mutation combination. Different random seeds correspond to different sampling trajectories, but statistically, they do not affect the overall coverage characteristics of the mutation space.

[0093] S2-3. Select the number of candidate mutation sites k using a random number generator. Perform sampling without replacement in the candidate mutation site set P to generate a single candidate mutation combination C. Repeat the above sampling without replacement operation to generate a candidate mutation combination set containing M candidate mutation combinations. This set is denoted as {C1, C2, ..., C...}. M}

[0094] Within the constraints, the number of candidate mutation sites, k, is randomly selected and expressed as: k Uniform(k) min k max This process is controlled by the random number generator RNG. min k max These are the minimum and maximum values ​​selected, respectively.

[0095] The single candidate mutation combination obtained by sampling without replacement from the candidate mutation site set P is represented as follows:

[0096] C=RandomSample(P,k;RNG);

[0097] C represents a single candidate mutation combination.

[0098] S3. Obtain the energy score of each mutation combination, and use the Monte Carlo search algorithm to iteratively optimize the candidate mutation combinations to select the optimal candidate mutation combination set.

[0099] S3 includes the following sub-steps:

[0100] S3-1. Calculate the protein folding free energy ΔG for each candidate mutation combination in the candidate mutation combination set using the PackRotamers and FastRelax algorithms in the Rosetta suite. total This is used as the energy score for each candidate mutation combination.

[0101] Next, this application uses the Monte Carlo (MC) method to iteratively select low-energy combinations based on energy scores, approximating the globally optimal candidate mutation combinations.

[0102] Furthermore, in order to find the combination most likely to improve thermal stability from multiple candidate mutation sites, this invention uses the Monte Carlo method to iteratively optimize the candidate mutation combination, specifically including the contents described in S3-2 to S3-6.

[0103] S3-2. Randomly select a candidate mutation combination from the candidate mutation combination set as the initial candidate mutation combination C0, and calculate the total energy E0 of the initial candidate mutation combination C0.

[0104] S3-3. Randomly select several candidate mutation sites from the current candidate mutation combinations and adjust them to generate a new candidate mutation combination C. new And calculate the total energy E of the new candidate mutation combinations. new .

[0105] The total energy of a candidate mutation combination is equal to the overall folding free energy of the protein corresponding to that candidate mutation combination, i.e., E = ΔG. total (C), where E represents the total energy of the candidate mutation combination, C represents the candidate mutation combination, and ΔG total This represents the overall folding free energy of the protein corresponding to the candidate mutation combination.

[0106] Therefore, the total energy E0 of the initial candidate mutation combination C0 and the new candidate mutation combination C new Total energy E new They are respectively:

[0107] E0=ΔG total (C0);

[0108] E new =ΔG total (C new ).

[0109] S3-4. Determine whether to accept the new candidate mutation combination C according to the Metropolis criteria. new Specifically, if the total energy of a new candidate mutation combination is less than the total energy of the current candidate mutation combination, it is accepted directly; otherwise, it is accepted with a preset acceptance probability.

[0110] That is, if E new <E current Then accept C new As a current candidate mutation combination;

[0111] If E new ≥E current Then accept C with the preset acceptance probability. new As the current candidate mutation combination, E current This represents the total energy of the current candidate mutation combinations.

[0112] The preset acceptance probability is adjusted by the temperature parameter kT, which controls the exploration range of the candidate mutation combination space. The specific calculation formula is as follows:

[0113] Preset acceptance probability = .

[0114] S3-5. Repeat S3-3 to S3-4 for multiple iterations (forming a Monte Carlo cycle). After each iteration, record the candidate mutation combination with the lowest total energy and its total energy.

[0115] S3-6. When the iteration reaches the preset number of iterations, stop the iteration. Sort the candidate mutation combinations of each iteration in the record from low to high according to the total energy. Select the candidate mutation combinations with the lowest preset number or preset proportion of total energy as the optimal candidate mutation combination set.

[0116] S4. Perform molecular dynamics simulations on the mutants of each candidate mutation combination in the optimal candidate mutation combination set, and verify and determine the protein mutants with improved actual thermal stability through structural stability indices.

[0117] To further verify whether all candidate mutation combinations in the optimal candidate mutant set obtained in S3 meet the criteria for improved thermal stability and to avoid false positive mutants, this application conducted molecular dynamics (MD) simulation experiments. Furthermore, S4 includes the following sub-steps:

[0118] S4-1. Based on Newton's equation of motion, molecular dynamics simulations were performed on the mutants of the candidate mutation combinations in the optimal candidate mutation combination set to obtain the conformational evolution trajectory of the protein.

[0119] This application uses molecular dynamics simulations to dynamically assess the structural stability of mutants from each candidate mutation combination in the optimal candidate mutation combination set, thereby overcoming the limitations of static energy function-based assessments. The conformational changes of proteins in solution can be described by classical mechanics, and their motion satisfies Newton's equations of motion:

[0120] ;

[0121] Where, m i r represents the mass of the i-th atom. i Let r represent the coordinates of the i-th atom, and V(r) represent the system potential energy function.

[0122] By integrating the above equation over time, the conformational evolution trajectory of a protein under certain temperature conditions can be obtained.

[0123] S4-2. Calculate the root mean square deviation (RMSD) of the mutant and wild-type proteins for each candidate mutation combination in S4-1, and compare the RMSD values ​​of the mutant and wild-type proteins. If the RMSD value of the mutant is lower than that of the wild-type protein, then the RMSD value of the wild-type protein is lower than that of the mutant protein. mut Less than the RMSD value of wild-type protein WT If the candidate mutation combination is selected, the resulting mutant will be the protein mutant with improved thermal stability.

[0124] RMSD is calculated based on the deviation between the protein's atomic coordinates and the initial structural coordinates. The calculation formula is as follows:

[0125] RMSD(t) = ;

[0126] in, Indicates the initial structure coordinates. Let represent the coordinate of the i-th atom at time t, N represent the total number of atoms, and t represent time.

[0127] The smaller the RMSD, the more stable the protein structure; the smaller the fluctuation of RMSD, the more rigid the protein conformation.

[0128] Example 2

[0129] This application also designs a system for a Monte Carlo optimization protein thermal stability screening method that integrates evolutionary information, including a candidate mutation site screening module, a mutation combination generation module, a Monte Carlo optimization screening module, and a structural stability verification module.

[0130] The candidate mutation site screening module is configured to integrate protein sequence evolution information and energy prediction methods to complete the initial screening and secondary screening of candidate mutation sites, and output the final candidate mutation sites and the allowed mutation set of each candidate mutation site.

[0131] The mutation combination generation module is configured to perform reproducible random sampling based on a deterministic random seed, and generate a set of candidate mutation combinations for candidate mutation sites.

[0132] The Monte Carlo optimization screening module is configured to call the Rosetta PackRotamers+FastRelax algorithm for high-throughput energy calculation and execute the Monte Carlo iterative optimization process to screen out the optimal candidate mutation combination;

[0133] The structural stability verification module is configured to perform molecular dynamics simulations, calculate the RMSD stability index, and verify and screen protein mutants with improved actual thermal stability.

[0134] In addition, the system of this application is also equipped with an automated execution management module, which is configured to realize automatic scheduling of computing tasks, breakpoint resume, automatic result parsing and optimization feedback, and to build a closed-loop optimization screening system based on candidate mutation site screening.

[0135] For automatic task scheduling: The system automatically traverses the candidate mutation combination list, generates calculation tasks one by one, assigns an independent task directory to each candidate mutation combination, and calls the structural design program.

[0136] For resuming calculations from breakpoints: Before executing a calculation task, the system automatically checks if a result file exists. If it exists, the task is skipped; otherwise, the calculation is resubmitted.

[0137] For automatic result parsing: Automatically parse the energy score files output by all computation tasks and extract key indicators such as total energy, conformational energy, and interaction energy;

[0138] For optimization feedback: the energy screening results are fed back to the Monte Carlo optimization module to update the candidate mutation combination generation strategy based on candidate mutation sites, thus constructing a closed-loop optimization screening process.

[0139] Example 3

[0140] Based on Example 1 and / or Example 2, in order to verify the effectiveness of the method described in this application, this example takes a certain glycosyltransferase protein as the research object and carries out the following automated design and screening experiment.

[0141] In this embodiment, the target protein sequence was selected to be 457 amino acids long. Multiple sequence alignment (MSA) was constructed based on the homology sequence database, and the amino acid distribution probability and information entropy of each site were calculated to screen out a number of candidate mutation sites, totaling 1727.

[0142] For example, referring to Table 1, for proline (Pro, P) at position 2 in the wild-type (WT) sequence of the target protein, the amino acid distribution probability at this position was statistically analyzed by multiple sequence alignment, and the information entropy was calculated to determine the tolerance of this position to amino acid substitutions. The results showed that this position allows substitutions with aspartic acid (D), glutamic acid (E), glycine (G), methionine (M), and proline (P), and these amino acids constitute the allowed mutation set for this position. Similarly, asparagine (Asn, N) at position 3 allows substitutions with arginine (R), asparagine (N), glutamine (Q), glutamic acid (E), lysine (K), serine (S), and threonine (T), and these amino acids constitute the allowed mutation set for this position. Threonine (Thr, T) at position 4 allows substitutions with alanine (A), glutamine (Q), histidine (H), methionine (M), serine (S), and threonine (T), and these amino acids constitute the allowed mutation set for this position. The serine at position 5 (Ser, S) can be replaced by alanine (A), glutamic acid (E), glycine (G), serine (S), and threonine (T), and these amino acids constitute the allowed mutation set for this position. Similarly, the proline at position 6 (Pro, P) can be replaced by glutamic acid (E), glycine (G), lysine (K), proline (P), serine (S), and threonine (T), and these amino acids constitute the allowed mutation set for this position.

[0143] Table 1. Reference table for allowed mutations at some sites.

[0144]

[0145] Subsequently, the change in free energy (ΔΔG) before and after the mutation was calculated, and single-point mutations were screened using evolutionary conservation analysis to identify potential beneficial mutations. Specifically, mutations that met evolutionary acceptability and whose free energy change satisfied ΔΔG < -0.5 kcal / mol were screened, resulting in a final set of 32 allowable mutation sites for candidate mutations.

[0146] For example, referring to Table 2, the single-point mutations V34I represent the mutation of valine (Val, V) at position 34 to isoleucine (Ile, I); L378G represents the mutation of leucine (L) at position 378 to glycine (G); A107I represents the mutation of alanine (A) at position 107 to isoleucine (I); S304P represents the mutation of serine (S) at position 304 to proline (P); S388V represents the mutation of serine (S) at position 388 to valine (Valine); and H248M represents the mutation of histidine (H) at position 248 to methionine (M). These mutations were deemed acceptable in multiple sequence alignment analysis, and their free energy changes satisfied ΔΔG < -0.5. The kcal / mol value indicates that this mutation has the potential to improve protein stability.

[0147] Table 2 Reference Table for Differences in Free Energy from Single-Point Mutations

[0148]

[0149] Based on the allowed mutation set of candidate mutation sites, a constrained random sampling strategy is used to generate mutation combinations. In this embodiment, the number of candidate mutation sites k selected from set P is limited to 2-6, and the random seed is fixed at 42 to ensure the reproducibility of the results. A total of 500 mutation combinations are generated.

[0150] Each mutation combination is automatically converted into a structural design input file (resfile) for subsequent calculations.

[0151] In this embodiment, the side chain conformation optimization and main chain and side chain co-optimization are automatically performed on each candidate mutation combination, and the total energy of each candidate mutation combination is calculated.

[0152] Using the wild-type protein as a control, its energy was approximately -1381.2 kcal / mol. Several low-energy candidate mutants were ultimately screened.

[0153] Statistical analysis of the screened mutant combinations revealed that some sites showed a significantly increased frequency in low-energy mutants, indicating that these sites may be key sites affecting protein stability.

[0154] Molecular dynamics (MD) simulations were performed on the screened low-energy mutants under the following conditions: temperature 300 K, simulation time 100 ns, and explicit water solvent model.

[0155] The structural stability of wild-type proteins and optimal candidate mutants is as follows: Figure 2 As shown, the RMSD value of the wild-type protein is approximately 4.0 Å, while the RMSD value of the optimal candidate mutant is approximately 2.2 Å, indicating that the optimal candidate mutant has a smaller overall structural shift and higher stability.

[0156] The above description is merely an optional embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent structural transformations made using the content of the present invention under the concept of the present invention, or direct / indirect applications in other related technical fields, are included within the patent protection scope of the present invention.

Claims

1. A Monte Carlo optimization method for screening protein thermostability by incorporating evolutionary information, characterized in that, Includes the following steps: S1. Screen candidate mutation sites with potential thermal stability enhancement effects, and determine the allowable mutation set for each candidate mutation site; S1 includes the following sub-steps: S1-1. Perform multiple sequence alignment on the target protein and its homologous sequences, calculate the amino acid distribution probability at each site and obtain the information entropy H(i) of the site, construct the mutation tendency scoring function S(i,a) of the mutation tolerance and mutation evolution preference of the fusion site, and screen out sites with information entropy H(i) greater than the preset information entropy threshold and scores greater than the preset score threshold as preliminary candidate mutation sites. S1-2. Calculate the difference in folding free energy of single-point mutations for the initial screening candidate mutation sites in S1-1, and select sites with a free energy difference of less than 0 as the final candidate mutation sites. The formula for calculating the information entropy H(i) of the site in S1-1 is as follows: ; in, This represents the probability of amino acid a being observed at the i-th site in a protein sequence, where a represents the amino acid type. The expression for the mutation propensity scoring function S(i,a) is as follows: ; in, This indicates wild-type amino acids. This indicates that at the i-th site, the wild-type amino acid... The probability of occurrence of , where α and β are weight parameters, 0≤α≤1, 0≤β≤1, α+β=1; S2. Based on the candidate mutation sites determined in S1, construct a constrained mutation combination space, and generate a set of candidate mutation combinations through a reproducible random sampling strategy. S3. Obtain the energy score of each candidate mutation combination, and use the Monte Carlo search algorithm to iteratively optimize the candidate mutation combinations and select the optimal candidate mutation combination set. S4. Perform molecular dynamics simulations on the mutants of each candidate mutation combination in the optimal candidate mutation combination set, and verify and determine the protein mutants with improved actual thermal stability through structural stability indices.

2. The Monte Carlo optimization protein thermostability screening method based on fused evolutionary information according to claim 1, characterized in that, S2 includes the following sub-steps: S2-1. Let the set of candidate mutation sites be P = {p1, p2, ..., p...} i , ..., p N }, where N is the total number of candidate mutation sites, p i For the i-th candidate mutation site, any mutation combination C = {p i1 p i2 , ..., p ik } represents a subset of k candidate mutation sites selected from set P; S2-2. Introduce a deterministic random seed s, initialize the random number generator, and achieve the reproducibility of the sampling process; S2-3. Select the number of candidate mutation sites k using a random number generator, perform sampling without replacement in the candidate mutation site set P to generate a single candidate mutation combination C, repeat the sampling without replacement operation to generate a candidate mutation combination set containing M candidate mutation combinations.

3. The Monte Carlo optimization protein thermostability screening method according to claim 2, characterized in that, In S2-1, the value range of k is 2≤k≤6.

4. The Monte Carlo optimization protein thermostability screening method according to claim 1, characterized in that, S3 includes the following sub-steps: S3-1. Calculate the protein folding free energy corresponding to each candidate mutation combination in the candidate mutation combination set, and use it as the energy score of each candidate mutation combination. S3-2. Randomly select a candidate mutation combination from the candidate mutation combination set as the initial candidate mutation combination, and calculate the total energy of the initial candidate mutation combination; S3-3. Randomly select several candidate mutation sites in the current candidate mutation combination for adjustment, generate a new candidate mutation combination, and calculate the total energy of the new candidate mutation combination. S3-4. Determine whether to accept a new candidate mutation combination as the current candidate mutation combination. The acceptance rules are as follows: If the total energy of the new candidate mutation combination is less than the total energy of the current candidate mutation combination, then accept it directly; otherwise, accept it with the preset acceptance probability. S3-5. Repeat S3-3 to S3-4 for multiple rounds of iteration. After each round of iteration, record the candidate mutation combination with the lowest total energy and its total energy. S3-6. When the iteration reaches the preset number of iterations, stop the iteration. Sort the candidate mutation combinations of each iteration in the record from low to high according to the total energy. Select the candidate mutation combinations with the lowest preset number or preset proportion of total energy as the optimal candidate mutation combination set.

5. The Monte Carlo optimization protein thermostability screening method according to claim 4, characterized in that, The total energy of the initial candidate mutation combination and the total energy of the new candidate mutation combination are equal to the overall folding free energy of the protein corresponding to the candidate mutation combination.

6. The Monte Carlo optimization protein thermostability screening method according to claim 5, characterized in that, In S3-4, the preset acceptance probability is adjusted by the temperature parameter kT, and the formula for calculating the preset acceptance probability is: Preset acceptance probability = ; Among them, E current This represents the total energy of the current candidate mutation combinations. This represents the total energy of the new candidate mutation combinations.

7. The Monte Carlo optimization protein thermostability screening method according to claim 1, characterized in that, S4 includes the following sub-steps: S4-1. Based on Newton's equation of motion, molecular dynamics simulations were performed on the mutants of each candidate mutation combination in the optimal candidate mutation combination set to obtain the conformational evolution trajectory of the protein. S4-2. Calculate the root mean square offset (RMSD) of the mutant and wild-type protein for each candidate mutation combination in S4-1. Compare the RMSD values ​​of the mutant and wild-type. If the RMSD value of the mutant is less than the RMSD value of the wild-type protein, then the mutant of the candidate mutation combination is the final protein mutant with improved thermal stability.

8. The Monte Carlo optimization protein thermostability screening method according to claim 7, characterized in that, The formula for calculating the root mean square offset (RMSD) is: RMSD(t)= ; in, Indicates the initial structure coordinates. Let represent the coordinate of the i-th atom at time t, N represent the total number of atoms, and t represent time.