Method and system for efficient screening of high activity isozymes
By employing progressive hierarchical clustering and molecular simulation, the problems of sequence redundancy and functional misannotation in the screening of highly active isoenzymes were solved, achieving efficient screening and reducing experimental costs, and providing highly active enzymes for industrial applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INST OF AGRO FOOD SCI & TECH CHINESE ACADEMY OF AGRI SCI
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies face challenges in screening highly active isoenzymes, including high sequence redundancy, high risk of functional misannotation, and the trade-off between sequence diversity and functional relevance. These challenges make it difficult to achieve efficient screening and reduce experimental costs within a wide similarity range.
A progressive hierarchical clustering method combined with molecular simulation was adopted. The clustering threshold was gradually reduced through multiple rounds of clustering. The sequences of adjacent clusters were systematically sampled, and the binding free energy of candidate enzymes was evaluated by combining molecular simulation to screen out highly active isoenzymes.
It significantly improves the efficiency of discovering highly active enzymes, reduces experimental costs, enhances the functional reliability of candidate enzymes, shortens the screening cycle, and provides high-value-added enzyme elements for industrial applications.
Figure CN122201503A_ABST
Description
Technical Field
[0001] This invention relates to the field of bioenzyme engineering technology. More specifically, this invention relates to a method and system for efficiently screening highly active isoenzymes.
Background Technology
[0002] Isozymes are enzyme proteins that catalyze the same chemical reaction (i.e., have the same EC number at the first three or four levels of classification) but differ in their primary amino acid sequences. They have irreplaceable core value in industries such as industrial catalysis, biopharmaceuticals, environmental remediation, and biomanufacturing. Specifically, isozymes include orthologous enzymes (enzymes originating from different species, vertically inherited from a common ancestral gene, and retaining the same catalytic function), paralogous enzymes (enzymes produced within the same species through gene duplication events and retaining catalytic activity on the same substrates after sequence divergence), and analogous enzymes (enzymes that have acquired the same catalytic function through convergent evolution but have no direct homology).
[0003] However, the acquisition of high-performance isoenzymes has long faced screening challenges: on the one hand, traditional methods of isolating microorganisms from nature and verifying enzyme activity are extremely inefficient, with research and development cycles lasting several years and a success rate of less than 1%, which is difficult to meet the needs of the rapidly developing biomanufacturing industry; on the other hand, with the explosive growth of genome sequencing data, a massive number of uncharacterized sequences have accumulated in protein sequence databases. How to accurately identify novel enzymes with both high activity and high stability from hundreds of thousands or even millions of candidate sequences has become a key bottleneck restricting downstream industrial applications.
[0004] To address this challenge, bioinformatics retrieval methods based on sequence similarity (such as BLAST, PSI-BLAST, and HMMER) have been widely used for homologous enzyme mining. However, current practice shows that relying solely on sequence similarity for homology retrieval has serious drawbacks. A systematic study on enzyme functional conservation found that the indicative role of sequence similarity in enzyme function is severely overestimated: even among homologous pairs with sequence identity greater than 50%, less than 30% have completely identical enzymatic functions (based on complete EC number consistency); more seriously, even a stringent threshold such as a BLAST E-value below 10^-50 is still insufficient to ensure the accurate automatic transfer of enzyme function. This means that many candidate enzymes obtained by screening on high sequence similarity are actually at risk of functional annotation errors, while distant homologous enzymes with low similarity but conserved function or even superior activity are systematically missed. In other words, traditional methods essentially linger in a narrow local space of known high-similarity sequences, making it difficult to break through the constraints of the local catalytic activity space.
[0005] Specifically, existing isoenzyme screening methods based on sequence similarity suffer from at least three significant technical drawbacks. First, sequence redundancy is extremely high and novelty is severely lacking. Taking non-ribosomal peptide synthetases as an example, a single PSI-BLAST homology search can yield 15,000 to 76,000 candidate sequences, and even after simple redundancy removal, over a thousand sequences remain on the verification list. These sequences are highly homologous to the seed enzyme and exhibit extremely low sequence diversity, resulting in a severe imbalance between the workload and the output of experimental verification. Second, the risk of functional misannotation remains high. Studies have shown that homology searches based on short fragment similarity are highly prone to introducing non-homologous or paralogous sequences. These sequences do not possess the target catalytic function but are misclassified as functional homologs due to their high similarity, leading to 10% to 30% errors in genome annotation. Third, the trade-off between sequence diversity and functional relevance remains unresolved. If the similarity threshold is relaxed to increase sequence diversity, a large number of functionally irrelevant sequences will inevitably be mixed in, increasing the burden of subsequent calculations and experiments; if the threshold is tightened to ensure functional relevance, the screening range will be compressed to close relatives of the seed enzyme, making it difficult to discover novel enzyme variants with disruptive activity enhancements.
[0006] To alleviate these problems, researchers have recently attempted to introduce structural similarity filtering or virtual screening post-processing strategies. For example, they have combined AlphaFold-predicted structures with TM-score structural comparison to reduce the candidate enzyme set, or used methods such as MM/GBSA to re-score the binding free energy of docking conformations. However, these methods are all passive filtering performed after the homology search has been completed and a redundant candidate set has been obtained, and they do not address the fundamental contradiction between sequence space traversal and functional relevance assurance at the level of the top-level design of the screening strategy. Moreover, methods such as MM/GBSA require fully equilibrated explicit solvent systems, resulting in high computational costs and making them difficult to apply directly to hundreds of thousands of candidate sequences. Meanwhile, enzyme engineering methods based on rational design or directed evolution can optimize enzyme activity within the local sequence space, but the construction and screening of their iterative mutant libraries still rely heavily on high-throughput experimental platforms and are prone to getting trapped in local catalytic activity optima, making it difficult to escape the known sequence framework.
[0007] In summary, it remains difficult to establish an intelligent screening method in the initial stage of homologous sequence retrieval that can systematically traverse a wide similarity range, take into account both functional reliability and sequence diversity, and significantly reduce the burden of subsequent computation and experimental verification. Furthermore, traditional methods, lacking a structured sampling mechanism for the homologous sequence space, have long been trapped in the dilemma of redundancy at high similarity and loss of control at low similarity, making it difficult to achieve efficient discovery of high-activity and high-stability isozymes.
Summary of the Invention
[0008] One object of the present invention is to solve at least the above-mentioned problems and to provide at least the advantages that will be described later.
[0009] Another objective of this invention is to provide a method for efficiently screening highly active isoenzymes. This method uses progressive hierarchical clustering to focus on seed clusters in rounds and systematically sample adjacent clusters, achieving a comprehensive traversal of sequence space within a wide similarity range of 20% to 90%. It takes into account both functional relevance and diversity, overcomes the limitations of high similarity, significantly improves the efficiency of high-activity enzyme discovery, and reduces experimental costs.
[0010] To achieve these objectives and other advantages according to the present invention, a method for efficiently screening highly active isoenzymes is provided, comprising:
S1. Using at least one seed enzyme sequence with a known function as the query sequence, perform homologous sequence retrieval and preliminary screening to obtain a preliminary screening sequence set filtered by length and global sequence similarity;
S2. Perform multiple rounds of progressive hierarchical clustering on the preliminary screening sequence set to obtain a candidate enzyme sequence set;
S3. Perform a performance evaluation of the sequences in the candidate enzyme sequence set based on molecular simulation, and screen out candidate sequences with binding free energy superior to the seed enzyme as priority targets for experimental verification of highly active isoenzymes.
In step S2, the multi-round progressive hierarchical clustering includes:
Each round of clustering uses a different clustering threshold, and the clustering threshold decreases in each round;
After each round of clustering, only the sequences within the cluster containing the seed enzyme sequence are passed to the next round of clustering;
Except for the last round, after each round of clustering, the central sequences of the clusters adjacent to the cluster containing the seed enzyme sequence are extracted as candidate enzyme sequences;
In the first round of clustering, adjacent clusters of the cluster containing the seed enzyme sequence whose number of sequences exceeds a preset size are re-clustered separately at a threshold below the first-round clustering threshold, and their central sequences are extracted as candidate enzyme sequences;
In the final round of clustering, the central sequences of all clusters are extracted as candidate enzyme sequences;
All candidate enzyme sequences and sequences within the cluster of the first-round seed enzyme sequences are merged to form a candidate enzyme sequence set.
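The round structure of step S2 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes a user-supplied distance function, swaps in single-linkage clustering at a distance cutoff for whatever hierarchical method is actually used, takes the medoid as the "central sequence", treats every non-seed cluster as adjacent, and omits the separate re-clustering of large adjacent clusters. Following the worked example, the seed sequences are appended to the final set. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def cluster_at(names, dist, cutoff):
    """Single-linkage clusters: connected components of pairs with dist <= cutoff."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if dist(a, b) <= cutoff:
            parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

def medoid(cluster, dist):
    """Member with the smallest summed distance to the rest (assumed 'central sequence')."""
    return min(cluster, key=lambda c: sum(dist(c, o) for o in cluster))

def progressive_clustering(names, dist, cutoffs, seeds):
    """cutoffs must decrease from round to round; returns candidate names in pick order."""
    pool, candidates = list(names), []
    for rnd, cutoff in enumerate(cutoffs):
        clusters = cluster_at(pool, dist, cutoff)
        seed_clusters = [c for c in clusters if any(s in c for s in seeds)]
        if rnd == len(cutoffs) - 1:
            picked = clusters  # final round: centers of all clusters
        else:
            picked = [c for c in clusters if c not in seed_clusters]
            pool = [n for c in seed_clusters for n in c]  # only seed clusters go on
        candidates += [medoid(c, dist) for c in picked]
    for s in seeds:  # make sure the seeds are part of the final set
        if s not in candidates:
            candidates.append(s)
    return candidates

# Toy data: sequences reduced to points on a line, distance = absolute difference.
pos = {"seed": 0, "a": 1, "b": 3, "c": 10, "d": 11, "e": 30}
toy_dist = lambda x, y: abs(pos[x] - pos[y])
result = progressive_clustering(list(pos), toy_dist, [5, 1.5, 0.5], ["seed"])
print(result)
```

In the toy run, each round narrows the pool to the seed's own cluster while one representative per outlying cluster is kept, which is the diversity-versus-relevance trade the method describes.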
[0011] In the above technical solution, the adjacent clusters refer to one or more clusters that are directly adjacent to the cluster containing the seed enzyme sequence in the resulting clustering tree diagram after hierarchical clustering of the sequence set. The adjacent clusters are determined by one or a combination of the following methods: First, based on the clustering tree topology, when traversing from the root node to the leaf node in the clustering tree, clusters directly connected to the parent or sibling node of the cluster containing the seed enzyme sequence are considered adjacent clusters; second, based on a clustering distance threshold, if the sequence similarity or distance metric between the central sequences of two clusters is within a preset proximity interval under a given clustering threshold, they are considered adjacent clusters. Clusters with more than a preset size of sequences among the adjacent clusters of the seed enzyme sequence can be called large adjacent clusters. Generally, adjacent clusters with more than 20-40% of the total number of sequences in the initial screening sequence set are called large adjacent clusters.
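As a rough illustration of the second adjacency criterion and of the "large adjacent cluster" rule, the sketch below flags clusters whose center lies within a proximity interval of the seed cluster's center and clusters holding more than a chosen fraction of the initial set. The interval bounds, cluster labels, sizes, and the 30% default are illustrative assumptions; 30% of the 5750 sequences used in the worked example gives the 1725 figure quoted there.

```python
def adjacent_clusters(seed_center, centers, dist, lo, hi):
    """Clusters whose center-to-seed-center distance falls inside [lo, hi]."""
    return [cid for cid, c in centers.items() if lo <= dist(seed_center, c) <= hi]

def large_clusters(cluster_sizes, total, fraction=0.30):
    """Clusters holding more than `fraction` of the initial screening set."""
    return [cid for cid, n in cluster_sizes.items() if n > fraction * total]

# Hypothetical centers placed on a line; distance = absolute difference.
centers = {"c1": 2.0, "c2": 9.0, "c3": 40.0}
near = adjacent_clusters(0.0, centers, lambda a, b: abs(a - b), lo=1.0, hi=10.0)
big = large_clusters({"c5": 1800, "c1": 40}, total=5750)
print(near, big)
```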
[0012] Preferably, the number of rounds of the multi-round progressive hierarchical clustering is at least 3 rounds.
[0013] Preferably, the preliminary screening in step S1 includes: A length screening range is set based on the amino acid length of the seed enzyme sequence, and sequences whose length falls outside the length screening range are removed. The length screening range is the amino acid length of the seed enzyme sequence ±10% to ±20%. A similarity screening range is set based on global sequence similarity, and sequences whose similarity to the seed enzyme sequence is within the similarity screening range are retained. The similarity screening range is 20% to 90%. For sequences from the same species, only the sequence with the highest similarity to the seed enzyme sequence is retained.
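The three filters above can be sketched as a single pass over the hits. This is a minimal sketch, assuming each hit already carries its species, length, and global identity to the seed; the `Hit` record, the 15% tolerance default, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    name: str
    species: str
    length: int
    identity: float  # global identity to the seed, in percent

def prefilter(hits, seed_length, length_tol=0.15, ident_range=(20.0, 90.0)):
    """Apply length, similarity, and one-per-species filters in that order."""
    lo_len = seed_length * (1 - length_tol)
    hi_len = seed_length * (1 + length_tol)
    best = {}  # species -> best hit so far
    for h in hits:
        if not (lo_len <= h.length <= hi_len):
            continue  # outside the length window
        if not (ident_range[0] <= h.identity <= ident_range[1]):
            continue  # too distant or too redundant
        cur = best.get(h.species)
        if cur is None or h.identity > cur.identity:
            best[h.species] = h  # keep the most similar hit per species
    return list(best.values())

hits = [
    Hit("A", "sp1", 340, 45.0),
    Hit("B", "sp1", 350, 60.0),   # replaces A (same species, higher identity)
    Hit("C", "sp2", 500, 50.0),   # rejected: length
    Hit("D", "sp3", 345, 95.0),   # rejected: identity above 90%
    Hit("E", "sp4", 330, 25.0),
]
kept = [h.name for h in prefilter(hits, seed_length=350)]
print(kept)
```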
[0014] Preferably, the homologous sequence search in step S1 is performed against the UniRef90 protein sequence database using the HMMER program.
[0015] Preferably, the performance evaluation based on molecular simulation in step S3 specifically includes:
S31. Perform protein three-dimensional structure prediction on all candidate enzyme sequences in the candidate enzyme sequence set;
S32. Perform molecular docking between the predicted three-dimensional structure of each candidate enzyme sequence and the target substrate to obtain the preliminary structure of each candidate enzyme sequence-substrate complex;
S33. Perform molecular dynamics simulation to optimize the preliminary structure of each candidate enzyme sequence-substrate complex;
S34. Based on the optimized structure of each candidate enzyme sequence-substrate complex, calculate the binding free energy of each candidate enzyme sequence to the substrate using a free energy calculation method.
[0016] Preferably, in step S31, AlphaFold3 is used to predict the three-dimensional structure of the protein; in step S32, AutoDock Vina is used for molecular docking; in step S33, GROMACS is used for molecular dynamics simulation; and in step S34, the free energy calculation method is the MM/GBSA method.
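Once step S34 has produced a binding free energy for every candidate, the selection rule of step S3 reduces to a comparison against the seed enzyme. The sketch below assumes that "superior" means a lower (more negative) binding free energy; the energy values and the names other than VvCura are invented for illustration only.

```python
def prioritize(binding_energy, seed):
    """Return candidates that bind more favorably than the seed, best first.

    binding_energy maps sequence name -> binding free energy (e.g. kcal/mol);
    lower values are assumed to mean stronger binding.
    """
    ref = binding_energy[seed]
    better = {n: e for n, e in binding_energy.items() if n != seed and e < ref}
    return sorted(better, key=better.get)

# Hypothetical energies, not results from the source.
energies = {"VvCura": -30.0, "cand1": -35.5, "cand2": -28.0, "cand3": -41.2}
ranked = prioritize(energies, "VvCura")
print(ranked)
```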
[0017] Preferably, between steps S32 and S33, a conservation analysis and structural rationality assessment are performed on the preliminary structure of each candidate enzyme sequence-substrate complex. Candidate enzyme sequences with mutations or deletions in key catalytic sites, deletions in key domains, or structural abnormalities are eliminated to obtain reasonable preliminary structure of candidate enzyme sequence-substrate complex. The conservation analysis is performed by analyzing the conservation of known key catalytic amino acid sites of the seed enzyme in the candidate enzyme sequence through multiple sequence alignment. The structural rationality assessment is performed by analyzing the integrity of known key domains of the seed enzyme in the predicted structure of the candidate enzyme sequence through three-dimensional structural visualization. For candidate enzyme-substrate complex structures that do not meet the structural rationality assessment, after molecular dynamics simulation optimization in step S33, step S32 is re-executed to obtain a new candidate enzyme-substrate complex structure, and then step S34 is executed.
[0018] Preferably, the preliminary structure of the reasonable candidate enzyme sequence-substrate complex is determined by comparing it with the complex conformation of reported isoenzymes. For enzymes for which no isoenzymes have been reported, the following requirements must be met: ensuring catalytic mechanism compatibility, meaning that the spatial conformation of the ligand for the enzyme active site (key catalytic amino acid) cannot violate the known catalytic mechanism; ensuring conformational flexibility to induce fit, as ligand binding is often accompanied by protein conformational changes and cannot rely entirely on the static structure of the unbound state; and assessing whether the movement of side chains or loop regions has been reasonably considered.
[0019] Preferably, the method for efficiently screening highly active isoenzymes further includes experimentally verifying the high-activity isoenzymes screened in step S3, confirming their relative catalytic activity by enzyme activity assay, and obtaining highly active isoenzymes.
[0020] This invention further claims a system for efficiently screening highly active isoenzymes, comprising:
A homology retrieval and preliminary screening module, used to perform homologous sequence retrieval and preliminary screening using at least one seed enzyme sequence with a known function as the query sequence, and to obtain a preliminary screening sequence set after filtering by length and global sequence similarity;
A progressive hierarchical clustering module, used to perform multiple rounds of progressive hierarchical clustering on the preliminary screening sequence set to obtain a candidate enzyme sequence set, wherein:
Each round of clustering uses a different clustering threshold, and the clustering threshold decreases in each round;
After each round of clustering, only the sequences within the cluster containing the seed enzyme sequence are passed to the next round of clustering;
Except for the last round, after each round of clustering, the central sequences of the clusters adjacent to the cluster containing the seed enzyme sequence are extracted as candidate enzyme sequences;
In the first round of clustering, adjacent clusters of the cluster containing the seed enzyme sequence whose number of sequences exceeds a preset size are re-clustered separately at a threshold below the first-round clustering threshold, and their central sequences are extracted as candidate enzyme sequences;
In the final round of clustering, the central sequences of all clusters are extracted as candidate enzyme sequences;
All candidate enzyme sequences and sequences within the cluster of the first-round seed enzyme sequences are merged and output as the candidate enzyme sequence set;
A molecular simulation evaluation module, used to perform performance evaluation of the sequences in the candidate enzyme sequence set based on molecular simulation, and to obtain the binding free energy of each candidate enzyme sequence to the target substrate;
A screening output module, used to screen out candidate sequences with binding free energy superior to the seed enzyme as the priority targets for experimental verification of highly active isoenzymes.
The molecular simulation evaluation module includes:
A structure prediction unit, used to predict the three-dimensional protein structure of all candidate enzyme sequences in the candidate enzyme sequence set;
A molecular docking unit, used to perform molecular docking between the predicted three-dimensional structure of each candidate enzyme sequence and the target substrate to obtain the preliminary structure of each candidate enzyme sequence-substrate complex;
A dynamics simulation unit, used to perform molecular dynamics simulation to optimize the preliminary structure of each candidate enzyme sequence-substrate complex;
A binding free energy calculation unit, used to calculate the binding free energy between each candidate enzyme sequence and the substrate based on the optimized candidate enzyme sequence-substrate complex structure using a free energy calculation method.
[0021] The present invention further claims an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method for efficiently screening highly active isoenzymes.
[0022] The present invention further claims a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method for efficiently screening highly active isoenzymes.
[0023] The present invention has at least the following beneficial effects: Firstly, this invention combines an initial screening over a wide similarity range of 20% to 90% with progressive hierarchical clustering. In each round, only the seed cluster is retained and adjacent clusters are systematically sampled. While traversing the homologous sequence space, functional relevance is precisely controlled, breaking through the high-similarity limitation of traditional methods, increasing sequence diversity by more than 5 times, and significantly improving the probability of discovering innovative enzymes. Secondly, this invention integrates AlphaFold3 batch structure prediction, AutoDock Vina docking, and MM/GBSA free energy calculation. Through molecular simulation, it greatly reduces the number of sequences to be verified experimentally, reduces wet-lab experiment costs by 90%, and shortens the screening cycle by two-thirds. Thirdly, this invention introduces a dual filtering mechanism based on the conservation analysis of key catalytic sites and the visualization of structural domains such as the Y62 loop, which accurately eliminates mutated or structurally abnormal sequences, avoids ineffective experimental input, and improves the functional reliability of candidate enzymes. Fourth, this invention constructs a closed-loop system covering the entire chain of computational screening, binding energy ranking, and enzyme activity experiments, ensuring that the highly active enzymes finally output have real catalytic advantages and can be directly used for the construction of industrial strains, providing high-value-added enzyme elements for pharmaceutical, environmental protection and other fields.
[0024] Other advantages, objectives and features of the present invention will become apparent in part from the following description, and in part to those skilled in the art through study and practice of the invention.
Attached Figure Description
[0025] Figure 1 shows the 8 clusters obtained in the first clustering at a cutoff value of 180 in one embodiment of the present invention;
Figure 2 shows the 8 clusters obtained in the first clustering at a cutoff value of 120 in one embodiment of the present invention;
Figure 3 shows the 10 clusters obtained in the second clustering in one embodiment of the present invention;
Figure 4 shows the 6 clusters obtained in the third clustering in one embodiment of the present invention;
Figure 5 shows the 5 clusters obtained in the fourth clustering in one embodiment of the present invention;
Figure 6 is a visualization analysis diagram of the sequence structure in one embodiment of the present invention;
Figure 7 is a bar chart showing the relative activity of the five genes superior to VvCura in one embodiment of the present invention.
Detailed Implementation
[0026] The present invention will now be described in further detail with reference to specific embodiments, so that those skilled in the art can implement it based on the description.
[0027] It should be understood that terms such as “having,” “comprising,” and “including” as used herein do not exclude the presence or addition of one or more other elements or combinations thereof.
[0028] Example: Screening for highly active tetrahydrocurcumin reductase
1. Similarity sequence retrieval
Two previously reported tetrahydrocurcumin reductase sequences were used as seed sequences for the search. One is from Escherichia coli (strain K12) (named EcCura, accession: P76113), and its amino acid sequence is:
MGQQKQRNRRWVLASRPHGAPVPENFRLEEDDVATPGEGQVLLRTVYLSLDPYMRGRMSDEPSYSPPVDIGGVMVGGTVSRVVESNHPDYQSGDWVLGYSGWQDYDISSGDDLVKLGDHPQNPSWSLGVLGMPGFTAYMGLLDIGQPKEGETLVVAAATGPVGATVGQIGKLKGCRVVGVAGGAEKCRHATEVLGFDVCLDHHADDFAEQLAKACPKGIDIYYENVGGKVFDAVLPLLNTSARIPVCGLVSSYNATELPPGPDRLPLLMATVLKKRIRLQGFIIAQDYGHRIHEFQREMGQWVKEDKIHYREEITDGLENAPQTFIGLLKGKNFGKVVIRVAGDD
The other is from Vibrio vulnificus (strain MO6-24/O) (named VvCura, accession: A0A4V8GZL4), and its amino acid sequence is:
ARQIVLASRPVGAPTAENFALTQSDIPTPAQGEMLLRSVYLSLDPYMRGRMSDAKSYAEPVGIDEVMVGGTVCQVEASNHAEFEVGEWVLAYTGWQDYAISDGEGLIKLGKQPSHPSYALGVMGMPGFTAYMGLLDIGQPKEGDTLVVAAATGAVGSMVGQIGKLKGCRVIGIAGGEEKCQFAKDTLGFDECIDHKAADFAEQLAKVCHNGIDIYFENVGGKVFDAVMPLLNTGARIPLCGLISQYNATSLPEGPDRMSMLMAQLLIKRIKMQGFIIFDDYGHRYGEFAADMTQWLAQGKIHYREHLVQGLENAPDAFIGLLEGKNFGKMVVQTNQP
[0029] Using the HMMER 3.3.2 program with these two seed sequences as search probes, a similarity sequence search was performed against the UniRef90 database, yielding a total of 137,020 similar sequences. The HMMER program, based on Hidden Markov Models (HMMs), can efficiently and accurately screen protein databases for sequences homologous to the seed sequences. Its core principle is to construct a profile HMM of the seed sequences that describes the characteristics of conserved and variable regions, and to use it to score and filter sequences in the database, retaining those that meet the homology criteria.
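HMMER search programs can write a whitespace-delimited per-target table (the `--tblout` output), in which the first field is the target name, the third is the query name, and the fifth and sixth are the full-sequence E-value and bit score. The sketch below parses such a table into hit tuples. The source does not state which HMMER program, options, or E-value cutoff were used, so the `1e-5` default and the sample records are purely illustrative assumptions.

```python
def parse_tblout(lines, max_evalue=1e-5):
    """Parse HMMER --tblout lines into (target, query, evalue, score) tuples."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split(maxsplit=18)  # the 19th field is the free-text description
        target, query = f[0], f[2]
        evalue, score = float(f[4]), float(f[5])
        if evalue <= max_evalue:
            hits.append((target, query, evalue, score))
    return hits

# Hypothetical records in tblout layout, not real search output.
sample = [
    "# target name  accession  query name  accession  E-value  score  bias ...",
    "UniRef90_X1 - EcCura - 1.2e-80 270.1 0.1 1.5e-80 269.8 0.1 1.0 1 0 0 1 1 1 1 hypothetical protein",
    "UniRef90_X2 - EcCura - 0.02 12.3 0.0 0.03 11.9 0.0 1.1 1 0 0 1 1 0 0 weak hit",
]
parsed = parse_tblout(sample)
print(parsed)
```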
[0030] 2. Preliminary Sequence Filtering
To remove redundant sequences and sequences that do not conform to the basic characteristics of tetrahydrocurcumin reductase, the 137,020 sequences retrieved were filtered based on the following three criteria:
(1) Amino acid length screening: Based on the amino acid lengths of the two seed sequences (both around 350), the screening range was set to 300-400 amino acids. The basis for setting this range is that the functional domains of enzymes usually have a relatively stable length range. Sequences outside this range may lack key functional structures or contain irrelevant domains, thereby affecting enzyme activity.
[0031] (2) Sequence similarity screening: Sequences with a similarity of 20% to 90% with the seed sequence are retained. Sequences with a similarity of less than 20% have too low homology with the seed sequence and are unlikely to have the function of tetrahydrocurcumin reductase; sequences with a similarity of more than 90% have high redundancy and lack diversity, which is not conducive to screening out enzyme sequences with novel and high activity.
[0032] (3) Source uniqueness screening: For sequences from the same species, only the sequence with the highest similarity to the seed sequence is retained to avoid over-enrichment of sequences from the same species and improve screening efficiency.
[0033] After the above filtering, a total of 5750 sequences were obtained, laying the foundation for subsequent cluster analysis.
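The similarity criterion in (2) depends on how global sequence identity is computed, which the source does not specify. One common approach is a Needleman-Wunsch global alignment followed by counting identical columns; the sketch below does this with a simple match/mismatch/gap scoring scheme and divides by the alignment length. The scoring values and the choice of denominator are assumptions, and a production pipeline would normally use a substitution matrix and an established alignment tool.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Percent identity over a Needleman-Wunsch global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns and total columns.
    i, j, same, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                same += a[i - 1] == b[j - 1]
                i, j, cols = i - 1, j - 1, cols + 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        cols += 1
    return 100.0 * same / cols if cols else 0.0

print(global_identity("ACDE", "ACFE"))
```

A hit would then pass criterion (2) when this value lies between 20 and 90.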
[0034] 3. Multi-round progressive hierarchical clustering
A four-round iterative hierarchical clustering method was used to further filter the 5750 sequences. The core idea of this stacked clustering strategy is: after each round of hierarchical clustering, the cluster containing the seed sequence and its neighboring clusters are re-clustered according to a set cutoff value; simultaneously, the center sequences of each cluster in each round are counted as candidate validation sequences. It should be noted that the center sequence of the cluster containing the seed sequence is not counted before the final round because that cluster needs to enter the next round of clustering. Furthermore, after the initial clustering, clusters adjacent to the seed sequence and with a large number of sequences are re-clustered separately to avoid missing potentially high-value sequences due to coarse clustering granularity. The specific steps are as follows:
(1) First clustering: The cutoff value was set to 180 and hierarchical clustering was performed on all sequences. As shown in Figure 1, a total of 8 clusters were obtained. The center sequences of clusters 1 and 2 were selected as candidate sequences, resulting in two candidate sequences: uniref90_k6wz18 and uniref90_a0a524jp42. The seed sequences are located in clusters 3 and 4, and these sequences were retained for the second clustering. Clusters 5-8 are clusters adjacent to the seed sequences and with a large number of sequences (more than 1725 sequences). Considering that the clustering results are relatively coarse at cutoff=180, which may mask high-value sequences, the sequences of these four clusters were merged and clustered separately with a cutoff value of 120. As shown in Figure 2, 8 clusters were obtained. The center sequences of each cluster were selected as candidate sequences, namely uniref90_a0a6j3d4t8, uniref90_a0a6a6jlu0, uniref90_a0a1w5d4m7, uniref90_upi00063f4c38, uniref90_a0a7j7c3p4, uniref90_a0a0c9qi54, uniref90_g3tcp8, and uniref90_a0a8d3a317.
[0035] (2) Second clustering: The sequences of clusters 3 and 4 retained from the first clustering were hierarchically clustered with a cutoff value of 90. As shown in Figure 3, a total of 10 clusters were obtained. The seed sequences were located in clusters 1 and 2, and these sequences were retained for the third clustering. The central sequences of clusters 3-10 were selected as candidate sequences, resulting in 8 candidate sequences: uniref90_upi000b4a7dea, uniref90_a0a328zcc6, uniref90_a0a365u866, uniref90_a0a258les4, uniref90_a0a3m1fd94, uniref90_a0a1h0ewy5, uniref90_upi0022703255, and uniref90_upi00053ab2d9.
[0036] (3) Third clustering: The sequences of clusters 1 and 2 retained from the second clustering were clustered with a cutoff value of 60. As shown in Figure 4, a total of 6 clusters were obtained. The seed sequences were located in cluster 4, and the sequences of this cluster were retained for the fourth clustering. The center sequences of clusters 1-3 and 5-6 were selected as candidate sequences, resulting in 5 candidate sequences, namely uniref90_upi001ff4fe89, uniref90_upi000e253540, uniref90_upi000831b258, uniref90_a0a653bla3, and uniref90_upi000be6002a.
[0037] (4) Fourth clustering: The sequences of cluster 4 retained from the third clustering were clustered for the final time with a cutoff value of 40. As shown in Figure 5, a total of 5 clusters were obtained. The central sequences of all clusters were selected as candidate sequences, resulting in 5 candidate sequences, namely uniref90_a0a977iyz3, uniref90_a0a381nl70, uniref90_upi0022322023, uniref90_a0a1m5yq67 and uniref90_upi001c45f2c2.
[0038] After four levels of hierarchical clustering analysis, a total of 30 sequences to be verified were obtained (including two seed sequences). These sequences are evolutionarily representative and diverse, providing a high-quality sequence library for subsequent structural and functional analysis.
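The total of 30 can be checked directly from the per-round counts reported in the four clustering steps. The snippet below is only a bookkeeping aid; the dictionary labels are descriptive names chosen here, while the numbers are those stated in the example.

```python
# Candidate center sequences reported for each clustering step of the example.
round_candidates = {
    "round 1 (cutoff 180, clusters 1-2)": 2,
    "round 1 re-clustering (cutoff 120)": 8,
    "round 2 (cutoff 90, clusters 3-10)": 8,
    "round 3 (cutoff 60, clusters 1-3 and 5-6)": 5,
    "round 4 (cutoff 40, all clusters)": 5,
}
seed_sequences = 2  # EcCura and VvCura

total = sum(round_candidates.values()) + seed_sequences
print(total)  # 28 center sequences plus the 2 seeds
```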
[0039] 4. Protein structure prediction and molecular docking
(1) Protein structure prediction: Batch protein structure prediction was performed on the 30 sequences to be verified using AlphaFold3 (https://alphafoldserver.com/). AlphaFold3 is based on deep learning algorithms and combines co-evolutionary information of amino acid sequences, physicochemical properties and known protein structure databases to predict the three-dimensional structure of proteins with high accuracy. The confidence of the prediction results is evaluated by the pLDDT (predicted Local Distance Difference Test) value. The higher the pLDDT value, the more reliable the structure prediction.
[0040] (2) Molecular docking: The predicted protein structures were batch-docked with the substrate curcumin using AutoDock Vina 1.1.2 software to construct protein-substrate complex structures. AutoDock Vina is a highly efficient molecular docking program that predicts the optimal binding conformation of the ligand in the receptor's active pocket by calculating non-covalent interaction energies, such as hydrophobic interactions, hydrogen bonds, and electrostatic interactions, between the ligand (curcumin) and the receptor (enzyme protein). During the docking process, the protein structure was first preprocessed (water molecules were removed, hydrogen atoms were added, and charges were assigned), and the curcumin structure was optimized and its charges were calculated. Then, the docking box (covering the active pocket region of the protein) was set, and batch docking calculations were performed to obtain the optimal docking conformation of each protein with curcumin.
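Setting a docking box that covers the active pocket amounts to choosing a box center and edge lengths. A minimal sketch, assuming the pocket is described by a list of atom coordinates (for example from the key catalytic residues) and that a fixed padding is added on every side, is shown below. The 5 Å padding, the coordinates, and the file names are hypothetical; `receptor`, `ligand`, `center_x/y/z` and `size_x/y/z` are the standard AutoDock Vina configuration keys.

```python
def docking_box(pocket_coords, padding=5.0):
    """Axis-aligned box enclosing the pocket atoms, expanded by `padding` on each side."""
    xs, ys, zs = zip(*pocket_coords)
    center = tuple((max(v) + min(v)) / 2 for v in (xs, ys, zs))
    size = tuple((max(v) - min(v)) + 2 * padding for v in (xs, ys, zs))
    return center, size

def vina_config(receptor, ligand, center, size):
    """Render the box as AutoDock Vina configuration-file text."""
    axes = ("x", "y", "z")
    lines = [f"receptor = {receptor}", f"ligand = {ligand}"]
    lines += [f"center_{k} = {c:.3f}" for k, c in zip(axes, center)]
    lines += [f"size_{k} = {s:.3f}" for k, s in zip(axes, size)]
    return "\n".join(lines)

# Hypothetical pocket atom coordinates in angstroms.
center, size = docking_box([(0, 0, 0), (10, 4, 2)], padding=5.0)
config_text = vina_config("enzyme.pdbqt", "curcumin.pdbqt", center, size)
print(config_text)
```

For batch docking, the same function can be applied to each predicted structure after its pocket residues have been located, so that every candidate gets a box of comparable extent.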
[0041] 5. Sequence conservation and structural visualization analysis. Candidate sequences are further screened through sequence conservation analysis and visualization of structural features. Specifically, conserved amino acid sites crucial to catalytic function are identified from the known catalytic mechanism of the seed sequence, and multiple sequence alignment is performed on the candidate sequences to analyze the conservation of these key sites. If multiple key sites in a candidate sequence are mutated or deleted, the sequence is judged not to possess the expected catalytic function and is eliminated. In parallel, key structural domains affecting substrate binding and catalytic efficiency are identified from the known three-dimensional structure of the seed sequence, and the predicted structures of the candidate sequences are visualized. If a key structural domain is missing or shows significant structural abnormalities, the candidate sequence is judged to be a non-target sequence and is eliminated. This process yields a set of target sequences that meet both the sequence and the structural constraints. (1) Sequence conservation analysis (Table 1): According to the reported catalytic mechanism of tetrahydrocurcumin reductase, the key catalytic amino acids of the seed sequence VvCurA (Arg53, Arg55, Ser61, Tyr62, and Tyr251) are crucial to catalytic activity, and their high conservation is a prerequisite for catalytic function. Multiple sequence alignment was performed on the 30 sequences to be verified using ClustalX to analyze the conservation of these key amino acids. A sequence with multiple mutations or deletions at these positions is highly likely to lack tetrahydrocurcumin reductase activity and is removed.
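The key-site check on an alignment can be sketched as follows. This is a minimal sketch, assuming a pairwise view of the multiple alignment (the gapped seed row and one gapped candidate row of equal length) and assuming that "multiple" lost sites means two or more. The function names and the `max_lost` parameter are hypothetical.

```python
from typing import Dict


def key_site_status(ref_aln: str, cand_aln: str,
                    key_sites: Dict[int, str]) -> Dict[int, bool]:
    """Map each key site (1-based position in the ungapped reference)
    to True if the candidate keeps the reference residue there."""
    if len(ref_aln) != len(cand_aln):
        raise ValueError("aligned rows must have equal length")
    status: Dict[int, bool] = {}
    pos = 0  # position in the ungapped reference sequence
    for ref_res, cand_res in zip(ref_aln, cand_aln):
        if ref_res == "-":
            continue  # insertion in the candidate; no reference position
        pos += 1
        if pos in key_sites:
            if ref_res != key_sites[pos]:
                raise ValueError(f"reference has {ref_res} at {pos}, "
                                 f"expected {key_sites[pos]}")
            # A gap ("-") in the candidate counts as a deletion, i.e. not conserved.
            status[pos] = cand_res == ref_res
    missing = set(key_sites) - set(status)
    if missing:
        raise ValueError(f"key sites beyond reference length: {sorted(missing)}")
    return status


def passes_conservation(status: Dict[int, bool], max_lost: int = 1) -> bool:
    """Keep a candidate unless more than `max_lost` key sites are lost."""
    return sum(not ok for ok in status.values()) <= max_lost
```

For the seed sequence described above, the key-site map would be `{53: "R", 55: "R", 61: "S", 62: "Y", 251: "Y"}`. Validating the reference residue at each position guards against an off-by-one numbering mismatch between the literature and the sequence file.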
[0042] Table 1. Sequence conservation analysis. Note: Afrotheria, African mammal (superorder); Fopius, genus Fopius (family Ichthyophthiriidae); Kineosphaera, genus Kineosphaera; Carangaria, group Carangaria; Thermodesulfobacteriales, order Thermodesulfobacteriales; A, alanine; L, leucine; I, isoleucine; F, phenylalanine; P, proline; R, arginine; G, glycine; √ indicates that the corresponding amino acid of the reference sequence is retained at this site; × indicates that the reference amino acid is not retained at this site.
[0043] (2) Structural visualization analysis (Figure 6): According to the literature on the crystal structure of the seed sequence VvCurA, the presence or absence of its key structural element, the Y62 loop, directly affects the enzyme's substrate binding ability and catalytic efficiency. In Figure 6, the structure of VvCurA (blue) is compared with that of the excluded sequence G3TCP8 (green), and the loop containing Tyr62 is highlighted in pink. The difference in this region indicates that the conserved substrate recognition loop is absent in G3TCP8. The predicted protein structures were examined with the PyMOL structural visualization software to check the integrity of the Y62 loop. If the Y62 loop is missing or structurally abnormal in the structure corresponding to a sequence, the sequence is judged to be a non-target sequence and is removed.
[0044] Through the above two analyses, a total of 5 non-target sequences (uniref90_g3tcp8, uniref90_a0a0c9qi54, uniref90_k6wz18, uniref90_a0a8d3a317 and uniref90_a0a524jp42) were removed, leaving 25 candidate sequences (including the two seed sequences).
[0045] 6. Dynamics simulation optimization and re-docking. On review of the molecular docking results, some enzyme structures were found to adopt unreasonable binding conformations with the curcumin substrate, judged against the reported PDB:5ZXU structure as a reference standard (PDB:5ZXU is the crystal structure of a known active tetrahydrocurcumin reductase in complex with its substrate). These protein-substrate complexes were optimized by short molecular dynamics simulations (GROMACS, simulation duration 20 ns). Simulating the motion of the protein in a solution environment adjusted the protein conformation, and cluster analysis was performed on the conformations along the trajectory. After optimization, the complexes were re-docked using AutoDock Vina to obtain more reasonable protein-substrate binding conformations, ensuring the accuracy of the subsequent binding energy calculations.
[0046] 7. Molecular dynamics simulations and binding energy calculations. Using the same simulation method described above, batch molecular dynamics simulations were performed on the 25 complexes. After the systems reached equilibrium, the binding free energy was calculated using the GB model in the gmx_MMPBSA tool. The basic principle is as follows. The binding free energy is calculated as $\Delta G_{\mathrm{bind}} = G_{\mathrm{complex}} - (G_{\mathrm{protein}} + G_{\mathrm{ligand}})$, where $\Delta G_{\mathrm{bind}}$ is the change in free energy upon binding of the protein and its ligand (the more negative the value, that is, the larger its absolute value, the more stable the binding); $G_{\mathrm{complex}}$ is the total free energy of the protein-ligand complex; $G_{\mathrm{protein}}$ is the free energy of the protein in its unbound state; and $G_{\mathrm{ligand}}$ is the free energy of the ligand in its unbound state.
[0047] Within the MM/GBSA framework, the free energy of each system consists of three parts: $G = E_{\mathrm{MM}} + G_{\mathrm{solvation}} - TS$. Here $E_{\mathrm{MM}}$ is the molecular mechanical energy, comprising the electrostatic energy $E_{\mathrm{ele}}$ and the van der Waals energy $E_{\mathrm{vdw}}$: $E_{\mathrm{MM}} = E_{\mathrm{ele}} + E_{\mathrm{vdw}}$. The van der Waals energy is calculated from the Lennard-Jones term for non-bonded interactions between atoms in the molecular force field, while the electrostatic energy is calculated using Coulomb's law with the force field parameters. These energies are extracted directly from the molecular dynamics trajectories, calculated separately for the complex, the protein, and the ligand, and averaged over the stable conformations at the end of the trajectory. $G_{\mathrm{solvation}}$ is the solvation free energy, comprising the polar solvation energy $G_{\mathrm{polar}}$ and the nonpolar solvation energy $G_{\mathrm{nonpolar}}$: $G_{\mathrm{solvation}} = G_{\mathrm{polar}} + G_{\mathrm{nonpolar}}$. The polar solvation energy is calculated using the Generalized Born (GB) model, which treats the solvent as a continuous medium and approximates the solution of the Poisson-Boltzmann equation to estimate the hydration free energy, reflecting the interaction between polar atoms and water. The nonpolar solvation energy is typically estimated from the solvent-accessible surface area (SASA): $G_{\mathrm{nonpolar}} = \gamma \cdot \mathrm{SASA} + b$, where $\gamma$ is the surface tension coefficient and $b$ is a constant; this term represents the hydrophobic effect. $TS$ is the entropy term, where $T$ is the temperature and $S$ is the entropy of the system (including conformational, rotational, and translational entropy). Calculating the entropy term typically requires normal mode analysis or the quasi-harmonic approximation, which is computationally very expensive. In relative comparisons with MM/GBSA (for example, screening candidate enzymes), the entropy term is often ignored on the assumption that its contribution is similar across systems, and only $\Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{solvation}}$ is used as an approximation of the binding free energy for ranking the candidate sequences. The final expression for the binding energy is therefore $\Delta G_{\mathrm{bind}} \approx \Delta E_{\mathrm{ele}} + \Delta E_{\mathrm{vdw}} + \Delta G_{\mathrm{polar}} + \Delta G_{\mathrm{nonpolar}}$. The binding free energy of each candidate sequence was obtained by statistical averaging over multiple conformations from the later, equilibrated stage of the trajectory.
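The arithmetic of the entropy-free MM/GBSA estimate can be sketched as follows. This is only an illustration of the formulas in this section, not a replacement for gmx_MMPBSA and not a parser of its output. The dictionary keys (`ele`, `vdw`, `polar`, `nonpolar`, `complex`, `protein`, `ligand`) and the `tail_fraction` parameter are hypothetical, and the choice of averaging over the last half of the frames is an assumption.

```python
from typing import Dict, List

TERMS = ("ele", "vdw", "polar", "nonpolar")
Components = Dict[str, float]


def frame_energy(components: Components) -> float:
    """G = E_ele + E_vdw + G_polar + G_nonpolar (entropy term omitted)."""
    return sum(components[t] for t in TERMS)


def binding_energy(frames: List[Dict[str, Components]],
                   tail_fraction: float = 0.5) -> float:
    """Average G_complex - (G_protein + G_ligand) over the last frames."""
    start = int(len(frames) * (1 - tail_fraction))
    tail = frames[start:]
    if not tail:
        raise ValueError("no frames to average")
    per_frame = [
        frame_energy(f["complex"])
        - (frame_energy(f["protein"]) + frame_energy(f["ligand"]))
        for f in tail
    ]
    return sum(per_frame) / len(per_frame)
```

Because the same four terms are summed for the complex, the protein, and the ligand, the result equals the sum of the four differences in the final expression above.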
[0048] 8. Sequence screening and experimental verification. Based on the calculated binding energy between each protein and the target substrate, the candidate protein sequences obtained through the preceding screening are further evaluated. Seed sequences known to have catalytic activity are used as references against which the binding energies of the candidate proteins are compared. When the binding energy between a candidate protein and the substrate is superior to that of the seed sequence, the candidate protein is judged to have stronger substrate binding ability and higher potential catalytic activity, and is retained as a target sequence. Candidate sequences that do not meet this binding energy criterion are eliminated, yielding a set of target protein sequences with preferred substrate binding characteristics. Specifically, according to the binding energy results shown in Table 2, sequences whose binding energy with the substrate curcumin is superior to that of the two seed sequences (EcCurA and VvCurA) are selected as target sequences.
[0049] Table 2. Binding energy between the proteins and the target substrate. Note: Halobium; Marine; Desulfofustis; Allochromatium; Tripterygion; Gadus; Tropicomonas; Enterobium; Escherichia coli; Katatospora; Rhodobacterales; Celulibacter; Lasallia; Paracoccus; Westerdykella; Mangrovicella; Vibrio; Acidocella; Rhodosalinus; Aeribacillus; Actinokineospor; Alkalimarina; Athyha; Rhodomicrobium.
[0050] As shown in Table 2, the lower the binding energy (the more negative the value), the stronger the binding between the protein and the substrate, and the higher the potential catalytic activity of the enzyme. Finally, 15 candidate sequences were screened, namely: uniref90_upi00053ab2d9, uniref90_a0a977iyz3, uniref90_upi000831b258, uniref90_a0a1w5d4m7, uniref90_a0a328zcc6, uniref90_a0a6a6jlu0, uniref90_upi000be6002a, uniref90_seedvibrio, uniref90_a0a258les4, uniref90_a0a365u866, uniref90_a0a653bla3, uniref90_upi00063f4c38, uniref90_a0a1h0ewy5, uniref90_upi0022322023, uniref90_a0a6j3d4t8, uniref90_upi000b4a7dea.
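The binding energy criterion can be sketched as a simple filter. This sketch assumes the strictest reading of the criterion, namely that a candidate must have a lower (more negative) binding energy than every seed sequence. The function name and the IDs and values in the usage note are placeholders, not data from Table 2.

```python
from typing import Dict, List, Sequence, Tuple


def select_candidates(energies: Dict[str, float],
                      seed_ids: Sequence[str]) -> List[Tuple[str, float]]:
    """Return non-seed sequences with binding energy below that of all
    seeds, sorted from most to least negative."""
    # Assumed criterion: better than the best (most negative) seed.
    cutoff = min(energies[s] for s in seed_ids)
    better = [
        (seq_id, e)
        for seq_id, e in energies.items()
        if seq_id not in seed_ids and e < cutoff
    ]
    return sorted(better, key=lambda item: item[1])
```

For example, with placeholder energies `{"seed_1": -30.0, "seed_2": -25.0, "cand_A": -35.5, "cand_B": -28.0}` and seeds `["seed_1", "seed_2"]`, only `cand_A` is retained. A looser criterion (better than at least one seed) would replace `min` with `max`.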
[0051] The activity of tetrahydrocurcumin reductase was assessed spectrophotometrically by monitoring NADPH consumption during the reduction reaction. The experiment was conducted at room temperature. Tetrahydrocurcumin reductase protein was pre-incubated with NADPH in the dark for 10 min, and the reaction was then initiated by adding curcumin. The total reaction volume was 200 μL, containing 100 mM phosphate buffer (pH 6.0), 5% DMSO, 150 μM NADPH, 5 μM curcumin, and 50 nM reductase protein. After initiation, the absorbance at 340 nm was recorded with a spectrophotometer every 30 s for 10 min. The initial reaction rate was calculated from the change in NADPH absorbance. The experiments validated 8 genes with relative activity superior to EcCurA and 5 genes with relative activity superior to VvCurA, as shown in Figure 7.
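The conversion from absorbance readings to an initial rate can be sketched as follows. This is a sketch under assumptions not stated in the assay description: the NADPH molar extinction coefficient at 340 nm is taken as the commonly used 6220 M⁻¹ cm⁻¹, the optical path length defaults to 1 cm (a 200 μL microplate well would need its own value), and the initial rate is fitted over the first five readings. The function name and parameters are hypothetical.

```python
from typing import Sequence


def initial_rate_um_per_min(times_s: Sequence[float], a340: Sequence[float],
                            path_cm: float = 1.0, eps: float = 6220.0,
                            n_points: int = 5) -> float:
    """NADPH consumption rate (uM/min) from a least-squares line through
    the first `n_points` A340 readings, via the Beer-Lambert law."""
    t = list(times_s[:n_points])
    a = list(a340[:n_points])
    if len(t) < 2 or len(t) != len(a):
        raise ValueError("need at least two matching time/absorbance points")
    mt = sum(t) / len(t)
    ma = sum(a) / len(a)
    # Least-squares slope in absorbance units per second.
    slope = (sum((ti - mt) * (ai - ma) for ti, ai in zip(t, a))
             / sum((ti - mt) ** 2 for ti in t))
    # A = eps * c * l, so dc/dt = slope / (eps * l); NADPH falls, hence the sign.
    return -slope / (eps * path_cm) * 1e6 * 60
```

Relative activity can then be expressed as the ratio of a candidate's rate to the rate of a seed enzyme measured under the same conditions.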
[0052] The number of devices and the processing scale described herein are given to simplify the description of the invention. Applications, modifications, and variations of the method and system for efficiently screening highly active isoenzymes of this invention will be readily apparent to those skilled in the art.
[0053] Although embodiments of the present invention have been disclosed above, they are not limited to the applications listed in the specification and embodiments. They can be applied to various fields suitable for the present invention. For those skilled in the art, other modifications can be easily made. Therefore, without departing from the general concept defined by the claims and their equivalents, the present invention is not limited to the specific details and illustrations shown and described herein.
Claims
1. A method for efficiently screening highly active isoenzymes, characterized in that it comprises: S1. using at least one seed enzyme sequence with a known function as the query sequence, performing homologous sequence retrieval and preliminary screening, and obtaining a preliminary screening sequence set after filtering by length and global sequence similarity; S2. performing multiple rounds of progressive hierarchical clustering on the preliminary screening sequence set to obtain a candidate enzyme sequence set; S3. performing a performance evaluation of the sequences in the candidate enzyme sequence set based on molecular simulation, and screening out candidate sequences with binding free energy superior to that of the seed enzyme as priority targets for experimental verification of highly active isoenzymes; wherein, in step S2, the multi-round progressive hierarchical clustering includes: each round of clustering uses a different clustering threshold, and the clustering threshold decreases from round to round; after each round of clustering, only the sequences within the cluster containing the seed enzyme sequence are passed to the next round of clustering; except for the last round, after each round of clustering, the central sequences of the clusters adjacent to the cluster containing the seed enzyme sequence are extracted as candidate enzyme sequences; in the first round of clustering, any cluster adjacent to the cluster containing the seed enzyme sequence whose number of sequences exceeds a preset size is re-clustered separately at a threshold below the first-round clustering threshold, and its central sequences are extracted as candidate enzyme sequences; in the final round of clustering, the central sequences of all clusters are extracted as candidate enzyme sequences; and all candidate enzyme sequences and the sequences within the cluster of the first-round seed enzyme sequences are merged to form the candidate enzyme sequence set.
2. The method for efficiently screening highly active isoenzymes as described in claim 1, characterized in that the preliminary screening in step S1 includes: setting a length screening range based on the amino acid length of the seed enzyme sequence and removing sequences whose length falls outside the length screening range, the length screening range being ±10% to ±20% of the amino acid length of the seed enzyme sequence; setting a similarity screening range based on global sequence similarity and retaining sequences whose similarity to the seed enzyme sequence is within the similarity screening range, the similarity screening range being 20% to 90%; and, for sequences from the same species, retaining only the sequence with the highest similarity to the seed enzyme sequence.
3. The method for efficiently screening highly active isoenzymes as described in claim 1, characterized in that the homologous sequence retrieval in step S1 uses the HMMER program to search the UniRef90 protein sequence database.
4. The method for efficiently screening highly active isoenzymes as described in claim 1, characterized in that the performance evaluation based on molecular simulation in step S3 specifically includes: S31. performing three-dimensional protein structure prediction on all candidate enzyme sequences in the candidate enzyme sequence set; S32. performing molecular docking between the predicted three-dimensional protein structure of each candidate enzyme sequence and the target substrate to obtain a preliminary candidate enzyme sequence-substrate complex structure for each candidate enzyme sequence; S33. performing molecular dynamics simulations to optimize the preliminary candidate enzyme sequence-substrate complex structure of each candidate enzyme sequence; S34. based on the optimized candidate enzyme sequence-substrate complex structure of each candidate enzyme sequence, calculating the binding free energy of each candidate enzyme sequence to the substrate using a free energy calculation method.
5. The method for efficiently screening highly active isoenzymes as described in claim 4, characterized in that in step S31, AlphaFold3 is used to predict the three-dimensional structure of the protein; in step S32, AutoDock Vina is used for molecular docking; in step S33, GROMACS is used for molecular dynamics simulation; and in step S34, the free energy calculation method is the MM/GBSA method.
6. The method for efficiently screening highly active isoenzymes as described in claim 4, characterized in that between steps S32 and S33, a conservation analysis and a structural rationality assessment are performed on the preliminary candidate enzyme sequence-substrate complex structures; candidate enzyme sequences with mutations or deletions in key catalytic sites, deletions of key structural domains, or structural abnormalities are eliminated to obtain reasonable preliminary candidate enzyme sequence-substrate complex structures; the conservation analysis is performed by analyzing, through multiple sequence alignment, the conservation of the known key catalytic amino acid sites of the seed enzyme in the candidate enzyme sequence; the structural rationality assessment is performed by analyzing, through three-dimensional structural visualization, the integrity of the known key structural domains of the seed enzyme in the predicted structure of the candidate enzyme sequence; and for candidate enzyme-substrate complex structures that do not pass the structural rationality assessment, step S32 is repeated after the molecular dynamics simulation optimization of step S33 to obtain new candidate enzyme-substrate complex structures, and step S34 is then executed.
7. The method for efficiently screening highly active isoenzymes as described in claim 1, characterized in that it further includes experimentally verifying the highly active isoenzymes selected in step S3, confirming their relative catalytic activity through enzyme activity assays, and obtaining the highly active isoenzymes.
8. A system for efficiently screening highly active isoenzymes, characterized in that it comprises: a homology retrieval and preliminary screening module, used to perform homologous sequence retrieval and preliminary screening using at least one seed enzyme sequence with a known function as the query sequence, and to obtain a preliminary screening sequence set after filtering by length and global sequence similarity; a progressive hierarchical clustering module, used to perform multiple rounds of progressive hierarchical clustering on the preliminary screening sequence set to obtain a candidate enzyme sequence set, wherein: each round of clustering uses a different clustering threshold, and the clustering threshold decreases from round to round; after each round of clustering, only the sequences within the cluster containing the seed enzyme sequence are passed to the next round of clustering; except for the last round, after each round of clustering, the central sequences of the clusters adjacent to the cluster containing the seed enzyme sequence are extracted as candidate enzyme sequences; in the first round of clustering, any cluster adjacent to the cluster containing the seed enzyme sequence whose number of sequences exceeds a preset size is re-clustered separately at a threshold below the first-round clustering threshold, and its central sequences are extracted as candidate enzyme sequences; in the final round of clustering, the central sequences of all clusters are extracted as candidate enzyme sequences; and all candidate enzyme sequences and the sequences within the cluster of the first-round seed enzyme sequences are merged and output as the candidate enzyme sequence set; a molecular simulation evaluation module, used to perform a performance evaluation of the sequences in the candidate enzyme sequence set based on molecular simulation, and to obtain the binding free energy of each candidate enzyme sequence to the target substrate; and a screening output module, used to screen out candidate sequences with binding free energy superior to that of the seed enzyme as priority targets for experimental verification of highly active isoenzymes; wherein the molecular simulation evaluation module includes: a structure prediction unit, used to predict the three-dimensional protein structure of all candidate enzyme sequences in the candidate enzyme sequence set; a molecular docking unit, used to perform molecular docking between the predicted three-dimensional protein structure of each candidate enzyme sequence and the target substrate to obtain a preliminary candidate enzyme sequence-substrate complex structure for each candidate enzyme sequence; a dynamics simulation unit, used to perform molecular dynamics simulations to optimize the preliminary candidate enzyme sequence-substrate complex structure of each candidate enzyme sequence; and a binding free energy calculation unit, used to calculate the binding free energy between each candidate enzyme sequence and the substrate based on the optimized candidate enzyme sequence-substrate complex structure using a free energy calculation method.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the program, it implements the method for efficiently screening highly active isoenzymes as described in any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the method for efficiently screening highly active isoenzymes as described in any one of claims 1 to 7.