Method and device for processing directed evolution data of enzyme, and storage medium

By integrating multi-dimensional enzyme directed evolution data and using ECD structures for structuring, the problems of incompatible data formats and lack of catalytic performance data in existing technologies have been solved, enabling efficient storage and retrieval of enzymes and improving the digital level of directed evolution technology.

CN115206417B · Active · Publication Date: 2026-06-19 · SHANGHAI SYNTHEALL PHARM CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANGHAI SYNTHEALL PHARM CO LTD
Filing Date: 2022-05-26
Publication Date: 2026-06-19

Abstract

This invention relates to a method and apparatus for processing directed evolution data of enzymes, and a computer-readable storage medium. The processing method includes the following steps: acquiring retrieval information about a target enzyme; determining a corresponding candidate set based on the retrieval information, wherein the candidate set includes at least one data structure, and the data structure includes at least a directed evolution data field and a comparison information field; determining the similarity score between the retrieval information and the comparison information field of each data structure in the candidate set; selecting at least one data structure similar to the target enzyme from the candidate set according to a preset similarity threshold; and outputting the directed evolution data field and / or comparison information field of each data structure similar to the target enzyme.

Description

Technical Field

[0001] This invention relates to directed evolution technology of enzymes, and more particularly to a method for processing directed evolution data of enzymes, an apparatus for processing directed evolution data of enzymes, and a computer-readable storage medium.

Background Technology

[0002] Enzymes are macromolecules in living organisms that perform catalytic functions. Chemically, they are typically proteins, ribonucleic acid (RNA), or complexes of these with small organic molecules and metal ions. Enzyme-catalyzed reactions often require fewer steps than purely chemical organic synthesis and achieve higher atom economy and yield. Furthermore, enzymes themselves are degradable and obtainable from the biological world, making them renewable resources. Directed evolution of enzymes involves using molecular biology techniques to introduce one or more differences into the enzyme's molecular structure, that is, replacing one or more of the amino acid residues that make up the protein with different amino acids. Enzymes with one or more such alterations are called mutants. Technicians apply different mutations to a pre-existing enzyme to obtain a large number of mutant combinations. These combinations are then screened under specific reaction conditions to identify progeny enzymes whose performance is improved relative to the pre-existing enzyme. By repeating this mutation and screening process over a limited number of iterations, performance improvements accumulate, yielding progeny enzymes that perform significantly better than the earliest pre-existing enzyme (i.e., the ancestral enzyme) and that are suitable for non-natural substrates and reaction conditions.

[0003] The directed evolution process requires constructing and screening a large number of mutants: the greater the number, the greater the probability of obtaining mutants with improved performance. Existing directed evolution techniques therefore generally require repeated and extensive mutation and screening, resulting in long evolutionary cycles and high costs. To overcome this limitation, some directed evolution techniques based on digital methods have been proposed. They enable limited visualization based on existing data structures such as SMILES, InChI, mol2, SDF, FASTA, GenBank, PDB, and mmCIF, which facilitates intuitive understanding and sharing by technical personnel. However, these existing data structures are generally built around single-dimensional data focused on small-molecule ligands, enzyme sequences, or enzyme structures, resulting in problems such as limited data content and incompatible formats. Furthermore, they lack catalytic performance data such as catalytic activity, specificity, and tolerance, as well as data characterizing the bioinformatics relationships between enzymes, both of which are essential for directed evolution research. Therefore, they cannot meet the requirements of directed evolution techniques for large-scale storage and big data retrieval, which limits the digital development of directed evolution techniques.

[0004] In order to overcome the above-mentioned defects in the existing technology, there is an urgent need in the field for a technology for processing enzyme directed evolution data, which can be used to integrate multi-dimensional data involved in the directed evolution process of enzymes, characterize the catalytic performance of enzymes, and achieve efficient storage and retrieval of this multi-dimensional data.

Summary of the Invention

[0005] The following provides a brief overview of one or more aspects to offer a basic understanding of them. This overview is not an exhaustive summary of all conceived aspects, nor is it intended to identify key or decisive elements of all aspects, nor to define the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed descriptions that follow.

[0006] To overcome the aforementioned deficiencies in the prior art, the present invention provides a method for processing directed evolution data of enzymes, a device for processing directed evolution data of enzymes, and a computer-readable storage medium, which can integrate multi-dimensional data involved in the directed evolution process of enzymes, characterize the catalytic performance of enzymes, and achieve efficient storage and retrieval of this multi-dimensional data.

[0007] Specifically, the method for processing directed evolution data of the enzyme according to the first aspect of the present invention includes the following steps: obtaining retrieval information about the target enzyme; determining a corresponding candidate set based on the retrieval information, wherein the candidate set includes at least one data structure, the data structure including at least a directed evolution data field and a comparison information field; determining the similarity score between the retrieval information and the comparison information field of each data structure in the candidate set; selecting at least one data structure similar to the target enzyme from the candidate set according to a preset similarity threshold; and outputting the directed evolution data field and / or comparison information field of each data structure similar to the target enzyme.
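The steps of [0007] can be sketched as a short retrieval loop. This is only an illustrative sketch: the `DataStructure` class, its attribute names, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions made for the example, since the text does not prescribe a particular metric here. The "greater than or equal to" comparison follows [0011].

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class DataStructure:
    """Hypothetical record: one comparison field plus the evolution data."""
    comparison_info: str
    directed_evolution_data: dict


def similarity(query: str, candidate: str) -> float:
    # Stand-in score in [0, 1]; an assumption, not the patent's metric.
    return SequenceMatcher(None, query, candidate).ratio()


def retrieve(query, candidate_set, threshold):
    """Score every candidate and keep those at or above the threshold."""
    hits = []
    for item in candidate_set:
        score = similarity(query, item.comparison_info)
        if score >= threshold:
            hits.append((score, item))
    # Highest-scoring matches first.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


# Toy candidate set; sequences and aliases are invented for illustration.
candidates = [
    DataStructure("MKTAYIAKQR", {"alias": "enzyme-A"}),
    DataStructure("MKTAYIAKQK", {"alias": "enzyme-B"}),
    DataStructure("GGGGGGGGGG", {"alias": "enzyme-C"}),
]
results = retrieve("MKTAYIAKQR", candidates, threshold=0.8)
for score, item in results:
    print(round(score, 2), item.directed_evolution_data["alias"])
```

With this toy data, the exact match and the single-residue variant pass the threshold, while the unrelated sequence is filtered out.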

[0008] Furthermore, in some embodiments of the present invention, the retrieval information includes sequence content containing a specific sequence of the target enzyme. The candidate set includes a sequence candidate set, which includes at least one sequence node structure. The sequence node structure includes at least a directed evolution data field and an alignment sequence field, each alignment sequence field recording an enzyme sequence. The step of determining the similarity score between the retrieval information and the comparison information field of each data structure in the candidate set includes: performing a multiple sequence alignment operation on the sequence content of the target enzyme and the alignment sequence field of each sequence node structure in the sequence candidate set to determine the similarity score between the sequence content of the target enzyme and each alignment sequence field.

[0009] Furthermore, in some embodiments of the present invention, the retrieval information includes structural content containing a specific structure of the target enzyme. The candidate set includes a structure candidate set, which includes at least one structure node structure. The structure node structure includes at least a directed evolution data field and an alignment structure field, each alignment structure field recording an enzyme structure. The step of determining the similarity score between the retrieval information and the comparison information field of each data structure in the candidate set includes: performing a structure alignment operation on the structural content of the target enzyme and the alignment structure field of each structure node structure in the structure candidate set to determine the similarity score between the structural content of the target enzyme and each alignment structure field.

[0010] Further, in some embodiments of the present invention, the retrieval information includes molecular information about a specific ligand of the target enzyme. The candidate set includes a ligand candidate set, which includes at least one ligand node structure. The ligand node structure includes at least a directed evolution data field and a comparison ligand field, each of the comparison ligand fields recording molecular information of a ligand. The step of determining the similarity score between the retrieval information and the comparison information fields of each of the data structures in the candidate set includes: performing a molecular similarity calculation on the molecular information of the specific ligand of the target enzyme and the comparison ligand fields of each of the ligand node structures in the ligand candidate set to determine the similarity score between the molecular information of the specific ligand of the target enzyme and each of the comparison ligand fields.
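As a rough illustration of the molecular similarity calculation in [0010], the sketch below computes a Tanimoto coefficient over feature sets. The text does not name a specific similarity measure, so both the Tanimoto coefficient and the character n-gram "fingerprint" are assumptions for this example; a practical system would more likely derive features from a cheminformatics toolkit than from raw SMILES characters.

```python
def ngrams(smiles: str, n: int = 2) -> set:
    """Toy feature set: all character n-grams of a SMILES string."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}


def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 1.0  # two empty feature sets are treated as identical
    return len(a & b) / len(a | b)


query = ngrams("CCO")            # ethanol
ligands = {
    "ethylamine": ngrams("CCN"),
    "acetic acid": ngrams("CC(=O)O"),
}
scores = {name: tanimoto(query, features) for name, features in ligands.items()}
print(scores)
```

The resulting scores would then be compared with the preset similarity threshold in the same way as sequence or structure scores.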

[0011] Furthermore, in some embodiments of the present invention, the data structure further includes a similarity score field. The processing method further includes the following steps: comparing and determining the similarity score and the similarity threshold; in response to the similarity score being greater than or equal to the similarity threshold, recording the similarity score in the similarity score field of the corresponding data structure; and outputting the similarity score fields of each data structure similar to the target enzyme.

[0012] Furthermore, in some embodiments of the present invention, before determining the corresponding candidate set based on the retrieval information, the processing method further includes the following steps: constructing a database storing directed evolution data of at least one enzyme; defining at least one data structure according to the supported retrieval information types, wherein the data structure includes at least one of a sequence node structure, a structure node structure, and a ligand node structure; constructing at least one candidate set to accommodate the data structure, wherein each candidate set accommodates one type of data structure; traversing the database according to the comparison information field corresponding to each of the retrieval information types, and creating a data structure for each piece of data in the database that involves that comparison information field; and adding the created data structure to the corresponding candidate set according to the retrieval information type.
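The offline preparation in [0012] can be sketched as a single pass over the database that fills one candidate set per retrieval information type. The record layout, the key names in `COMPARISON_FIELDS`, and the node dictionary shape are hypothetical choices for this sketch and are not taken from the patent.

```python
from collections import defaultdict

# Hypothetical mapping: retrieval information type -> record key that
# holds the corresponding comparison information.
COMPARISON_FIELDS = {
    "sequence": "sequence",
    "structure": "structure",
    "ligand": "ligand",
}


def build_candidate_sets(database):
    """Create one node per (record, comparison field) and group by type."""
    candidate_sets = defaultdict(list)
    for record in database:
        for info_type, key in COMPARISON_FIELDS.items():
            if key in record:
                node = {
                    "comparison_info": record[key],
                    "directed_evolution_data": record,
                }
                candidate_sets[info_type].append(node)
    return dict(candidate_sets)


# Toy database; identifiers and contents are invented for illustration.
database = [
    {"uid": "E1", "sequence": "MKTAYIAKQR", "ligand": "CCO"},
    {"uid": "E2", "sequence": "MKTAYIAKQK"},
    {"uid": "E3", "structure": "placeholder structure text"},
]
candidate_sets = build_candidate_sets(database)
print({name: len(nodes) for name, nodes in candidate_sets.items()})
```

Each candidate set holds only one type of node, so a query of a given type only needs to scan the matching set.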

[0013] Furthermore, in some embodiments of the present invention, the data structure further includes a unique identifier field. Before determining the corresponding candidate set based on the retrieval information, the processing method further includes the following steps: traversing the database according to the unique identifier field, and creating the data structure for each piece of data in the database; and, for data structures that do not involve the comparison information field, determining the content of the comparison information field based on the content recorded in the directed evolution data field, and / or outputting a prompt to supplement the content of the comparison information field.

[0014] Furthermore, in some embodiments of the present invention, the directed evolution data field includes at least one of the following: alias field, wild ancestor field, expression system field, unique number field, scientific name field, remarks field, biological origin field, parental field, catalytic performance field, reference field, enzyme sequence field, and enzyme structure field.

[0015] Furthermore, in some embodiments of the present invention, the expression system field adopts an expression system structure, which includes at least one of the following: expression host field, remarks field, enzyme-bearing gene field, and unique number field. Additionally, the catalytic performance field adopts a catalytic performance structure, which includes at least one of the following: reaction conditions field, experiment number field, reaction product field, reaction reagent field, and reaction substrate field. Furthermore, the reference field adopts a reference structure, which includes at least one of the following: citation source field, publication date field, remarks field, citation URI field, and citation title field. Furthermore, the enzyme sequence field adopts an enzyme sequence structure, which includes at least one of the following: GenBank accession number field, GI number field, mutation field, remarks field, citation URI field, sequence content field, sequence URI field, sequence category field, and UniProt ID field. In addition, the enzyme structure field adopts an enzyme structure structure, which includes at least one of the following: ligand field, mutation field, remarks field, citation URI field, sequence URI field, structure URI field, structure content field, and structure category field.

[0016] Furthermore, in some embodiments of the present invention, the reaction conditions field adopts a reaction condition structure, which includes at least one of a relative humidity field, a pH field, a reaction time field, a reactor field, and a reaction temperature field. Additionally, the reaction product field adopts a reaction product structure, which includes at least one of a reaction conversion rate field, a product diastereomer excess value field, a product diastereomer ratio field, a product enantiomer excess value field, an enantiomer selectivity field, a product enantiomer ratio field, a product molecule field, a product purity field, a product positional isomer excess value field, a product positional isomer ratio field, an isolated yield field, and an in-situ yield field. Furthermore, the reaction reagent field consists of at least one reaction reagent structure, each of which corresponds to a reaction reagent and includes at least one of an addition method field, a dilution method field, an amount added field, and a reagent molecule field. Furthermore, the reaction substrate field consists of at least one reaction substrate structure, each of which corresponds to a reaction substrate and includes at least one of an addition method field, a dilution method field, an amount added field, and a substrate molecule field.

[0017] Furthermore, in some embodiments of the present invention, the reactor field adopts a reactor structure, which includes at least one of a stirring field, a diameter field, a height field, and a shape field. Additionally, at least one of the product molecule field, the reagent molecule field, and the substrate molecule field adopts a ligand structure, which includes at least one of a CAS number field, an InChI field, a scientific name field, a SMILES field, and a chemical structure field. Furthermore, the addition method field adopts an addition method structure, which includes at least one of a method field, a rate field, and a time point field. Furthermore, the dilution method field adopts a dilution structure, which includes at least a diluent field and / or a diluent dosage field, wherein the diluent field adopts the ligand structure. Furthermore, the amount added field adopts a physical quantity structure, which includes at least one of a lower limit field, a target value field, a unit field, and an upper limit field.

[0018] Furthermore, in some embodiments of the present invention, the stirring field adopts a stirring structure, which includes at least one of a stirring amplitude field, a stirring method field, and a stirring speed field.

[0019] Furthermore, in some embodiments of the present invention, the mutation field consists of at least one mutation structure, each mutation structure corresponding to a mutation, wherein it includes at least one of a mutation sequence fragment field, a mutation location field, a template sequence fragment field, and a mutation type field.

[0020] Furthermore, in some embodiments of the present invention, each of the fields is composed of interrelated field keys and field values, wherein the field key indicates the field type and the field value indicates the content corresponding to the field type.
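To make the key/value composition of [0020] concrete, the sketch below writes a small nested record as a JSON-compatible dictionary and lists every leaf field as a key path with its value. The keys `alias`, `enzymeSequence`, `sequenceContent`, and `mutations` are hypothetical spellings chosen for this example, and the sequence is a toy value.

```python
import json

record = {
    "alias": ["example enzyme"],              # hypothetical key
    "enzymeSequence": {                        # hypothetical key
        "sequenceContent": "MKTAYIAKQR",      # toy sequence
        "mutations": [
            {
                "type": "peptide",
                "position": 3,
                "templateMotif": "T",
                "mutationMotif": "S",
            },
        ],
    },
}


def leaf_paths(node, prefix=""):
    """Yield (path, value) for every leaf field of a nested record."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from leaf_paths(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


flat = dict(leaf_paths(record))
for path, value in flat.items():
    print(path, "=", value)

# The record survives a JSON round trip unchanged.
restored = json.loads(json.dumps(record))
```

Because every field is a key paired with a value, nested structures such as the mutation list can be stored and traversed without a fixed table schema.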

[0021] Furthermore, the apparatus for processing directed evolution data of the enzymes described above, provided according to a second aspect of the present invention, includes a memory and a processor. The processor is connected to the memory and configured to implement the method for processing directed evolution data of the enzymes described above, provided according to a first aspect of the present invention.

[0022] Furthermore, the computer-readable storage medium provided according to the third aspect of the present invention stores computer instructions thereon. When the computer instructions are executed by a processor, the method for processing directed evolution data of enzymes provided in the first aspect of the present invention is implemented.

Attached Figure Description

[0023] The above-described features and advantages of the present invention will be better understood after reading the following detailed description of embodiments of the present disclosure in conjunction with the accompanying drawings. In the drawings, components are not necessarily drawn to scale, and components having similar related characteristics or features may have the same or similar reference numerals.

[0024] Figure 1 shows a flowchart of a method for storing directed evolution data of enzymes according to some embodiments of the present invention.

[0025] Figure 2 shows a schematic diagram of the ECD structure of directed evolution data of an enzyme according to some embodiments of the present invention.

[0026] Figure 3 shows a schematic diagram of a non-relational database of directed evolution data of enzymes according to some embodiments of the present invention.

[0027] Figure 4 shows a flowchart of a method for processing directed evolution data of enzymes according to some embodiments of the present invention.

[0028] Figure 5 shows a flowchart of a method for processing directed evolution data of enzymes according to some embodiments of the present invention.

[0029] Figure 6 shows a flowchart of a method for processing directed evolution data of enzymes according to some embodiments of the present invention.

Detailed Implementation

[0030] The following specific embodiments illustrate the implementation of the present invention. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. Although the description of the present invention is presented in conjunction with preferred embodiments, this does not mean that the features of the invention are limited to these embodiments. On the contrary, the purpose of describing the invention in conjunction with embodiments is to cover other options or modifications that may be derived based on the claims of the present invention. To provide a thorough understanding of the invention, many specific details will be included in the following description. The invention may also be implemented without using these details. Furthermore, to avoid confusion or obscuring the focus of the invention, some specific details will be omitted in the description.

[0031] It is understood that although terms such as "first," "second," and "third" may be used herein to describe various components, regions, layers, and / or parts, these components, regions, layers, and / or parts should not be limited by these terms, and these terms are only used to distinguish different components, regions, layers, and / or parts. Therefore, the first components, regions, layers, and / or parts discussed below may be referred to as second components, regions, layers, and / or parts without departing from some embodiments of the present invention.

[0032] As mentioned above, existing data structures such as SMILES, InChI, mol2, SDF, FASTA, GenBank, PDB, and mmCIF are generally built around single-dimensional data for small-molecule ligands, enzyme sequences, or enzyme structures. They suffer from problems such as limited data content and incompatible formats. Furthermore, they lack catalytic performance data such as catalytic activity, specificity, and tolerance, as well as data characterizing the bioinformatics relationships between enzymes, both of which are required for directed evolution research. Therefore, they cannot meet the requirements of directed evolution technology for large-scale storage and big data retrieval, which limits the digital development of directed evolution technology for enzymes.

[0033] To overcome the aforementioned deficiencies in the prior art, the present invention provides a method for processing directed evolution data of enzymes, a device for processing directed evolution data of enzymes, and a computer-readable storage medium, which can integrate multi-dimensional data involved in the directed evolution process of enzymes, characterize the catalytic performance of enzymes, and achieve efficient storage and retrieval of this multi-dimensional data.

[0034] In some non-limiting embodiments, the processing method provided in the first aspect of the present invention can be implemented by the processing apparatus provided in the second aspect of the present invention. Specifically, the processing apparatus includes a memory and a processor. The memory includes, but is not limited to, the computer-readable storage medium provided in the third aspect of the present invention, on which computer instructions are stored. The processor is connected to the memory and configured to execute the computer instructions stored in the memory to implement the processing method for directed evolution data of enzymes provided in the first aspect of the present invention.

[0035] Furthermore, the processing method provided in the first aspect of the present invention can be implemented in two stages: offline storage and online retrieval. Correspondingly, the processing apparatus provided in the second aspect of the present invention can also consist of a storage device and a retrieval device. In some embodiments, the storage device and the retrieval device can be integrated into the same hardware device. Optionally, in other embodiments, the storage device and the retrieval device can be distributed across multiple different hardware devices, and these multiple different hardware devices cooperate with each other to implement the processing method provided in the first aspect of the present invention.

[0036] The working principle of the above-described storage device and storage medium will first be described with reference to some embodiments of storage methods. Those skilled in the art will understand that these storage methods are merely non-limiting embodiments provided by the present invention, intended to clearly demonstrate the main concept of the invention and provide specific solutions convenient for public implementation, rather than limiting all functions or all operating methods of the above-described storage device and storage medium. Similarly, the storage device and storage medium are also merely a non-limiting embodiment provided by the present invention, and do not constitute a limitation on the entities performing the steps in these storage methods.

[0037] Please refer to Figure 1, which shows a flowchart of a method for storing directed evolution data of enzymes according to some embodiments of the present invention.

[0038] As shown in Figure 1, in the process of storing directed evolution data of enzymes, the storage device can first acquire directed evolution data of at least one enzyme to construct a database storing the directed evolution data of that at least one enzyme. Here, the directed evolution data includes, but is not limited to, multi-dimensional related data such as the enzyme's alternative names, wild ancestor information, natural host information, DNA information of the gene carrying the enzyme, scientific name of the enzyme, biological origin information, evolutionary parent information, catalytic performance data, references or cited materials describing the enzyme, enzyme sequence data, and enzyme structural data.

[0039] Specifically, in constructing a directed evolution database for enzymes, the storage device can first perform structured processing on the acquired directed evolution data to determine at least one directed evolution data field for the target enzyme. In some embodiments, the above-mentioned structured processing can be based on a pre-built ECD (EvoCloud Droplet) structure. Please refer to Figure 2, which shows a schematic diagram of the ECD structure of directed evolution data of an enzyme according to some embodiments of the present invention.

[0040] As shown in Figure 2, in the process of structuring enzyme directed evolution data, the storage device can first fill the enzyme directed evolution data into the [A1~A12] fields according to the specified data types based on the pre-constructed data structure, thereby constructing an ECD structure for the directed evolution data of each enzyme, so as to integrate the multi-dimensional data involved in the directed evolution process of enzymes, characterize the catalytic performance of enzymes, and realize the efficient storage and retrieval of this multi-dimensional data.

[0041] The data types and data structures involved in this ECD structure are described in the [A0] field below.

[0042] [A0-1] String: A general concept in electronic computer systems.

[0043] [A0-2] Integer: A general concept in electronic computer systems.

[0044] [A0-3] Floating-point number: A general concept in electronic computer systems.

[0045] [A0-4] Date and time: A general concept in electronic computer systems.

[0046] [A0-5] Charging structure (the addition method structure): Describes how a given material is added in the reaction used to measure the catalytic performance of an enzyme, and includes the following member fields:

[0047] [A0-5-1] Method field: Describes the feeding method. Its data type is string, and the allowed value is one of the string enumeration values ["continuous feeding", "one time charging", "other", "portionwise charging"], which respectively represent continuous feeding, one-time feeding, other, and batch feeding.

[0048] [A0-5-2] Speed field: Describes the feeding rate; its data type is a physical quantity structure. If the method field is "continuous feeding", the unit field can be one of the string enumeration values ["L/h", "mL/h", "mL/min", "mL/s", "VVH", "VVM", "VVS"]. If the method field is "portionwise charging", the unit field can be one of the string enumeration values ["L/time", "mL/time", "V/time"].

[0049] [A0-5-3] The `timePoints` field describes the time points when feeding is performed. Its data type is an array of several physical quantity structures. If the `method` field is set to "continuous feeding" or "one time charging", the `timePoints` field contains only one member, the feeding start time. If the `method` field is set to "portionwise charging", the `timePoints` field can contain multiple members, each representing the time point of each feeding operation. The unit field for each member in the `timePoints` field can have a value from the string enumeration ["day", "h", "min", "s"].
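The constraints on the charging structure in [A0-5-1] to [A0-5-3] lend themselves to a small validation routine. The sketch below is one possible reading: the keys `method`, `timePoints`, and `unit` follow the text, while the `target` key and the dictionary representation of a physical quantity are assumptions for this example. Speed units are not checked here to keep the sketch short.

```python
METHODS = {"continuous feeding", "one time charging", "other", "portionwise charging"}
TIME_UNITS = {"day", "h", "min", "s"}
SINGLE_POINT_METHODS = {"continuous feeding", "one time charging"}


def validate_charging(charging: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    method = charging.get("method")
    if method not in METHODS:
        problems.append(f"unknown method: {method!r}")

    time_points = charging.get("timePoints")
    if time_points is not None:
        # Continuous and one-time feeding carry only the start time.
        if method in SINGLE_POINT_METHODS and len(time_points) != 1:
            problems.append("expected exactly one time point (the feeding start time)")
        for point in time_points:
            if point.get("unit") not in TIME_UNITS:
                problems.append(f"invalid time unit: {point.get('unit')!r}")
    return problems


portionwise = {
    "method": "portionwise charging",
    "timePoints": [{"target": 0, "unit": "h"}, {"target": 2, "unit": "h"}],
}
print(validate_charging(portionwise))
```

Returning a list of problems rather than raising on the first one lets a storage pipeline report every issue in a record at once.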

[0050] [A0-6] Dilution structure: Describes the solution composition of a material used in a reaction that measures the catalytic performance of an enzyme, when the material is used in solution form. It includes the following member fields:

[0051] [A0-6-1] Solvent field: Describes the solvent or diluent of the solution, and its data type is ligand structure.

[0052] [A0-6-2] Loading field: Describes the amount of solvent or diluent used. Its data type is a physical quantity structure, and its unit field can be one of the string enumeration values ["L", "mL", "V", "μL"].

[0053] [A0-7] Ligand structure: describes information about a small molecule, such as its structure, composition, and index conforming to general rules. It contains the following member fields:

[0054] [A0-7-1] CAS field: Describes the CAS Registry Number of a substance. This number is assigned to each registered substance by the Chemical Abstracts Service (CAS), a division of the American Chemical Society. It is widely used to identify chemicals, especially those intended for sale and production. The data type is string.

[0055] [A0-7-2] inChI field: Describes a substance's InChI (International Chemical Identifier), a standardized text string developed jointly by the International Union of Pure and Applied Chemistry (IUPAC) and the National Institute of Standards and Technology (NIST) to identify a chemical compound from its structure. The data type is string.

[0056] [A0-7-3] The `iupacName` field describes the scientific name of a substance according to the systematic nomenclature rules. This nomenclature is defined by the International Union of Pure and Applied Chemistry (IUPAC) and is a method for systematically naming chemical substances, specifying chemical terms from organic to inorganic, from molecules to polymers, and in various other aspects. The data type is string.

[0057] [A0-7-4] The `smiles` field describes a SMILES (Simplified Molecular Input Line Entry System) string for a substance, a specification for unambiguously describing molecular structure using ASCII strings. SMILES strings can be imported by most molecular editing software and converted into two-dimensional drawings or three-dimensional molecular models, and are one of the common ways of handling the chemical composition of small molecules in current computer systems. The data type is string.

[0058] [A0-7-5] Structure field: A mol2 format text describing the chemical structure of a substance, including the elemental composition, interconnections, valence states, charges, and three-dimensional coordinates of the atoms contained in its molecule. Biochemical calculation software such as SYBYL and Discovery Studio typically use the mol2 format to store the chemical information of small molecules. The data type is string.

[0059] [A0-8] Mutation structure: Describes a difference in an amino acid or base sequence relative to its natural ancestor at some point, i.e., a mutation, and contains the following member fields:

[0060] [A0-8-1] MutationMotif field: Describes a fragment of the sequence at the location of the mutation, and its data type is string. If the type field is "nucleotide", the mutationMotif field allows one or more string enumeration values ​​["A", "C", "G", "T"] in one or three combinations, representing adenine deoxyribonucleotide, cytosine deoxyribonucleotide, guanine deoxyribonucleotide, and thymine deoxyribonucleotide, respectively. If the type field is "peptide", the mutationMotif field can be one of the string enumeration values ​​["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"], representing alanine residue, cysteine ​​residue, aspartic acid residue, glutamic acid residue, phenylalanine residue, glycine residue, histidine residue, isoleucine residue, lysine residue, leucine residue, methionine residue, asparagine residue, proline residue, glutamine residue, arginine residue, serine residue, threonine residue, valine residue, tryptophan residue, and tyrosine residue.

[0061] [A0-8-2] Position field: Describes the position of the mutation location relative to the entire sequence. Its data type is integer, and positive integers are allowed.

[0062] [A0-8-3] TemplateMotif field: Describes the sequence fragment at the corresponding site in the natural ancestor; its data type is string. If the type field is "nucleotide", the templateMotif field allows a combination of one or three of the string enumeration values ["A", "C", "G", "T"], which represent adenine deoxyribonucleotide, cytosine deoxyribonucleotide, guanine deoxyribonucleotide, and thymine deoxyribonucleotide, respectively. If the type field is "peptide", the templateMotif field allows one of the string enumeration values ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"], which represent alanine residue, cysteine residue, aspartic acid residue, glutamic acid residue, phenylalanine residue, glycine residue, histidine residue, isoleucine residue, lysine residue, leucine residue, methionine residue, asparagine residue, proline residue, glutamine residue, arginine residue, serine residue, threonine residue, valine residue, tryptophan residue, and tyrosine residue, respectively.

[0063] [A0-8-4] Type field: Describes the type of mutation. Its data type is string, and the allowed value is one of the string enumeration values ​​["nucleotide", "peptide"], which represent mutations in the base sequence and mutations in the amino acid sequence, respectively.
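The mutation structure described in [A0-8-1] to [A0-8-4] can be illustrated with a minimal Python sketch. The class name `Mutation` and the method `validate` are hypothetical and do not appear in the source. Reading "one or three" as a motif length of one base or one codon is also an assumption.

```python
from dataclasses import dataclass

NUCLEOTIDES = set("ACGT")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Mutation:
    """Hypothetical container mirroring fields [A0-8-1] to [A0-8-4]."""
    mutationMotif: str
    position: int
    templateMotif: str
    type: str  # "nucleotide" or "peptide"

    def validate(self) -> None:
        """Raise ValueError if a field violates the constraints described above."""
        if self.type == "nucleotide":
            # Assumption: a motif is a single base or a three-base codon.
            alphabet, lengths = NUCLEOTIDES, {1, 3}
        elif self.type == "peptide":
            alphabet, lengths = AMINO_ACIDS, {1}
        else:
            raise ValueError(f"unknown mutation type: {self.type!r}")
        if self.position < 1:
            raise ValueError("position must be a positive integer")
        for motif in (self.mutationMotif, self.templateMotif):
            if len(motif) not in lengths or not set(motif) <= alphabet:
                raise ValueError(f"invalid motif {motif!r} for type {self.type!r}")
```

For instance, `Mutation("R", 42, "K", "peptide").validate()` passes silently, while a motif outside the enumeration raises `ValueError`.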

[0064] [A0-9] Physical Quantity Structure: Describes a physical quantity. Scientifically, a physical quantity consists of a numerical value and a unit. Furthermore, in engineering, since absolute precision is impossible in any measurement or instrument setting, it is necessary to specify the minimum and maximum permissible deviations of the physical quantity from the target value as a process parameter.

[0065] [A0-9-1] Lower Limit (lowerLimit) field: The lower limit of allowed values; its data type is floating point.

[0066] [A0-9-2] Target Value field: The set target value, whose data type is floating point.

[0067] [A0-9-3] Unit field: The unit of the physical quantity, which is generally the basic unit specified by IUPAP or the product of their finite powers. Its data type is string.

[0068] [A0-9-4] UpperLimit field: The upper limit of allowed values, and its data type is floating point.
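As a minimal sketch of the physical quantity structure, the following Python dataclass groups the four member fields. The class name `PhysicalQuantity`, the field name `targetValue`, and both helper methods are assumptions. The source names the fields but does not prescribe any checks.

```python
from dataclasses import dataclass


@dataclass
class PhysicalQuantity:
    """Hypothetical container mirroring fields [A0-9-1] to [A0-9-4]."""
    lowerLimit: float
    targetValue: float  # assumed identifier for the "Target Value" field
    unit: str
    upperLimit: float

    def is_consistent(self) -> bool:
        """Assumed sanity check: the target lies between the two limits."""
        return self.lowerLimit <= self.targetValue <= self.upperLimit

    def accepts(self, measured: float) -> bool:
        """Return True if a measured value falls within the allowed range."""
        return self.lowerLimit <= measured <= self.upperLimit
```

A reaction temperature of 30 ℃ with a tolerance of one degree could then be written as `PhysicalQuantity(29.0, 30.0, "℃", 31.0)`.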

[0069] Furthermore, based on the definition of the [A0] field above, the [A1~A12] fields of the ECD structure can be defined as follows.

[0070] [A1] Alias ​​field: Describes an alternative name for the enzyme. Its data type is an array of strings. Enzymes can usually have multiple aliases, used for abbreviation in literature or as trade names, etc.

[0071] [A2] Wild Ancestor (ancestorId) field: Describes the unique identifier of the enzyme's wild ancestor in the storage system; its data type is string.

[0072] [A3] ExpressionSystem field: Describes the enzyme's expression system. Its data type is an expressionSystem structure, which contains the following member fields:

[0073] [A3-1] Host field: Describes the expression host of the enzyme, and its data type is string. The source of the enzyme can be its natural source organism, i.e., the natural host, or an engineered host such as recombinant cells used for molecular cloning or overexpression.

[0074] [A3-2] Note field: Note information, its data type is string.

[0075] [A3-3] Vector field: Describes the DNA used to carry the enzyme gene; its data type is string. In different hosts, the enzyme gene can be integrated into the cell's chromosome or chromatin DNA, i.e., in the genome, or it can be embedded in small DNA independent of the genome, such as plasmids or organelle DNA.

[0076] [A4] id field: Describes the enzyme's unique identifier in the storage system; its data type is string.

[0077] [A5] Name field: Describes the scientific name of the enzyme; its data type is string.

[0078] [A6] Note field: Note information, whose data type is string.

[0079] [A7] Organism field: Describes the biological origin of the enzyme. Its data type is string. It is usually the organism from which the enzyme was first discovered or isolated, and follows the scientific name of the binomial nomenclature commonly used in modern biological taxonomy.

[0080] [A8] ParentId field: A unique identifier describing the evolutionary parent of the enzyme in the storage system. Its data type is string.

[0081] [A9] Catalytic Performance (performances) field: Describes the enzyme's catalytic performance data. Its data type is an array of several catalytic performance (performance) structures. Each catalytic performance (performance) structure represents one piece of enzyme reaction performance data and contains the following member fields:

[0082] [A9-1] The `conditions` field describes the conditions and parameters of the reaction that measure the catalytic performance of an enzyme. Its data type is a `conditions` structure, containing the following member fields:

[0083] [A9-1-1] Humidity field: relative humidity, its data type is floating point, and the allowed value is 0~100%.

[0084] [A9-1-2] pH field: pH value, its data type is floating point, and the allowed value is 0~14.

[0085] [A9-1-3] Reaction Time field: Describes the reaction time that measures the catalytic performance of an enzyme. Its data type is a physical quantity structure, and its unit field allows one of the string enumeration values ​​["ms", "s", "min", "h", "day"].

[0086] [A9-1-4] Reactor field: Describes the container in which the reaction takes place to measure the catalytic performance of an enzyme. Its data type is a reactor structure, which contains the following member fields:

[0087] [A9-1-4-1] Agitation field: Describes the agitation method and speed. Its data type is an agitation structure, which contains the following member fields:

[0088] [A9-1-4-1-1] Magnitude field: describes the magnitude of stirring, such as the size of the stirring magnet, the diameter of the mechanical stirring paddle, the flow rate of the air rise or the velocity of the crossflow, etc. Its data type is a physical quantity structure.

[0089] [A9-1-4-1-2] Method field: Describes the mixing method. Its data type is string, and the allowed value is one of the string enumeration values ["Air lift", "Cross current", "Linear shaking", "Magnetic agitation", "Mechanic agitation", "Orbit shaking", "Other", "Vertex mixing"], which respectively represent air lift mixing, cross-flow mixing, reciprocating oscillation, magnetic stirring, mechanical stirring, circular oscillation, other, and vortex mixing.

[0090] [A9-1-4-1-3] Speed ​​field: describes the stirring speed. Its data type is a physical quantity structure, and its unit field allows one of the string enumeration values ​​["Hz", "m / s", "rad / s", "rpm"].

[0091] [A9-1-4-2] Diameter field: Describes the diameter of the reactor. Its data type is a physical quantity structure, and its unit field allows one of the string enumeration values ​​["cm", "dm", "m", "mm"].

[0092] [A9-1-4-3] Height field: Describes the height of the reactor. Its data type is a physical quantity structure, and its unit field allows one of the string enumeration values ​​["cm", "dm", "m", "mm"].

[0093] [A9-1-4-4] Shape field: Describes the shape of the reactor. Its data type is string, and the allowed value is one of the string enumeration values ​​["Eppendorf tube", "Glass vial", "Hydrogenation Reactor", "Jacket", "Microplate vial", "Round bottom flask", "T-flask", "Test tube", "Other"], representing centrifuge tube, glass bottle, hydrogenation reaction flask, jacketed reaction flask, microplate well, round bottom flask, conical flask, test tube, and others, respectively.

[0094] [A9-1-5] Temperature field: Describes the temperature at which the reaction takes place to measure the catalytic performance of the enzyme. Its data type is a physical quantity structure, and its unit field allows values ​​to be one of the string enumeration values ​​["℃", "K"].

[0095] [A9-2] id field: describes the experiment number, and its data type is string.

[0096] [A9-3] Product field: Describes the products and results of the reaction used to measure the catalytic performance of an enzyme. Its data type is a product structure, which contains the following member fields:

[0097] [A9-3-1] ConversionRatio field: Describes the conversion rate of the reaction that measures the catalytic performance of the enzyme. Its data type is a floating-point number, and the allowed value is a number from 0 to 1.

[0098] [A9-3-2] de field: describes the diastereomer excess value of the reaction product used to measure the catalytic performance of an enzyme. Its data type is floating point and allows values ​​from 0 to 1.

[0099] [A9-3-3] dr field: describes the ratio of diastereomers of the products of the reaction that measures the catalytic performance of an enzyme. Its data type is floating point and allows values ​​greater than 0.

[0100] [A9-3-4] ee field: describes the enantiomer excess value of the reaction product used to measure the catalytic performance of an enzyme. Its data type is floating point and allows values ​​from 0 to 1.

[0101] [A9-3-5] EnantioselectivityRatio field: Describes the enantioselectivity of the reaction that measures the catalytic performance of the enzyme. Its data type is a floating-point number, and values greater than 0 are allowed.

[0102] [A9-3-6] er field: describes the enantiomeric ratio of the products of the reaction that measures the catalytic performance of the enzyme. Its data type is floating point and allows values ​​greater than 0.

[0103] [A9-3-7] The molecule field describes the product molecule of the reaction that measures the catalytic performance of an enzyme. Its data type is ligand structure.

[0104] [A9-3-8] Purity field: Describes the purity of the product of the reaction that measures the catalytic performance of the enzyme. Its data type is floating point, and it allows values ​​from 0 to 1.

[0105] [A9-3-9] The re field describes the regioisomeric excess value of the product positional isomers of the reaction that measures the catalytic performance of the enzyme. Its data type is a floating-point number, and it allows values ​​from 0 to 1.

[0106] [A9-3-10] rr field: describes the ratio of product positional isomers in a reaction that measures the catalytic performance of an enzyme. Its data type is floating point and allows values ​​greater than 0.

[0107] [A9-3-11] Isolated Yield field: Describes the isolated yield of the reaction used to measure the catalytic performance of the enzyme. Its data type is a floating-point number, and the allowed value is a number from 0 to 1.

[0108] [A9-3-12] Solution Yield field: Describes the in-situ yield of the reaction used to measure the catalytic performance of the enzyme. Its data type is a floating-point number, and the allowed value is a number from 0 to 1.
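The numeric ranges stated for the product fields [A9-3-1] to [A9-3-12] can be checked mechanically. The sketch below is illustrative only. The function name `invalid_product_fields` and the camelCase keys for fields the source names only in prose (for example `isolatedYield`) are assumptions.

```python
# Fields that the description restricts to the closed interval [0, 1].
UNIT_INTERVAL_FIELDS = {
    "conversionRatio", "de", "ee", "purity", "re",
    "isolatedYield", "solutionYield",
}
# Fields that the description requires to be greater than 0.
POSITIVE_FIELDS = {"dr", "enantioselectivityRatio", "er", "rr"}


def invalid_product_fields(product: dict) -> list:
    """Return the sorted names of numeric fields that fall outside their range."""
    bad = []
    for key, value in product.items():
        if key in UNIT_INTERVAL_FIELDS and not 0.0 <= value <= 1.0:
            bad.append(key)
        elif key in POSITIVE_FIELDS and not value > 0.0:
            bad.append(key)
    return sorted(bad)
```

Keys that are not listed, such as `molecule`, are ignored by this check.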

[0109] [A9-4] The `reagents` field describes the reagents involved in the reaction that measures the catalytic performance of the enzyme. Its data type is an array of `reagent` structures. Each `reagent` structure represents a reagent and contains the following member fields:

[0110] [A9-4-1] Charging field: describes the way the reagent is added to the reaction, and its data type is the charging structure.

[0111] [A9-4-2] Dilution field: describes how a reagent is diluted when added in solution form. Its data type is a dilution structure.

[0112] [A9-4-3] Loading field: Describes the amount of reagent added. Its data type is physical quantity. Its unit field allows one of the following string enumeration values: ["eq.", "g", "L", "kg", "mg", "mL", "mmol", "mol", "V", "X", "μL"].

[0113] [A9-4-4] Molecule field: describes the reagent molecule, and its data type is ligand structure.

[0114] [A9-5] The `substrates` field describes the substrate or main feedstock of the reaction used to measure the catalytic performance of the enzyme. Its data type is an array of `substrate` structures. Each `substrate` structure represents a substrate or main feedstock and contains the following member fields:

[0115] [A9-5-1] Charging field: describes the way the substrate is added to the reaction in which the enzyme’s catalytic performance is measured. Its data type is the Charging structure.

[0116] [A9-5-2] Dilution field: describes the dilution method when the substrate is added in solution during the reaction that measures the catalytic performance of an enzyme. Its data type is the dilution structure.

[0117] [A9-5-3] Loading field: Describes the amount of substrate added to the reaction that measures the catalytic performance of the enzyme. Its data type is a physical quantity structure, and its unit field allows one of the string enumeration values ​​["eq.", "g", "L", "kg", "mg", "mL", "mmol", "mol", "V", "X", "μL"].

[0118] [A9-5-4] The molecule field describes the substrate molecule of the reaction that measures the catalytic performance of an enzyme. Its data type is ligand structure.

[0119] [A10] References field: Describes information about references or cited materials. Its data type is an array of reference structures. Each reference structure represents one reference or cited material and contains the following member fields:

[0120] [A10-1] Citation field: Describes the source of the citation, such as journals, books, dissertations, and their volume, issue, and page numbers. Its data type is string.

[0121] [A10-2] Date field: Describes the earliest date and time of publication of the citation, and its data type is date and time.

[0122] [A10-3] Note field: Describes the notes in the citation; its data type is string.

[0123] [A10-4] Reference URI field: A Uniform Resource Identifier that describes the access to the citation from the Internet or a storage system. Its data type is string.

[0124] [A10-5] Title field: Describes the title of the citation; its data type is string.

[0125] [A11] Sequences field: Describes the enzyme sequence. Its data type is an array of sequence structures. Each sequence structure represents one enzyme sequence and contains the following member fields:

[0126] [A11-1] GenBank Accession field: Describes the GenBank accession of the enzyme sequence, i.e., its accession number in the GenBank database. Its data type is string. GenBank is a DNA sequence database established by the National Center for Biotechnology Information (NCBI) in the United States. It obtains sequence data from public resources, primarily direct submissions by researchers and large-scale genome sequencing projects. To keep the data as complete as possible, GenBank exchanges data with the EMBL nucleotide sequence database in Europe and with DDBJ (DNA Data Bank of Japan).

[0127] [A11-2] gi field: Describes the GI number of the enzyme sequence, i.e., the GenInfo Identifier; its data type is string.

[0128] [A11-3] Mutations field: Describes all the differences in an enzyme sequence relative to its natural ancestor, i.e., mutations. Its data type is an array of mutation structures. Each mutation structure represents a mutation.

[0129] [A11-4] Note field: Describes notes about the enzyme sequence. Its data type is string.

[0130] [A11-5] ReferenceURI field: Describes the source from which this enzyme sequence was first reported, such as a reference or the Uniform Resource Identifier (URI) of the corresponding entry in a public database. Its data type is string.

[0131] [A11-6] Sequence field: Describes the specific content of the enzyme sequence. Its data type is string, and the allowed values ​​must conform to the FASTA format.

[0132] [A11-7] SequenceURI field: Describes a Uniform Resource Identifier (URI) for accessing this enzyme sequence from the Internet or a storage system. Its data type is string.

[0133] [A11-8] Type field: Describes the type of the enzyme sequence. Its data type is string, and the allowed values ​​are one of the string enumeration values ​​["Nucleotide", "Peptide"], which represent the base sequence and the amino acid sequence, respectively.

[0134] [A11-9] The `uniProt` field describes the UniProt ID of the enzyme sequence; its data type is string. UniProt is an abbreviation for Universal Protein, the most comprehensive and resource-rich protein database. It integrates data from three major databases: Swiss-Prot, TrEMBL, and PIR-PSD. Its data primarily comes from protein sequences obtained after the completion of genome sequencing projects. It contains a wealth of information on the biological functions of proteins from the literature.

[0135] [A12] Structures field: Describes the enzyme structure. Its data type is an array of several structure objects. Each structure object represents a three-dimensional structure of an enzyme and contains the following member fields:

[0136] [A12-1] Ligands field: Describes the ligands in the enzyme structure, that is, the small molecular components of the enzyme structure other than the main chain that makes up the protein. These can typically be water molecules, water-soluble ions, water-soluble small organic solutes, or small organic molecules bound to the protein surface or interior, such as substrates, products, inhibitors, etc. Its data type is an array of several ligand structures. Each ligand structure represents one ligand in the enzyme structure.

[0137] [A12-2] Mutations field: Describes all the differences between the sequence corresponding to the enzyme structure and its natural ancestor, i.e., mutations. Its data type is an array of mutation structures. Each mutation structure represents a mutation.

[0138] [A12-3] Note field: Note information describing the enzyme structure; its data type is string.

[0139] [A12-4] Reference URI field: Describes the source from which this enzyme structure was first reported, such as a reference or the Uniform Resource Identifier of the corresponding entry in a public database. Its data type is string.

[0140] [A12-5] SequenceURI field: Describes the source in which the enzyme sequence corresponding to this enzyme structure was first reported, such as a reference or the Uniform Resource Identifier of the corresponding entry in a public database. Its data type is string.

[0141] [A12-6] StructureURI field: A Uniform Resource Identifier (URI) describing access to this enzyme structure from the Internet or storage system. Its data type is string.

[0142] [A12-7] Structure field: Describes the specific content of the enzyme structure. Its data type is string, and the allowed values must conform to the PDB format.

[0143] [A12-8] Category (type) field: Describes the category of enzyme structure. Its data type is string, and the allowed value is one of the string enumeration values ​​["CryoSEM", "NMR", "Other", "Predicted Model", "XRD"], which respectively represent cryo-electron microscopy, nuclear magnetic resonance, other, predicted structure model, and X-ray crystal diffraction structure.

[0144] In response to the completion of structuring and obtaining the ECD structure of the target enzyme, the storage device can store the directed evolution data of the target enzyme into a computer-readable storage medium according to the architecture of the ECD structure, and collect ECD structures of various enzymes to construct the EvoCloud database of directed evolution data of enzymes. Thus, this invention can integrate multi-dimensional data involved in the directed evolution process of enzymes, characterize the catalytic performance of enzymes, and achieve efficient storage of this multi-dimensional data.

[0145] Furthermore, the aforementioned ECD structure can use JSON (JavaScript Object Notation) as its container format, which is the format defined by the ECMA-404 standard (European Computer Manufacturers Association Standard 404). JSON is a widely used computer program data exchange format. Its serialization, deserialization, node insertion, node deletion, and node editing are directly supported by the latest mainstream high-level computer programming languages ​​such as ECMAScript (ECMA-262), C (ISO / IEC 9899:2011), C++ (ISO / IEC 14882), Java (ISO / IEC TR 13066), and C# (ECMA-334). It can be directly understood and processed by computer programs without additional data processing.
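To make the JSON container concrete, the following Python sketch serializes a small ECD-like record, deserializes it, and then inserts, deletes, and edits nodes. The record contents are placeholders, and the lowercase keys (`alias`, `name`, `note`) are assumed spellings of the [A1], [A5], and [A6] fields.

```python
import json

# A deliberately tiny, hypothetical ECD record; all values are placeholders.
ecd = {
    "id": "enzyme-0001",
    "name": "example enzyme",
    "alias": ["EX1"],
    "sequences": [
        {
            "type": "Peptide",
            "sequence": ">example\nMRT",
            "mutations": [
                {"type": "peptide", "position": 2,
                 "templateMotif": "K", "mutationMotif": "R"}
            ],
        }
    ],
}

text = json.dumps(ecd, ensure_ascii=False, indent=2)  # serialization
restored = json.loads(text)                           # deserialization

restored["note"] = "added after deserialization"      # node insertion
del restored["alias"]                                 # node deletion
restored["name"] = "renamed enzyme"                   # node editing
```

Because `json.loads` builds a new object, the edits to `restored` leave the original `ecd` unchanged.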

[0146] Furthermore, the ECD structure provided by this invention can be implemented using a dictionary-based non-relational database organization structure. Please refer to Figure 3, which shows a schematic diagram of a non-relational database of directed evolution data of enzymes provided according to some embodiments of the present invention.

[0147] As shown in Figure 3, in different computer programming languages or database systems, a dictionary structure, also known as an associative array or map, is an abstract data structure containing multiple ordered pairs of the form (key, value). This data structure supports common operations such as retrieving, adding, deleting, and modifying pairs. For example, in a pair retrieval operation, the operation parameter is the key to be searched, and the returned value is the corresponding value. If the corresponding key-value pair does not exist, some implementations will raise an exception, while others will create and add a new key-value pair using the given key, where the "value" is the default value of its type (zero, empty container, etc.). As another example, in the add pair operation, the storage device can add a new key-value pair and establish a mapping from the new key to the new value; the operation parameters are the key and value to be added. As yet another example, in the delete pair operation, the storage device can remove a key-value pair and cancel the mapping from that key to that value; the operation parameter is the key to be deleted. As a final example, in the modify pair operation, the storage device can change the value of an existing key-value pair and map the original key to the new value; the operation parameters are the key and the new value.
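The four dictionary operations, and the two behaviors on a missing key, can be sketched with Python's built-in `dict` and the standard-library `collections.defaultdict`. The keys and values are placeholders chosen for illustration.

```python
from collections import defaultdict

store = {}

# Add a pair: establish a mapping from a new key to a new value.
store["enzyme-0001"] = {"name": "example enzyme"}

# Retrieve a pair: the parameter is the key, the result is its value.
found = store["enzyme-0001"]

# Modify a pair: map the existing key to a new value.
store["enzyme-0001"] = {"name": "renamed enzyme"}

# Missing key, first behavior: a plain dict raises an exception.
try:
    store["missing"]
    raised = False
except KeyError:
    raised = True

# Missing key, second behavior: defaultdict creates and stores a default value.
auto = defaultdict(list)
created = auto["missing"]  # an empty list is created and added under "missing"

# Delete a pair: remove the key and cancel its mapping.
del store["enzyme-0001"]
```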

[0148] In some embodiments, the storage device may designate one or more of the [A1] to [A12] fields of the ECD structure as directed evolution data fields that record directed evolution data of the target enzyme. Here, each field consists of an interrelated field key and field value, wherein the field key indicates the type of the field, and the field value indicates the specific content corresponding to the field type.

[0149] Computer programming languages such as JavaScript (i.e., ECMAScript, a computer programming language defined by the international standard ECMA-262) have built-in basic data types that support dictionary structures, and modern NoSQL (Not only SQL, where SQL stands for Structured Query Language) database systems such as IndexedDB, Redis, and MongoDB directly support dictionary structures as a way to store data. Furthermore, CAM (Content-Addressable Memory) implements hardware-level support for dictionary structures. ECD structures stored in a dictionary format can therefore be directly understood and processed by computer programs without additional data processing. In addition, because dictionary structures store data as one-to-one key-value mappings, retrieval by key is highly efficient compared with relational storage methods, which makes them well suited to the large-scale storage, indexing, and computational optimization of directed evolution data of enzymes.

[0150] Furthermore, after completing the structured processing of the enzyme's directed evolution data and determining the ECD structure of the target enzyme, the present invention can also use the ECD structure to construct a retrieval candidate set of the enzyme's directed evolution data, so that the retrieval device can efficiently retrieve multi-dimensional data involved in the directed evolution process of the enzyme based on the input retrieval information.

[0151] Specifically, in constructing the retrieval candidate set, the storage device can first define at least one data structure based on the supported retrieval information types. For example, for a scheme supporting enzyme sequence-based retrieval, the data structure correspondingly includes a sequence node structure. This sequence node structure includes at least a directed evolution data field and an alignment information field. Here, the data type of the directed evolution data field can be the aforementioned ECD structure, i.e., constituting an ecd field, to comprehensively integrate the multi-dimensional data involved in the directed evolution process of the enzyme, characterize the catalytic performance of the enzyme, and facilitate efficient retrieval of this multi-dimensional data. The data type of the alignment information field is adapted to the type of retrieval information and uses the FASTA format, i.e., constituting an alignment sequence field, to represent the nucleic acid sequence or polypeptide sequence of the corresponding candidate enzyme.

[0152] Subsequently, the storage device can create a variable S in the computer memory and define its data type as an array consisting of at least one sequence node structure to construct the corresponding sequence candidate set to accommodate the sequence node structures of various candidate enzymes in the EvoCloud database.

[0153] Next, the storage device can traverse the EvoCloud database based on the [A11] sequence field, construct a sequence node structure for each member e[i].sequences[j] that contains the [A11] field, initialize its ecd field ecd = e[i].ecd, initialize its alignment sequence field sequence = e[i].sequences[j], and add the sequence node structure to the end of array S and push it onto the stack to construct the sequence candidate set corresponding to the sequence retrieval function.
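The traversal in [0152] and [0153] can be sketched as follows. This is only an illustration under stated assumptions: the EvoCloud database is modeled as a list of ECD dictionaries, each record serves as its own `ecd` value, and the FASTA text is taken from the `sequence` member of each sequence structure. The function name `build_sequence_candidates` is hypothetical.

```python
def build_sequence_candidates(database):
    """Build the sequence candidate set S from a list of ECD records."""
    S = []  # array of sequence node structures
    for e in database:                     # e plays the role of e[i]
        for seq in e.get("sequences", []):  # seq plays the role of e[i].sequences[j]
            node = {
                "ecd": e,                    # directed evolution data field
                "sequence": seq["sequence"],  # alignment sequence field (FASTA text)
            }
            S.append(node)                   # add the node to the end of S
    return S
```

Records without an [A11] sequences field simply contribute no nodes in this variant.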

[0154] Furthermore, in some embodiments, the sequence node structure may preferably include a unique ID field. During the construction of the sequence candidate set, the storage device can traverse the EvoCloud database based on the [A4]id field, creating a sequence node structure for each piece of data, initializing its ecd field ecd = e[i].ecd, initializing its unique ID field id = e[i].id, initializing its alignment sequence field sequence = e[i].sequences[j], and adding this sequence node structure to the end of array S and pushing it onto the stack to initially construct the sequence candidate set corresponding to the sequence retrieval function. Afterwards, for sequence node structures with an empty alignment sequence field (i.e., enzymes in the EvoCloud database that do not contain the [A11] field), the storage device can preferably output a prompt message "Supplement the content of the alignment sequence field" to prompt technicians to synthesize, express, and experimentally verify the missing sequence information in the EvoCloud database.

[0155] Those skilled in the art will understand that the sequence candidate set constructed for the sequence retrieval function described above is only a non-limiting implementation of the present invention, intended to clearly demonstrate the main concept of the present invention and provide a specific solution that is easy for the public to implement, rather than being used to limit the scope of protection of the present invention.

[0156] Optionally, in other embodiments, for schemes supporting enzyme structure-based retrieval, the defined data structure correspondingly includes a structureNode structure. This structureNode structure includes at least a directed evolution data field and an alignment information field. Here, the data type of the directed evolution data field can be the aforementioned ECD structure, i.e., constituting an ecd field, to comprehensively integrate the multi-dimensional data involved in the directed evolution process of the enzyme, characterize the enzyme's catalytic performance, and facilitate efficient retrieval of this multi-dimensional data. The data type of the alignment information field is adapted to the type of retrieval information and uses the PDB format, i.e., constituting an alignment structure field, to completely record the structure and related information of the candidate enzyme.

[0157] Subsequently, the storage device can create a variable S in the computer memory and define its data type as an array consisting of at least one structureNode structure to construct a corresponding structure candidate set to accommodate the structureNode structures of various candidate enzymes in the EvoCloud database.

[0158] Next, the storage device can traverse the EvoCloud database based on the [A12] structure field, construct a structureNode structure for each member e[i].structures[j] that contains the [A12] field, initialize its ecd field ecd = e[i].ecd, initialize its alignment structure field structure = e[i].structures[j], and add the structureNode structure to the end of the array S and push it onto the stack to construct the structure candidate set corresponding to the structure retrieval function.

[0159] Furthermore, in some embodiments, the structureNode structure may preferably include a unique ID field. During the construction of the structure candidate set, the storage device can traverse the EvoCloud database based on the [A4]id field, creating a structureNode structure for each piece of data, initializing its ecd field ecd = e[i].ecd, initializing its unique ID field id = e[i].id, initializing its alignment structure field structure = e[i].structures[j], and adding the structureNode structure to the end of array S and pushing it onto the stack to initially construct the structure candidate set corresponding to the structure retrieval function. Then, for structureNode structures with an empty alignment structure field (i.e., enzymes in the EvoCloud database that do not contain the [A12] field), the storage device can preferably infer the enzyme structure based on the sequence data recorded in its [A11]sequences field, and fill the inferred structure data into the structureNode structure in PDB format to supplement the content of its alignment structure field. The ability to infer enzyme structure from sequence data can be achieved using existing algorithms, which will not be elaborated here.
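A minimal sketch of the ID-based variant in [0159] follows. Since the source does not name the algorithm used to infer a structure from a sequence, the predictor is passed in as a caller-supplied function `predict_pdb`, which is purely hypothetical. The list-of-dictionaries database model and the choice of the first recorded sequence as predictor input are likewise assumptions.

```python
from typing import Callable, List


def build_structure_candidates(
    database: List[dict], predict_pdb: Callable[[str], str]
) -> List[dict]:
    """Build the structure candidate set, inferring structures that are missing."""
    S = []
    for e in database:
        # Collect the PDB text of every recorded structure of this enzyme.
        structures = [
            s["structure"] for s in e.get("structures", []) if s.get("structure")
        ]
        if not structures:
            # No [A12] data: infer a structure from the first recorded sequence,
            # or leave the alignment structure field empty if none exists.
            seqs = e.get("sequences", [])
            structures = [predict_pdb(seqs[0]["sequence"])] if seqs else [""]
        for pdb_text in structures:
            S.append({"id": e["id"], "ecd": e, "structure": pdb_text})
    return S
```

Passing the predictor as a parameter keeps the candidate set construction independent of whichever structure prediction tool is actually used.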

[0160] Optionally, in other embodiments, for schemes supporting ligand-based small molecule retrieval, the defined data structure correspondingly includes a ligand node structure. This ligand node structure includes at least a directed evolution data field and an alignment information field. Here, the data type of the directed evolution data field can be the aforementioned ECD structure, i.e., constituting an ecd field, to comprehensively integrate the multi-dimensional data involved in the directed evolution process of the enzyme, characterize the enzyme's catalytic performance, and facilitate efficient retrieval of this multi-dimensional data. The data type of the alignment information field is adapted to the type of retrieval information and uses the SMILES format, i.e., constituting an alignment ligand (smiles) field, thereby explicitly describing the molecular structure of the ligand using ASCII strings.

[0161] Subsequently, the storage device can create a variable S in the computer memory and define its data type as an array consisting of at least one ligand node structure to construct the corresponding ligand candidate set to accommodate the ligand node structures of various candidate enzymes in the EvoCloud database.

[0162] Next, the storage device can traverse the EvoCloud database based on the [A9-3] product field and the [A9-5] substrate field, and construct a ligand node structure for each member e[i].product.smiles and e[i].substrates[j].smiles that contains the [A9-3] field or the [A9-5] field, initialize its ecd field ecd = e[i].ecd, initialize its alignment ligand (smiles) field smiles = e[i].substrates[j].smiles or e[i].product.smiles, and add the ligand node structure to the end of the array S and push it onto the stack to construct the ligand candidate set corresponding to the ligand retrieval function.
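The ligand traversal can be sketched in the same way. The shorthand paths e[i].product.smiles and e[i].substrates[j].smiles are interpreted here, as an assumption, through the field layout defined earlier: products and substrates sit inside each entry of the [A9] performances array, and the SMILES string sits inside their molecule (ligand) structure. The function name `build_ligand_candidates` is hypothetical.

```python
def build_ligand_candidates(database):
    """Build the ligand candidate set from product and substrate SMILES strings."""
    S = []
    for e in database:
        for perf in e.get("performances", []):
            # Gather the product molecule first, then every substrate molecule.
            molecules = [perf.get("product", {}).get("molecule")]
            molecules += [sub.get("molecule") for sub in perf.get("substrates", [])]
            for mol in molecules:
                if mol and mol.get("smiles"):
                    S.append({"ecd": e, "smiles": mol["smiles"]})
    return S
```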

[0163] Furthermore, in some embodiments, the ligand node structure may preferably include a unique ID field. During the construction of the ligand candidate set, the storage device can traverse the EvoCloud database based on the [A4]id field, creating a ligand node structure for each piece of data, initializing its ecd field ecd = e[i].ecd, initializing its unique ID field id = e[i].id, initializing its alignment ligand (smiles) field smiles = e[i].substrates[j].smiles or e[i].product.smiles, and adding the ligand node structure to the end of array S and pushing it onto the stack to initially construct the ligand candidate set corresponding to the ligand retrieval function. Subsequently, for ligand node structures where the alignment ligand (smiles) field is empty (i.e. enzymes that do not contain the [A9-3] and [A9-5] fields in the EvoCloud database), the storage device can preferably output a prompt message "supplement the content of the alignment ligand (smiles) field" to prompt technicians to synthesize, express, and experimentally verify the missing ligand information in the EvoCloud database.

[0164] Based on the above description, the method, apparatus, and computer-readable storage medium for processing directed evolution data of enzymes provided by the present invention can integrate multi-dimensional data involved in the directed evolution process of enzymes and characterize the catalytic performance of enzymes, thereby facilitating efficient retrieval of these multi-dimensional data during the online retrieval stage.

[0165] Subsequently, during the online retrieval phase, the retrieval device can obtain retrieval information 's' about the target enzyme via a human-computer interaction interface and determine the corresponding candidate set based on this retrieval information. Then, the retrieval device can determine the similarity score between the retrieval information 's' and the alignment information fields of each data structure in the candidate set, and select at least one data structure similar to the target enzyme from the candidate set according to a preset similarity threshold. It then outputs the directed evolution data fields and / or alignment information fields of each data structure similar to the target enzyme, thereby achieving efficient retrieval of directed evolutionary multidimensional data of the target enzyme.

[0166] Please refer to Figure 4, which shows a flowchart of a method for processing directed evolution data of enzymes according to some embodiments of the present invention. As shown in Figure 4, in some embodiments of the sequence retrieval function, a technician can input retrieval information s containing a specific sequence of the target enzyme via the human-computer interaction interface of the retrieval device. In response to obtaining the retrieval information s, the retrieval device can determine, based on its data structure and / or data content, that it is enzyme sequence data belonging to the above-mentioned [A11] field, and accordingly retrieve the pre-constructed sequence candidate set.

[0167] Subsequently, the retrieval device can perform multiple sequence alignment (MSA) calculations between the retrieval information s and the alignment sequence fields S[k].sequence of all members S[k] (i.e., the sequenceNode structures mentioned above) in the sequence candidate set S, to determine the similarity score between the sequence content s of the target enzyme and each alignment sequence field S[k].sequence. Here, the MSA calculation can be implemented with various existing algorithms and programs, such as the pairwise alignment tool BLAST, ClustalW, T-Coffee, MUSCLE, and MAFFT, which will not be elaborated upon here.

[0168] If the similarity score is greater than or equal to a pre-set first similarity threshold (e.g., 0.3), the retrieval device can assign the similarity score to the similarity score field S[k].identityScore of the corresponding sequence node structure S[k]. Conversely, if the similarity score is less than the pre-set first similarity threshold (e.g., 0.3), the retrieval device can delete this member S[k] from the sequence candidate set S. Thus, in response to completing the MSA calculation of all members S[k] in the entire sequence candidate set S, the results of batch retrieval based on the input sequence s can be formed in the final sequence candidate set S'. Afterwards, the retrieval device can determine each associated enzyme similar to the target enzyme based on each sequence node structure S[k] in the final sequence candidate set S', and output the ecd field, id field, similarity score field and / or alignment sequence field in its sequenceNode structure to achieve efficient retrieval of directed evolutionary multidimensional data of the target enzyme.
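The thresholding step of paragraph [0168] can be sketched as follows. The scoring function here is only a stand-in: it uses `difflib.SequenceMatcher` from the Python standard library to obtain a ratio between 0 and 1, and is not one of the MSA programs named above. The `SequenceNode` class and the function names are likewise hypothetical. The sketch builds the final set S' as a new list instead of deleting members from S in place, which yields the same result.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class SequenceNode:
    """Hypothetical sequence node structure with id, ecd, sequence, and score fields."""
    id: str
    ecd: dict
    sequence: str
    identityScore: Optional[float] = None


def pairwise_score(query: str, target: str) -> float:
    """Stand-in similarity in [0, 1]; a real system would call an alignment tool."""
    return SequenceMatcher(None, query, target, autojunk=False).ratio()


def retrieve_by_sequence(s: str, S: list[SequenceNode],
                         threshold: float = 0.3) -> list[SequenceNode]:
    """Keep members scoring at or above the first similarity threshold."""
    S_final: list[SequenceNode] = []
    for node in S:
        score = pairwise_score(s, node.sequence)
        if score >= threshold:
            node.identityScore = score  # record the score on the retained node
            S_final.append(node)
        # members below the threshold are simply not carried into S'
    return S_final
```

The default of 0.3 mirrors the example threshold given above; any other preset value can be passed in.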

[0169] In addition, please refer to Figure 5, which shows a flowchart of a method for processing directed evolution data of enzymes according to some embodiments of the present invention. As shown in Figure 5, in some embodiments of the structure retrieval function, a technician can input retrieval information s containing a specific structure of the target enzyme via the human-computer interaction interface of the retrieval device. In response to obtaining the retrieval information s, the retrieval device can determine, based on its data structure and / or data content, that it is enzyme structure data belonging to the above-mentioned [A12] field, and accordingly retrieve the pre-constructed structure candidate set.

[0170] Subsequently, the retrieval device can perform structure alignment calculations between the retrieval information s and the alignment structure field S[k].structure of all members S[k] (i.e., the aforementioned structureNode structures) in the structure candidate set S, to determine the similarity score between the structural content s of the target enzyme and each alignment structure field S[k].structure. Here, the structure alignment calculation can be implemented with various existing algorithms and programs such as jFATCAT-rigid, jFATCAT-flexible, jCE, jCE-CP, TM-align, and Smith-Waterman3D, which will not be elaborated upon here.

[0171] If the similarity score is greater than or equal to a pre-set second similarity threshold (e.g., 30%), the retrieval device can assign the similarity score to the similarity score field S[k].similarityScore of the corresponding structure node structure S[k]. Conversely, if the similarity score is less than the pre-set second similarity threshold (e.g., 30%), the retrieval device can delete this member S[k] from the structure candidate set S. Thus, in response to completing the structure alignment calculation of all members S[k] in the entire structure candidate set S, the results of batch retrieval based on the input structure s can be formed in the final structure candidate set S'. Afterwards, the retrieval device can determine each associated enzyme similar to the target enzyme based on each structure node structure S[k] in the final structure candidate set S', and output the ecd field, id field, similarity score field and / or alignment structure field in its structureNode structure to achieve efficient retrieval of the multidimensional directed evolution data of the target enzyme.

[0172] Those skilled in the art will understand that the above-described scheme for determining whether each member S[k] in the candidate structure set S is similar to the structural content s of the target enzyme based on the percentage-based structural similarity score is merely a non-limiting implementation provided by the present invention. It is intended to clearly demonstrate the main concept of the present invention and provide some specific solutions that are easy for the public to implement, rather than to limit the scope of protection of the present invention.

[0173] Optionally, in other embodiments, those skilled in the art can also calculate the root mean squared deviation (RMSD) between the multidimensional structural data of each member S[k] in the structure candidate set S and the multidimensional structural content s of the target enzyme, to characterize the structural deviation between each member S[k] in the structure candidate set S and the target enzyme. In this way, the retrieval device can determine whether each member S[k] in the structure candidate set S is similar to the structural content s of the target enzyme based on whether the RMSD value is less than a preset RMSD threshold, thereby achieving the same effect of structural alignment calculation.
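The RMSD criterion of paragraph [0173] can be sketched as follows. The sketch assumes that both structures have already been superposed and reduced to equal-length lists of matched atom coordinates (for example, one representative atom per residue); the superposition step itself is not shown. The function names and the default threshold of 2.0 are hypothetical values chosen for illustration, since the text above does not specify a preset RMSD threshold.

```python
import math
from typing import Sequence

Point = tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Root mean squared deviation between two matched, pre-superposed coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(a, b)
    )
    return math.sqrt(squared / len(a))


def is_structurally_similar(a: Sequence[Point], b: Sequence[Point],
                            rmsd_threshold: float = 2.0) -> bool:
    """A member counts as similar when its RMSD is below the preset threshold."""
    return rmsd(a, b) < rmsd_threshold
```

Note that the comparison direction is reversed relative to the percentage-based score: a smaller RMSD indicates a closer structural match.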

[0174] In addition, please refer to Figure 6, which shows a flowchart of a method for processing directed evolution data of enzymes according to some embodiments of the present invention. As shown in Figure 6, in some embodiments of the small molecule retrieval function, a technician can input retrieval information s about a specific ligand molecule of the target enzyme via the human-computer interaction interface of the retrieval device. In response to obtaining the retrieval information s, the retrieval device can determine, based on its data structure and / or data content, that it is product data belonging to the above-mentioned [A9-3] field and / or substrate data belonging to the [A9-5] field, and accordingly retrieve the pre-constructed ligand candidate set.

[0175] Subsequently, the retrieval device can perform molecular similarity calculations between the retrieval information s and the alignment ligand field S[k].smiles of all members S[k] (i.e., the aforementioned ligandNode structures) in the ligand candidate set S, to determine the similarity score between the ligand molecular information s for the target enzyme and each alignment ligand field S[k].smiles. This molecular similarity calculation can be implemented using various existing algorithms and programs such as Tanimoto similarity, topological fingerprints, and Morgan fingerprints, which will not be elaborated upon here.

[0176] If the similarity score is greater than or equal to a pre-set third similarity threshold (e.g., 0.5), the retrieval device can assign the similarity score to the similarity score field S[k].similarityScore of the corresponding ligand node structure S[k]. Conversely, if the similarity score is less than the pre-set third similarity threshold (e.g., 0.5), the retrieval device can delete this member S[k] from the ligand candidate set S. Thus, in response to completing the molecular similarity calculation of all members S[k] in the entire ligand candidate set S, the results of batch retrieval based on the input molecular information s can be formed in the final ligand candidate set S'. Afterwards, the retrieval device can determine each associated enzyme similar to the target enzyme based on each ligand node structure S[k] in the final ligand candidate set S', and output the ecd field, id field, similarity score field and / or aligned ligand (smiles) field in its ligandNode structure to achieve efficient retrieval of multidimensional data of the target enzyme's directed evolution.
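The ligand filtering of paragraphs [0175] and [0176] can be sketched as follows. The Tanimoto coefficient is computed as the size of the intersection of two feature sets divided by the size of their union. The fingerprint used here is a deliberately crude stand-in (the set of two-character substrings of the SMILES string) so that the example runs without a cheminformatics library; it is not a topological or Morgan fingerprint and should not be used for real chemistry. The function names and the dictionary layout of the nodes are hypothetical.

```python
def toy_fingerprint(smiles: str, n: int = 2) -> frozenset[str]:
    """Stand-in fingerprint: the set of length-n substrings of the SMILES string."""
    if len(smiles) < n:
        return frozenset({smiles}) if smiles else frozenset()
    return frozenset(smiles[i:i + n] for i in range(len(smiles) - n + 1))


def tanimoto(a: frozenset[str], b: frozenset[str]) -> float:
    """Tanimoto coefficient |A & B| / |A | B|, defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_by_ligand(s: str, S: list[dict], threshold: float = 0.5) -> list[dict]:
    """Return copies of the members whose score meets the third similarity threshold."""
    query_fp = toy_fingerprint(s)
    hits: list[dict] = []
    for node in S:
        score = tanimoto(query_fp, toy_fingerprint(node["smiles"]))
        if score >= threshold:
            # record the score in the similarityScore field of the retained member
            hits.append({**node, "similarityScore": score})
    return hits
```

With a real fingerprint in place of `toy_fingerprint`, the thresholding logic stays the same; only the feature sets change.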

[0177] In summary, by constructing a comprehensive ECD structure that integrates the multi-dimensional data involved in the directed evolution of enzymes and characterizes the catalytic performance of enzymes, and by constructing data structures whose directed evolution data field holds this ECD structure, this invention achieves efficient storage and retrieval of this multi-dimensional data. Furthermore, by constructing various data structures such as sequence node, structure node, and ligand node structures, and various candidate sets such as sequence candidate sets, structure candidate sets, and ligand candidate sets, this invention can adapt to the different types of retrieval information provided by users and provide the corresponding retrieval functions, thereby further reducing the difficulty of retrieving directed evolution data of enzymes, expanding the retrieval scope of such data, and improving retrieval efficiency.

[0178] Although the methods described above are illustrated and depicted as a series of actions for the sake of simplicity, it should be understood and appreciated that these methods are not limited by the order of the actions, since, according to one or more embodiments, some actions may occur in a different order and / or concurrently with other actions that are illustrated and described herein, or that are not illustrated and described herein but can be understood by those skilled in the art.

[0179] Those skilled in the art will understand that information, signals, and data can be represented using any of a variety of different techniques and skills. For example, the data, instructions, commands, information, signals, bits, symbols, and chips described throughout the above description can be represented by voltage, current, electromagnetic waves, magnetic fields or magnetic particles, light fields or optical particles, or any combination thereof.

[0180] Those skilled in the art will further appreciate that the various illustrative logic blocks, modules, circuits, and algorithm steps described in conjunction with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability between hardware and software, the various illustrative components, blocks, modules, circuits, and steps are described above in a generalized manner in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art may implement the described functionality in different ways for each specific application, but such implementation decisions should not be construed as departing from the scope of the invention.

[0181] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of both. The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to a processor such that the processor can read and write information to / from the storage medium. In an alternative, the storage medium may be integrated into the processor. The processor and storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In an alternative, the processor and storage medium may reside as discrete components in the user terminal.

[0182] In one or more exemplary embodiments, the described functionality may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functionality may be stored or transmitted as one or more instructions or code on or through a computer-readable medium. A computer-readable medium includes both computer storage media and communication media, encompassing any medium that facilitates the transfer of a computer program from one location to another. A storage medium may be any available medium accessible to a computer. By way of example and not limitation, such a computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and is accessible to a computer. Any connection is also properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of a medium. As used herein, disk and disc include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs. Disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of these should also be included within the scope of computer-readable media.

[0183] The prior description of this disclosure is provided to enable any person skilled in the art to make or use this disclosure. Various modifications to this disclosure will be apparent to those skilled in the art, and the general principles defined herein may be applied to other variations without departing from the spirit or scope of this disclosure. Therefore, this disclosure is not intended to be limited to the examples and designs described herein, but should be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for processing directed evolution data of enzymes, characterized in that it includes the following steps:
obtaining retrieval information about a target enzyme, wherein the retrieval information includes sequence content containing a specific sequence of the target enzyme;
determining a corresponding candidate set based on the retrieval information, wherein the candidate set includes a sequence candidate set, the sequence candidate set includes at least one data structure, the data structure includes a sequence node structure, the sequence node structure includes at least a directed evolution data field and an alignment information field, the alignment information field includes an alignment sequence field, and each alignment sequence field records an enzyme sequence;
performing a multiple sequence alignment operation on the sequence content of the target enzyme and the alignment sequence field of each sequence node structure in the sequence candidate set to determine the similarity score between the sequence content of the target enzyme and each alignment sequence field;
selecting, according to a preset similarity threshold, at least one data structure similar to the target enzyme from the candidate set; and
outputting the directed evolution data field and / or alignment information field of each data structure that is similar to the target enzyme.

2. The processing method as described in claim 1, characterized in that the data structure also includes a similarity score field, and the processing method further includes the following steps:
comparing the similarity score with the similarity threshold;
in response to the similarity score being greater than or equal to the similarity threshold, recording the similarity score in the similarity score field of the corresponding data structure; and
outputting the similarity score field of each data structure that is similar to the target enzyme.

3. The processing method as described in claim 1, characterized in that, before determining the corresponding candidate set based on the retrieval information, the processing method further includes the following steps:
constructing a database to store the directed evolution data of at least one enzyme;
defining at least one data structure according to the supported retrieval information types, wherein the data structure includes at least one of a sequence node structure, a structure node structure, and a ligand node structure;
constructing at least one candidate set to accommodate the data structure, wherein each candidate set accommodates one type of the data structure;
traversing the database according to the alignment information field corresponding to each of the retrieval information types, and creating a data structure for each piece of data in the database that involves the respective alignment information field; and
adding each created data structure to the corresponding candidate set according to the retrieval information type.

4. The processing method as described in claim 3, characterized in that the data structure also includes a unique identifier field, and, before determining the corresponding candidate set based on the retrieval information, the processing method further includes the following steps:
traversing the database based on the unique identifier field, and creating a data structure for each piece of data in the database; and
for data structures that do not involve the alignment information field, determining the content of the alignment information field based on the content recorded in the directed evolution data field, and / or outputting a prompt to supplement the content of the alignment information field.

5. The processing method as described in claim 1, characterized in that, The directed evolution data fields include at least one of the following: alternative name field, wild ancestor field, expression system field, unique number field, scientific name field, remarks field, biological origin field, parental field, catalytic performance field, reference field, enzyme sequence field, and enzyme structure field.

6. The processing method as described in claim 5, characterized in that, The expression system field adopts an expression system structure, which includes at least one of the following: expression host field, remarks field, enzyme-carrying gene field, and unique number field, and / or The catalytic performance field adopts a catalytic performance structure, which includes at least one of the following: reaction conditions field, experiment number field, reaction product field, reaction reagent field, and reaction substrate field, and / or The reference field adopts a reference structure, which includes at least one of the following: citation source field, publication date field, remarks field, citation URI field, and citation title field, and / or The enzyme sequence field adopts an enzyme sequence structure, which includes at least one of the following: GenBank accession number, GI number, mutation field, remarks field, citation URI field, sequence content field, sequence URI field, sequence category field, and UniProt ID field, and / or The enzyme structure field adopts an enzyme structure structure, which includes at least one of the following: ligand field, mutation field, memo field, citation URI field, sequence URI field, structure URI field, structure content field, and structure category field.

7. The processing method as described in claim 6, characterized in that, The reaction condition field adopts a reaction condition structure, which includes at least one of the following: relative humidity field, pH field, reaction time field, reactor field, and reaction temperature field, and / or The reaction product field adopts a reaction product structure, which includes at least one of the following: reaction conversion rate field, product diastereomer excess value field, product diastereomer ratio field, product enantiomer excess value field, enantiomer selectivity field, product enantiomer ratio field, product molecule field, product purity field, product positional isomer excess value field, product positional isomer ratio field, separation yield field, and in-situ yield field, and / or The reaction reagent field comprises at least one reaction reagent structure, each reaction reagent structure corresponding to a reaction reagent, wherein at least one of the following is included: addition method field, dilution method field, amount added field, reagent molecule field, and / or The reaction substrate field consists of at least one reaction substrate structure, each of which corresponds to a reaction substrate, and includes at least one of the following: addition method field, dilution method field, amount added field, and substrate molecule field.

8. The processing method as described in claim 7, characterized in that, The reactor field adopts a reactor structure, which includes at least one of the following: a stirring field, a diameter field, a height field, and a shape field, and / or At least one of the product molecule field, the reagent molecule field, and the substrate molecule field adopts a ligand structure, wherein the ligand structure includes at least one of the following: CAS number field, InChI field, scientific name field, SMILES field, and chemical structure field, and / or The addition method field adopts an addition method structure, which includes at least one of the following: method field, speed field, and time point field, and / or The dilution method field uses a dilution structure, which includes at least a diluent field and / or a diluent dosage field, wherein the diluent field uses the ligand structure, and / or The input quantity field adopts a physical quantity structure, which includes at least one of the following: a lower limit field, a target value field, a unit field, and an upper limit field.

9. The processing method as described in claim 8, characterized in that, The stirring field adopts a stirring structure, which includes at least one of the stirring amplitude field, stirring method field, and stirring speed field.

10. The processing method as described in claim 6, characterized in that, The mutation field consists of at least one mutation structure, each mutation structure corresponding to a mutation, and includes at least one of the following: mutation sequence fragment field, mutation location field, template sequence fragment field, and mutation type field.

11. The processing method according to any one of claims 1 to 10, characterized in that, Each of the fields consists of an interrelated field key and field value, wherein the field key indicates the field type and the field value indicates the content corresponding to the field type.

12. A method for processing directed evolution data of enzymes, characterized in that it includes the following steps:
obtaining retrieval information about a target enzyme, wherein the retrieval information includes structural content containing a specific structure of the target enzyme;
determining a corresponding candidate set based on the retrieval information, wherein the candidate set includes a structure candidate set, the structure candidate set includes at least one data structure, the data structure includes a structure node structure, the structure node structure includes at least a directed evolution data field and an alignment information field, the alignment information field includes an alignment structure field, and each alignment structure field records an enzyme structure;
performing a structure alignment operation on the structural content of the target enzyme and the alignment structure field of each structure node structure in the structure candidate set to determine the similarity score between the structural content of the target enzyme and each alignment structure field;
selecting, according to a preset similarity threshold, at least one data structure similar to the target enzyme from the candidate set; and
outputting the directed evolution data field and / or alignment information field of each data structure that is similar to the target enzyme.

13. A method for processing directed evolution data of enzymes, characterized in that it includes the following steps:
obtaining retrieval information about a target enzyme, wherein the retrieval information includes molecular information about a specific ligand of the target enzyme;
determining a corresponding candidate set based on the retrieval information, wherein the candidate set includes a ligand candidate set, the ligand candidate set includes at least one data structure, the data structure includes a ligand node structure, the ligand node structure includes at least a directed evolution data field and an alignment information field, the alignment information field includes an alignment ligand field, and each alignment ligand field records molecular information of a ligand;
performing a molecular similarity calculation on the molecular information of the specific ligand of the target enzyme and the alignment ligand field of each ligand node structure in the ligand candidate set to determine the similarity score between the molecular information of the specific ligand of the target enzyme and each alignment ligand field;
selecting, according to a preset similarity threshold, at least one data structure similar to the target enzyme from the candidate set; and
outputting the directed evolution data field and / or alignment information field of each data structure that is similar to the target enzyme.

14. A device for processing directed evolution data of enzymes, characterized in that it includes:
a memory; and
a processor, connected to the memory and configured to implement the method for processing directed evolution data of enzymes as described in any one of claims 1 to 13.

15. A computer-readable storage medium storing computer instructions thereon, characterized in that, when the computer instructions are executed by a processor, the method for processing directed evolution data of enzymes as described in any one of claims 1 to 13 is implemented.