Sequence-based machine learning enabled protein molecule design with protein structure representations
By using structural representations to guide sequence-based machine learning, the method addresses resource and data challenges in molecular design, enhancing the likelihood of generating protein sequences with desired properties for large molecule drugs.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- GENENTECH INC
- Filing Date
- 2025-12-23
- Publication Date
- 2026-07-02
AI Technical Summary
Conventional molecular design methods for large molecule drugs, such as antibodies and T-cell receptors, face challenges in efficiently exploring sequence modifications to enhance desirable properties while minimizing undesirable traits like immunogenicity and instability, due to resource limitations and data scarcity, leading to poor performance of computation models.
Incorporating structural representations, such as canonical conformation signatures and classification tokens, to guide sequence-based machine learning for generating protein molecules, ensuring similarity in three-dimensional structure and properties, using models trained on diverse datasets to preserve conformation and predict desirable traits.
Enhances the likelihood of generating protein sequences with desired properties by leveraging structural coherence, overcoming data limitations and resource constraints, resulting in more effective lead optimization and drug development.
Description
Attorney Ref.: 14786-116-228 (103963-228116) / P39820-WO-1
SEQUENCE-BASED MACHINE LEARNING ENABLED PROTEIN MOLECULE DESIGN WITH PROTEIN STRUCTURE REPRESENTATIONS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to U.S. Provisional Application No. 63/739,016, entitled “SEQUENCE-STRUCTURE COHERENCE FOR PROTEIN MOLECULE DESIGN” and filed on December 26, 2024, U.S. Provisional Application No. 63/901,206, entitled “SEQUENCE-STRUCTURE COHERENCE FOR PROTEIN MOLECULE DESIGN” and filed on October 17, 2025, and U.S. Provisional Application No. 63/879,205, entitled “MULTIMODAL COMPLEMENTARITY DETERMINING REGIONS TOKENIZER” and filed on September 10, 2025, the disclosures of which are incorporated herein by reference in their entireties.
TECHNICAL FIELD
[0002] The subject matter described herein relates generally to molecular design, and more specifically, to sequence-based machine learning enabled generation of protein molecules that incorporates protein structure representations.
INTRODUCTION
[0003] A molecule is a group of two or more atoms held together by chemical bonds. Molecules form the smallest identifiable unit into which a pure substance can be divided while still retaining the composition and chemical properties of that substance. Various properties of a molecule, including its ability to function as a therapeutic, may be contingent upon its composition and conformation (or three-dimensional structure). Large molecules (also known as biopharmaceuticals, biologicals, or biologics) are molecules ranging between approximately 3000 Daltons and 150,000 Daltons in molecular weight. Large molecule drugs are often derivatives of natural human proteins, which modulate many essential cellular functions such as enzymatic reactions, transport of molecules, regulation and execution of a number of biological pathways, cell growth, proliferation, nutrient uptake, morphology, motility, intercellular communication, and/or the like. Examples of therapeutic proteins include antibodies, chimeric antigen receptors (CARs), enzymes, hormones, cytokines, and/or the like. A single large molecule can have more than 1,300 amino acid residues, which are linked by peptide bonds to form one or more polypeptides. Due to their size and complexity, large molecule drugs are recombinantly produced by engineered cells instead of being chemically synthesized like the majority of small molecule drugs. Moreover, large molecule therapeutics are usually delivered through injection or infusion due to the ineffectiveness of oral administration. The development of a large molecule drug may entail designing one or more sequences of amino acid residues capable of binding to a target (e.g., a protein, a nucleic acid, and/or the like) with sufficient specificity and absent undesirable traits such as immunogenicity, self-association, instability, and/or the like.
SUMMARY
[0004] Systems, methods, and articles of manufacture, including computer program products, are provided for sequence-based machine learning enabled techniques for generating protein molecules, including antigen receptors such as antibodies and T-cell receptors (TCRs), in which the generation of novel protein sequences (or portions thereof) is guided by structural representations of protein molecules. In some cases, the generation of novel protein sequences may be guided by a sequence-structure coherence (SSC) metric encoding the three-dimensional structural features of each protein sequence. For example, in some cases, the guided generation of protein sequences may include generating a novel protein sequence whose canonical conformation signature, which specifies the canonical conformation class of the protein sequence, is sufficiently similar to the canonical conformation signature of a seed molecule exhibiting one or more properties of interest. Alternatively, the generation of novel protein sequences may be guided by one or more classification tokens categorizing the three-dimensional structure of protein molecules. In some cases, the guided generation of novel protein sequences may include generating a novel protein sequence having one or more classification tokens sufficiently similar to those of a seed molecule exhibiting one or more properties of interest.
[0005] In one aspect, there is provided a system for sequence-based machine learning enabled techniques for generating protein molecules guided by a structure-sequence coherence (SSC) metric. The system may include at least one data processor and at least one memory. The at least one memory may store instructions that result in operations when executed by the at least one data processor. The operations may include: identifying an input molecule, where the input molecule comprises an amino acid residue sequence; generating, using a canonical conformation computation model, a canonical conformation signature for the input molecule, where the canonical conformation signature of the input molecule is based at least on the amino acid residue sequence of the input molecule or the canonical conformation signature of the input molecule is; modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the canonical conformation signature of the input molecule; and generating an output molecule using the modified amino acid residue sequence of the input molecule.
[0006] In another aspect, there is provided a computer-implemented method for sequence-based machine learning enabled techniques for generating protein molecules guided by a structure-sequence coherence (SSC) metric. The method may include: identifying an input molecule, where the input molecule comprises an amino acid residue sequence; generating, using a canonical conformation computation model, a canonical conformation signature for the input molecule, where the canonical conformation signature of the input molecule is based at least on the amino acid residue sequence of the input molecule or the canonical conformation signature of the input molecule is; modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the canonical conformation signature of the input molecule; and generating an output molecule using the modified amino acid residue sequence of the input molecule.
[0007] In another aspect, there is provided a computer program product for sequence-based machine learning enabled techniques for generating protein molecules guided by a structure-sequence coherence (SSC) metric. The computer program product may include a non-transitory computer readable medium storing instructions that result in operations when executed by at least one data processor. The operations may include: identifying an input molecule, where the input molecule comprises an amino acid residue sequence; generating, using a canonical conformation computation model, a canonical conformation signature for the input molecule, where the canonical conformation signature of the input molecule is based at least on the amino acid residue sequence of the input molecule or the canonical conformation signature of the input molecule is; modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the canonical conformation signature of the input molecule; and generating an output molecule using the modified amino acid residue sequence of the input molecule.
[0008] In some variations, one or more features disclosed herein including the following features can optionally be included in any feasible combination.
[0009] In some variations, the modifying the amino acid residue sequence of the input molecule is guided by the canonical conformation signature of the input molecule such that a conformation of the output molecule is similar to the conformation of the input molecule.
[0010] In some variations, the canonical conformation computation model is trained to generate the canonical conformation signature of the input molecule by at least determining, for each canonical conformation class in a set of canonical conformation classes, a probability of the input molecule exhibiting a canonical conformation associated with the canonical conformation class.
[0011] In some variations, the canonical conformation signature of the input molecule comprises a probability distribution across each canonical conformation class in the set of canonical conformation classes.
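By way of a non-limiting illustration, a canonical conformation signature of this kind can be pictured as a normalized vector of per-class probabilities. The sketch below assumes that a model emits one raw score per canonical conformation class and that a softmax converts the scores into a distribution; the class names and scores are hypothetical and are not taken from this disclosure.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for four canonical conformation classes.
class_scores = {"class_0": 2.1, "class_1": 0.3, "class_2": -1.0, "class_3": 0.0}

probabilities = softmax(list(class_scores.values()))
# The signature maps each class to the probability of the molecule exhibiting it.
signature = dict(zip(class_scores, probabilities))
most_likely = max(signature, key=signature.get)
```

Keeping the full distribution, rather than only the most likely class, is what allows two signatures to be compared with a divergence or distance measure.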
[0012] In some variations, the set of canonical conformation classes is determined based at least on a conformation of each known protein molecule of a plurality of known protein molecules.
[0013] In some variations, the set of canonical conformation classes are determined by at least clustering, into one or more clusters, a dihedral representation of the conformation of each known protein molecule. Each cluster of the one or more clusters corresponds to a canonical conformation class.
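As a hedged sketch of the clustering step, the example below groups conformations, each represented as a flat list of backbone dihedral angles in degrees, with a simple greedy "leader" clustering. Both the distance function (the mean of 2(1 - cos Δθ) over paired angles) and the greedy scheme are assumptions chosen for brevity; this disclosure does not specify a particular clustering algorithm, and the angle values and threshold are made up.

```python
import math

def dihedral_distance(a, b):
    """Mean of 2 * (1 - cos(delta)) over paired dihedral angles given in degrees."""
    terms = [2.0 * (1.0 - math.cos(math.radians(x - y))) for x, y in zip(a, b)]
    return sum(terms) / len(terms)

def leader_cluster(conformations, threshold):
    """Assign each conformation to the first cluster whose representative lies
    within `threshold`; otherwise start a new cluster with it as representative."""
    representatives, labels = [], []
    for conf in conformations:
        for index, rep in enumerate(representatives):
            if dihedral_distance(conf, rep) <= threshold:
                labels.append(index)
                break
        else:
            representatives.append(conf)
            labels.append(len(representatives) - 1)
    return labels

# Hypothetical (phi, psi, phi, psi) angles for four equal-length loops.
loops = [
    [-60, -45, -65, -40],
    [-62, -43, -63, -42],
    [-120, 130, -115, 135],
    [-118, 128, -119, 133],
]
labels = leader_cluster(loops, threshold=0.1)
```

Each distinct label would then correspond to one canonical conformation class. The cosine-based distance is used because it respects the periodicity of angles, so that -179 degrees and 179 degrees are treated as close.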
[0014] In some variations, each canonical conformation class includes two or more protein molecules whose conformations exhibit a threshold similarity.
[0015] In some variations, the set of canonical conformation classes include a first canonical conformation class associated with a heavy chain of an antibody and a second canonical conformation class associated with a light chain of the antibody.
[0016] In some variations, the amino acid residue sequence comprising the input molecule includes one or more tokens separating a plurality of amino acid residues forming the heavy chain of the antibody and a plurality of amino acid residues forming the light chain of the antibody.
[0017] In some variations, a molecule design computation model is applied to modify the amino acid residue sequence of the input molecule.
[0018] In some variations, an intermediate molecule is generated by applying the molecule design computation model to modify the amino acid residue sequence of the input molecule. The canonical conformation computation model is applied to determine, based at least on an amino acid residue sequence of the intermediate molecule, a canonical conformation signature of the intermediate molecule. The intermediate molecule is identified as the output molecule based at least on the canonical conformation signature of the intermediate molecule exhibiting a threshold similarity to the canonical conformation signature of the input molecule.
[0019] In some variations, a similarity metric quantifying a similarity between the canonical conformation signature of the intermediate molecule and the canonical conformation signature of the input molecule is determined.
[0020] In some variations, the similarity metric comprises a Kullback-Leibler (KL) divergence or a Wasserstein metric.
[0021] In some variations, the intermediate molecule is identified as the output molecule further based on the canonical conformation signature of the intermediate molecule exhibiting a greater similarity to the canonical conformation signature of the input molecule than a canonical conformation signature of another intermediate molecule.
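For illustration only, the selection of an output molecule from several intermediate molecules can be sketched with a Kullback-Leibler divergence between signatures. The direction of the divergence (input relative to candidate), the threshold value, and the candidate signatures are all assumptions made for the sake of the example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def select_output(input_signature, candidates, max_divergence):
    """Return the name of the candidate whose signature is closest to the input
    signature, provided it is within `max_divergence`; otherwise return None."""
    best_name, best_value = None, float("inf")
    for name, signature in candidates.items():
        value = kl_divergence(input_signature, signature)
        if value < best_value:
            best_name, best_value = name, value
    return best_name if best_value <= max_divergence else None

# Hypothetical signatures over three canonical conformation classes.
input_signature = [0.7, 0.2, 0.1]
candidates = {
    "intermediate_a": [0.65, 0.25, 0.10],
    "intermediate_b": [0.10, 0.20, 0.70],
}
chosen = select_output(input_signature, candidates, max_divergence=0.05)
```

A Wasserstein metric could be substituted for `kl_divergence` without changing the selection logic; unlike the KL divergence, it is symmetric and remains finite when a class has zero probability in one signature.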
[0022] In some variations, the intermediate molecule is identified as the output molecule further based on the intermediate molecule exhibiting one or more desirable properties.
[0023] In some variations, the amino acid residue sequence of the input molecule is modified by inserting, deleting, and/or changing an identity of one or more amino acid residues in the amino acid residue sequence.
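The three kinds of modification can be sketched as simple string operations on a one-letter amino acid sequence. The helper names, zero-based positions, and the short example sequence are hypothetical and serve only to make the operations concrete.

```python
def substitute(sequence, position, residue):
    """Change the identity of the residue at a zero-based position."""
    return sequence[:position] + residue + sequence[position + 1:]

def insert(sequence, position, residue):
    """Insert a residue before the given zero-based position."""
    return sequence[:position] + residue + sequence[position:]

def delete(sequence, position):
    """Delete the residue at a zero-based position."""
    return sequence[:position] + sequence[position + 1:]

seed = "ARDYW"  # hypothetical fragment, not taken from this disclosure
variants = [substitute(seed, 2, "E"), insert(seed, 1, "G"), delete(seed, 4)]
```

Note that insertions and deletions change the sequence length, whereas substitutions preserve it.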
[0024] In some variations, the input molecule comprises an antigen receptor.
[0025] In some variations, the input molecule comprises a complementarity determining region (CDR) of an antibody.
[0026] In some variations, the input molecule comprises a variable domain of a heavy chain of an antibody and/or a variable domain of a light chain of the antibody.
[0027] In some variations, the canonical conformation computation model has been trained based on a training dataset that includes, for each known protein molecule of a plurality of known protein molecules, an amino acid residue sequence of the known protein molecule and a ground-truth canonical conformation signature.
[0028] In some variations, the molecule design computation model comprises a diffusion model that has been trained to generate the output molecule by making incremental modifications to the amino acid residue sequence of the input molecule.
[0029] In some variations, the canonical conformation computation model comprises an ensemble of an N quantity of ensemble elements. Each ensemble element includes a canonical conformation classifier trained to determine a canonical conformation signature of a different one of a complementarity determining region (CDR) in the input molecule.
[0030] In some variations, the canonical conformation signature of the input molecule is determined based on an output from every ensemble element of the N quantity of ensemble elements.
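A minimal structural sketch of such an ensemble follows, assuming one classifier per complementarity determining region (CDR) and a signature formed by collecting every classifier's output. The classifiers here are placeholders that return a uniform distribution; a trained model would take their place. The CDR names, sequences, and class count are hypothetical.

```python
def make_placeholder_classifier(num_classes):
    """Stand-in for a trained classifier; always returns a uniform distribution."""
    def classify(cdr_sequence):
        return [1.0 / num_classes] * num_classes
    return classify

def ensemble_signature(cdr_sequences, classifiers):
    """Collect one per-CDR distribution from every ensemble element."""
    missing = sorted(set(classifiers) - set(cdr_sequences))
    if missing:
        raise ValueError(f"no sequence supplied for: {missing}")
    return {name: classify(cdr_sequences[name]) for name, classify in classifiers.items()}

# N = 6 ensemble elements, one per CDR of a hypothetical antibody.
cdr_names = ("H1", "H2", "H3", "L1", "L2", "L3")
classifiers = {name: make_placeholder_classifier(4) for name in cdr_names}
cdr_sequences = {
    "H1": "GFTFSSYA", "H2": "ISGSGGST", "H3": "ARDYW",
    "L1": "QSISSY", "L2": "AAS", "L3": "QQSYST",
}
signature = ensemble_signature(cdr_sequences, classifiers)
```

Raising an error when a CDR sequence is missing reflects the requirement that the signature draw on an output from every ensemble element.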
[0031] In another aspect, there is provided a system for sequence-based machine learning enabled techniques for generating protein molecules guided by a structural classification (CLS) token. The system may include at least one data processor and at least one memory. The at least one memory may store instructions that result in operations when executed by the at least one data processor. The operations may include: identifying an input molecule, where the input molecule comprises an amino acid residue sequence; generating, using a tokenization model, a multimodal representation of the input molecule, where the multimodal representation of the input molecule includes, for each amino acid residue in the amino acid residue sequence of the input molecule, a multimodal token encoding an identity and one or more backbone dihedral angles of the amino acid residue, and where the multimodal representation of the input molecule further includes a classification (CLS) token encoding a three-dimensional structure of the input molecule; modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the multimodal representation of the input molecule; and generating an output molecule using the modified amino acid residue sequence of the input molecule.
[0032] In another aspect, there is provided a computer-implemented method for sequence-based machine learning enabled techniques for generating protein molecules guided by a structural classification (CLS) token. The method may include: identifying an input molecule, where the input molecule comprises an amino acid residue sequence; generating, using a tokenization model, a multimodal representation of the input molecule, where the multimodal representation of the input molecule includes, for each amino acid residue in the amino acid residue sequence of the input molecule, a multimodal token encoding an identity and one or more backbone dihedral angles of the amino acid residue, and where the multimodal representation of the input molecule further includes a classification (CLS) token encoding a three-dimensional structure of the input molecule; modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the multimodal representation of the input molecule; and generating an output molecule using the modified amino acid residue sequence of the input molecule.
[0033] In another aspect, there is provided a computer program product for sequence-based machine learning enabled techniques for generating protein molecules guided by a structural classification (CLS) token. The computer program product may include a non-transitory computer readable medium storing instructions that result in operations when executed by at least one data processor. The operations may include: identifying an input molecule, where the input molecule comprises an amino acid residue sequence; generating, using a tokenization model, a multimodal representation of the input molecule, where the multimodal representation of the input molecule includes, for each amino acid residue in the amino acid residue sequence of the input molecule, a multimodal token encoding an identity and one or more backbone dihedral angles of the amino acid residue, and where the multimodal representation of the input molecule further includes a classification (CLS) token encoding a three-dimensional structure of the input molecule; modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the multimodal representation of the input molecule; and generating an output molecule using the modified amino acid residue sequence of the input molecule.
[0034] In some variations, one or more features disclosed herein including the following features can optionally be included in any feasible combination.
[0035] In some variations, the modifying the amino acid residue sequence of the input molecule is guided by the multimodal representation of the input molecule such that a three-dimensional structure of the output molecule is similar to the three-dimensional structure of the input molecule.
[0036] In some variations, the modifying the amino acid residue sequence of the input molecule is guided by the classification (CLS) token encoding the three-dimensional structure of the input molecule.
[0037] In some variations, the amino acid residue sequence of the input molecule is modified such that the modified amino acid residue sequence is assigned a same classification (CLS) token.
[0038] In some variations, the multimodal token of each amino acid residue encodes a phi (φ) angle, a psi (ψ) angle, and an omega (ω) angle comprising a backbone of the amino acid residue.
[0039] In some variations, the multimodal token representation of the input molecule is generated by at least, for each amino acid residue in the amino acid residue sequence of the input molecule, combining the identity and the one or more backbone dihedral angles of the amino acid residue by element-wise addition, and encoding a combination of the identity and the one or more backbone dihedral angles to generate the multimodal token of the amino acid residue.
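As a sketch of the combining step only, the example below embeds a residue identity and its three backbone dihedral angles into vectors of the same width and adds them element by element. The embedding width, the random (untrained) lookup table and projection matrix, and the sine/cosine angle features are assumptions for illustration; the subsequent encoding step (for example, by transformer blocks) is omitted.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIM = 8  # hypothetical embedding width
rng = random.Random(0)

# Untrained stand-ins for learned parameters.
IDENTITY_TABLE = {aa: [rng.gauss(0, 1) for _ in range(DIM)] for aa in AMINO_ACIDS}
ANGLE_PROJECTION = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(DIM)]

def angle_features(phi, psi, omega):
    """Represent each angle (degrees) by its sine and cosine to respect periodicity."""
    features = []
    for degrees in (phi, psi, omega):
        radians = math.radians(degrees)
        features.extend([math.sin(radians), math.cos(radians)])
    return features

def angle_embedding(phi, psi, omega):
    """Project the six angle features to the same width as the identity embedding."""
    features = angle_features(phi, psi, omega)
    return [sum(w * f for w, f in zip(row, features)) for row in ANGLE_PROJECTION]

def combine(residue, phi, psi, omega):
    """Element-wise addition of the identity and dihedral-angle embeddings."""
    identity = IDENTITY_TABLE[residue]
    angles = angle_embedding(phi, psi, omega)
    return [i + a for i, a in zip(identity, angles)]

combined = combine("A", -60.0, -45.0, 180.0)
```

Element-wise addition requires both embeddings to share a width, which is why the angle features are projected to `DIM` before being summed with the identity embedding.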
[0040] In some variations, a property computation model is applied to determine, based on at least a portion of the multimodal representation of the input molecule, one or more properties of the input molecule.
[0041] In some variations, the property computation model determines, based at least on the classification (CLS) token, the one or more properties of the input molecule.
[0042] In some variations, the property computation model determines the one or more properties of the input molecule further based on the multimodal token of each amino acid residue in the amino acid residue sequence of the input molecule.
[0043] In some variations, the tokenization model is trained to generate the multimodal representation of the input molecule.
[0044] In some variations, the tokenization model is trained to reconstruct one or more masked amino acid residue identities and/or masked backbone dihedral angles in a plurality of training samples.
[0045] In some variations, the plurality of training samples include at least one training sample comprising a protein molecule in which an identity or a backbone dihedral angle of some of a plurality of amino acid residues in the protein molecule is masked.
[0046] In some variations, the plurality of training samples include at least one training sample comprising a protein molecule in which a backbone dihedral angle of some of a plurality of amino acid residues in the protein molecule is masked with an exception of an identity of any of the plurality of amino acid residues.
[0047] In some variations, the plurality of training samples include at least one training sample comprising a protein molecule in which an identity of a plurality of amino acid residues in the protein molecule is masked with an exception of a backbone dihedral angle of any of the plurality of amino acid residues.
[0048] In some variations, the plurality of training samples include at least one training sample comprising a protein molecule in which a backbone dihedral angle of a plurality of amino acid residues in the protein molecule is masked with an exception of an identity of any of the plurality of amino acid residues.
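The masking variations above can be sketched with a single helper that hides residue identities, backbone dihedral angles, or both, at randomly chosen positions. The mask symbol, the `mode` names, the masking fraction, and the example molecule are hypothetical; the disclosure does not fix any of these details.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol

def mask_sample(residues, angles, mode, fraction, seed=0):
    """Mask a fraction of positions.

    mode == "identity": hide residue identities, keep all angles.
    mode == "angles":   hide dihedral angles, keep all identities.
    mode == "both":     hide identity and angles at the chosen positions.
    Returns the masked residues, the masked angles, and the chosen positions.
    """
    if mode not in ("identity", "angles", "both"):
        raise ValueError(f"unknown mode: {mode}")
    rng = random.Random(seed)
    count = max(1, round(fraction * len(residues)))
    chosen = sorted(rng.sample(range(len(residues)), count))
    masked_residues, masked_angles = list(residues), list(angles)
    for position in chosen:
        if mode in ("identity", "both"):
            masked_residues[position] = MASK
        if mode in ("angles", "both"):
            masked_angles[position] = None
    return masked_residues, masked_angles, chosen

# Hypothetical five-residue loop with (phi, psi, omega) per residue, in degrees.
residues = list("ARDYW")
angles = [(-60.0, -45.0, 180.0)] * 5
masked_residues, masked_angles, positions = mask_sample(residues, angles, "angles", 0.4)
```

During training, the model would be asked to reconstruct the original values at the returned positions from the unmasked context.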
[0049] In some variations, the tokenization model is trained on a contrastive objective to generate similar multimodal representations for a positive pair of sample molecules having similar three-dimensional structures and dissimilar multimodal representations for a negative pair of sample molecules having dissimilar three-dimensional structures.
[0050] In some variations, the training the tokenization model includes identifying a pair of sample molecules as the positive pair based at least on (i) the pair of sample molecules containing a same quantity of amino acid residues, and (ii) a dihedral angle distance between a three-dimensional structure of each sample molecule satisfying one or more thresholds.
[0051] In some variations, the training the tokenization model includes identifying the pair of sample molecules as the negative pair based at least on at least one of (i) the pair of sample molecules containing different quantities of amino acid residues, and (ii) the dihedral angle distance between the three-dimensional structure of each sample molecule satisfying one or more different thresholds.
[0052] In some variations, the training the tokenization model includes identifying the pair of sample molecules as neither the positive pair nor the negative pair based at least on the dihedral angle distance between the three-dimensional structure of each sample molecule failing to satisfy the one or more thresholds or the one or more different thresholds, and excluding the pair of sample molecules from the training the tokenization model.
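The pair-labeling rules above can be sketched as follows. The distance function, the two threshold values, and the interpretation that positives fall at or below one threshold while negatives fall at or above a larger one are assumptions for illustration; structures are represented as flat lists of dihedral angles in degrees.

```python
import math

def dihedral_distance(a, b):
    """Mean of 2 * (1 - cos(delta)) over paired dihedral angles given in degrees."""
    terms = [2.0 * (1.0 - math.cos(math.radians(x - y))) for x, y in zip(a, b)]
    return sum(terms) / len(terms)

def label_pair(angles_a, angles_b, positive_max=0.1, negative_min=1.0):
    """Return "positive", "negative", or None (pair excluded from training)."""
    if len(angles_a) != len(angles_b):
        return "negative"  # different quantities of residues
    distance = dihedral_distance(angles_a, angles_b)
    if distance <= positive_max:
        return "positive"
    if distance >= negative_min:
        return "negative"
    return None  # ambiguous similarity: neither positive nor negative

reference = [-60.0, -45.0]
labels = [
    label_pair(reference, [-62.0, -43.0]),         # nearly identical angles
    label_pair(reference, [60.0, 75.0]),           # each angle shifted by 120 degrees
    label_pair(reference, [-20.0, -5.0]),          # each angle shifted by 40 degrees
    label_pair(reference, [-60.0, -45.0, 180.0]),  # different length
]
```

Leaving a gap between the two thresholds keeps borderline pairs out of the contrastive objective, so the model is not pushed to separate or merge structures whose similarity is ambiguous.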
[0053] In some variations, the tokenization model is trained to determine a codebook including a plurality of discrete classification (CLS) tokens.
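One way to picture a codebook of discrete classification (CLS) tokens is as a table of vectors in which a continuous embedding is replaced by the index of its nearest entry, in the style of vector quantization. That lookup rule, the two-dimensional vectors, and the codebook size are assumptions for illustration and are not specified in this disclosure.

```python
def nearest_code(embedding, codebook):
    """Return the index of the codebook entry closest to the embedding
    under squared Euclidean distance."""
    best_index, best_distance = None, float("inf")
    for index, code in enumerate(codebook):
        distance = sum((e - c) ** 2 for e, c in zip(embedding, code))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index

# Hypothetical codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
cls_token = nearest_code([0.9, 1.2], codebook)
```

Under this picture, two molecules whose embeddings map to the same index would be assigned the same classification (CLS) token.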
[0054] In some variations, the input molecule comprises an antigen receptor, a variable domain, or a complementarity determining region (CDR).
[0055] In some variations, the input molecule includes a plurality of complementarity determining regions (CDRs) of an antibody. The multimodal representation of the input molecule is generated to include, for each complementarity determining region (CDR) of the plurality of complementarity determining regions (CDRs), a corresponding classification (CLS) token encoding a three-dimensional structure of the complementarity determining region (CDR).
[0056] In some variations, the tokenization model includes one or more transformer blocks.
[0057] Implementations of the current subject matter can include, but are not limited to, methods consistent with the descriptions provided herein as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to result in operations implementing one or more of the described features. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a non-transitory computer-readable or machine-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including, for example, to a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.
[0058] The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to the design of protein molecules, including antigen receptors such as antibodies and T-cell receptors (TCR), it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.
DESCRIPTION OF DRAWINGS
[0059] The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,
[0060] FIG. 1A depicts a system diagram illustrating an example of a molecule design system, in accordance with some example embodiments;
[0061] FIG. 1B depicts a system diagram illustrating an example of a molecule design system, in accordance with some example embodiments;
[0062] FIG. 2A depicts a flowchart illustrating an example of a process for canonical conformation guided molecule generation, in accordance with some example embodiments;
[0063] FIG. 2B depicts a flowchart illustrating an example of a process for training a canonical conformation computation model, in accordance with some example embodiments;
[0064] FIG. 3A depicts a flowchart illustrating an example of a process for token guided molecule generation, in accordance with some example embodiments;
[0065] FIG. 3B depicts a flowchart illustrating an example of a process for training a tokenization model, in accordance with some example embodiments;
[0066] FIG. 4 depicts a schematic diagram illustrating an example of the planes present in the amino acid residues of a portion of a protein molecule, in accordance with some example embodiments;
[0067] FIG. 5A depicts a schematic diagram illustrating an example of a process for sequence-structure coherence based generation of protein sequences, in accordance with some example embodiments;
[0068] FIG. 5B depicts a schematic diagram illustrating an example of a molecule design system performing canonical conformation signature guided protein sequence generation, in accordance with some example embodiments;
[0069] FIG. 6A depicts a schematic diagram illustrating an example of a tokenization model generating a tokenized representation of an input molecule, in accordance with some example embodiments;
[0070] FIG. 6B depicts a schematic diagram illustrating an example of a process for training a tokenization model on a contrastive learning objective, in accordance with some example embodiments;
[0071] FIG. 6C depicts a schematic diagram illustrating an example of a process for training a tokenization model on masked dihedral angle and masked amino acid residue identity objectives, in accordance with some example embodiments;
[0072] FIG. 6D depicts a schematic diagram illustrating an example of a process for training a tokenization model on codebook learning, in accordance with some example embodiments;
[0073] FIG. 7 depicts the canonical cluster landscapes of SAbDab and ABB2 folded pOAS structures, in accordance with some example embodiments;
[0074] FIG. 8 depicts a comparison of structural assignment performance using canonical conformation signatures and dihedral-angle based clustering of ESMFold and ABB2 folded structures, in accordance with some example embodiments;
[0075] FIG. 9 depicts the results of canonical conformation signature guided novel protein sequence generation, in accordance with some example embodiments;
[0076] FIG. 10 depicts native binder generation and scoring using canonical conformation signatures, in accordance with some example embodiments;
[0077] FIG. 11 depicts the distribution of different complementarity determining region (CDR) loop structures across classification (CLS) token codebooks, in accordance with some example embodiments;
[0078] FIG. 12 depicts the results from controllable sampling of antibody loops using multimodal representations, in accordance with some example embodiments; and
[0079] FIG. 13 depicts a block diagram illustrating an example of a computing system, in accordance with some example embodiments.
[0080] When practical, similar reference numbers denote similar structures, features, or elements.
DETAILED DESCRIPTION
[0081] A molecule may be designed to exhibit multiple properties of interest including, in the case of therapeutics, drug-like properties such as binding affinity, specificity, biological activity, developability, and/or the like. Lead optimization is one variation of molecular design in which a lead molecule (or another select molecule) is modified to enhance desirable properties while minimizing undesirable ones. For example, in some cases, the lead molecule may be an antibody identified through an animal immunization campaign as having binding affinity towards a target molecule, such as a viral antigen, a tumor antigen, and/or the like. While binding affinity is an example of a desirable property, the lead molecule may also exhibit one or more undesirable properties, such as insufficient human-ness, poor expression, immunogenicity, in vivo instability, and/or the like. As is, the lead molecule is unlikely to be a viable protein therapeutic and is therefore unsuitable for further drug development efforts. Instead, the lead molecule may undergo lead optimization, which in this case may include modifying the underlying sequence of amino acid residues, for example, by changing the identity of one or more constituent amino acid residues, such that the resulting molecules exhibit better properties than the lead molecule.
[0082] Modifications that improve the lead molecule, particularly those that enhance desirable properties without introducing liabilities, may be especially difficult to identify at least because of the vast combinatorial space of possibilities. For example, even for point mutations affecting a single amino acid residue, a lead molecule with an N quantity of amino acid residues can spawn 20N possible modifications. The number of possible modifications grows exponentially as the modifications to the lead molecule increase in scale, for example, from a single point mutation to mutations affecting multiple amino acid residues. Due to limitations in time and laboratory resources, conventional lead optimization protocols relying on wet lab analysis (e.g., in vitro measurements, in vivo characterization, and / or the like) to examine the effects of each modification cannot adequately explore the possible modifications that can be made to a lead molecule.
[0083] While conventional, wet-lab-based approaches to molecular design are too resource intensive to support an expansive exploration of different molecule designs, more recent efforts to leverage artificial intelligence and machine learning are challenged by the scarcity of labeled (or annotated) data. In fact, the procurement of data arising from the development of protein-based therapeutics is also cost prohibitive due to the same limitations in time and laboratory resources. As such, computation models trained to minimize the need for wet lab analysis, such as those capable of predicting the properties of protein molecules, are often confined to low data regimes in which the quantity of training data is limited and the distribution of the training data fails to capture the full range of possible inputs that the computation models are likely to encounter in deployment. Computation models trained in low data regimes tend to overfit to the training data and perform poorly as a consequence. For example, if a training dataset contains too few protein molecules with known values for a particular property, a computation model trained on that training dataset may be unable to accurately predict the value of that property for protein molecules outside the distribution of the training dataset. Because a poorly performing computation model cannot serve as an adequate in silico surrogate for wet lab analysis of modified lead molecules during lead optimization, the same bottlenecks persist in the drug discovery process.
[0084] Various embodiments of the present disclosure overcome the limitations of existing molecular design protocols, including computational methodologies, by incorporating structural representations that leverage the commonalities present in the conformations of protein molecules. For example, in some cases, a computation model may be trained to generate the structural representation of an input molecule exhibiting one or more properties of interest, such as binding affinity, specificity, biological activity, developability, and / or the like. In some cases, the generation of novel protein sequences may be guided by the structural representation of the input molecule such that the resulting output molecules exhibit sufficient structural similarities to the input molecule and are therefore more likely to exhibit the same properties of interest as the input molecule. Furthermore, in some cases, one or more properties of the novel protein sequences may be determined based at least on the corresponding structural representations. In some cases, the structural representation of the input molecule may capture the three-dimensional structure of the input molecule based on a limited quantity of structural classes (or categories). For instance, in some cases, the structural representation of the input molecule may include one or more classification tokens encoding the three-dimensional structure of the input molecule based on a limited quantity of structural classes (or categories). Alternatively, in some cases, the structural representation of the input molecule may include a canonical conformation signature specifying the three-dimensional structure of the input molecule based on a limited quantity of canonical conformations.
[0085] In some example embodiments, the generation of novel protein sequences may be guided by the canonical conformation signature of the input molecule generated, for example, by a canonical conformation computation model. In some cases, the canonical conformation signature of the input molecule may specify the three-dimensional structure of at least a portion of the input molecule based on a limited quantity of possible canonical conformations. For example, in the context of antibodies, five of the six complementarity determining regions (CDRs) in the variable region (Fv) of an antibody are able to assume a limited number of canonical conformations. Accordingly, as used herein, the term “canonical conformation” refers to one of a limited number of possible conformations (or three-dimensional structures) of a class of protein molecules or a portion of that class of protein molecules. In some cases, the canonical conformation of a protein molecule or a portion of the protein molecule may be defined by the quantity of constituent amino acid residues and the identities of the amino acid residues present at certain key sites. For instance, the canonical conformation of a complementarity determining region (CDR) loop of an antibody may be defined by the length of the loop (e.g., the number of amino acid residues forming the loop) as well as the identities of the amino acid residues occupying one or more key sites. In some cases, the canonical conformation computation model may be trained to determine the canonical conformation signature of the input protein molecule to include, for each possible canonical conformation class, a probability of the input protein molecule having the corresponding canonical conformation.
[0086] In some cases, the input protein molecule may be an antigen receptor (e.g., an antibody, a T-cell receptor (TCR), and / or the like) or a portion of the antigen receptor (e.g., a complementarity determining region (CDR) loop and / or the like). Accordingly, in some cases, the canonical conformation computation model may ingest, as input, the sequence of amino acid residues forming a complementarity determining region (CDR) of an antigen receptor, such as an antibody, a T-cell receptor (TCR), and / or the like. Moreover, the canonical conformation computation model may determine, based at least on the sequence of amino acid residues, the probability of the complementarity determining region (CDR) of the input protein molecule assuming the canonical conformation of each canonical conformation class within a set of possible canonical conformation classes. In some cases, the canonical conformation computation model may output, as the canonical conformation signature of the input protein molecule, a probability distribution across the set of possible canonical conformation classes. As described in more detail below, the set of possible canonical conformation classes may be determined based on the dihedral representations of the conformations (or three-dimensional structures) of known protein molecules or a portion of known protein molecules, such as the complementarity determining region (CDR) loops of known antigen receptors (e.g., antibodies, T-cell receptors (TCRs), and / or the like).
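As an illustration of the signature described above, the following minimal sketch turns per-class scores into a probability distribution across a set of canonical conformation classes. The softmax normalization, the function name, the class labels, and the example scores are all assumptions made for illustration; the disclosure specifies only that the signature is a probability distribution, not how a trained model arrives at it.

```python
import math

def canonical_conformation_signature(class_scores):
    """Normalize per-class scores into a probability distribution.

    Softmax is an assumed normalization; a trained canonical conformation
    computation model could produce its distribution differently.
    """
    top = max(class_scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - top) for name, score in class_scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

# Hypothetical scores for a single CDR loop over three illustrative classes.
signature = canonical_conformation_signature(
    {"class_A": 2.0, "class_B": 0.5, "class_C": -1.0}
)
most_likely = max(signature, key=signature.get)
```

Here `signature` maps each class label to its probability and `most_likely` names the class with the highest probability.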
[0087] In some example embodiments, the set of possible canonical conformation classes associated with the input protein molecule may be determined by at least clustering, into one or more clusters, the dihedral representations of known protein molecules or portions of known protein molecules. In some cases, each cluster of the one or more clusters may correspond to a canonical conformation class. In some cases, the dihedral representation of a protein molecule or a portion of the protein molecule may include the dihedral angle between each pair of amino acid residues present in the protein molecule or the portion of the protein molecule. In some cases, the dihedral angle between a pair of amino acid residues is the dihedral angle between two planes (e.g., peptide or amide planes), each of which is formed by the carbonyl carbon (C) atom, oxygen (O) atom, amide nitrogen (N) atom, and amide hydrogen (H) atom in the corresponding residue. These four atoms lie in the same plane due to the partial double bond character of the peptide bond between the carbonyl carbon (C) atom and the amide nitrogen (N) atom. In some cases, the dihedral representation of the protein molecule or the portion of the protein molecule may include the dihedral angle between the planes of pairs of amino acid residues that are adjacent in the sequence of amino acid residues as well as pairs that are non-adjacent in the sequence of amino acid residues. In other words, in some cases, the dihedral representation of the protein molecule or the portion of the protein molecule is configured to capture the relative orientations of every pair of constituent amino acid residues, regardless of their ordering in the sequence of amino acid residues or their distance from one another in Cartesian space.
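The pairwise dihedral representation described above can be sketched as follows. This is a minimal illustration under stated assumptions: each peptide plane is defined from three of its atoms (three non-collinear points suffice to fix a plane), the angle between two planes is taken as the unsigned angle between their unit normals, and all function names and coordinates are hypothetical. Clustering of the resulting representations is not shown.

```python
import math
from itertools import combinations

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def plane_normal(p0, p1, p2):
    """Unit normal of the plane through three non-collinear atoms."""
    n = _cross(_sub(p1, p0), _sub(p2, p0))
    length = math.sqrt(_dot(n, n))  # zero if the atoms are collinear
    return tuple(x / length for x in n)

def plane_angle(n1, n2):
    """Unsigned angle in radians between two planes, from their unit normals."""
    cosine = max(-1.0, min(1.0, _dot(n1, n2)))  # clamp against rounding error
    return math.acos(cosine)

def dihedral_representation(planes):
    """Angle between the planes of every residue pair, adjacent or not."""
    normals = [plane_normal(*atoms) for atoms in planes]
    return {(i, j): plane_angle(normals[i], normals[j])
            for i, j in combinations(range(len(normals)), 2)}

# Three hypothetical planes: the xy-plane, the xz-plane, and a shifted xy-plane.
example_planes = [
    ((0, 0, 0), (1, 0, 0), (0, 1, 0)),
    ((0, 0, 0), (1, 0, 0), (0, 0, 1)),
    ((0, 0, 5), (1, 0, 5), (0, 1, 5)),
]
representation = dihedral_representation(example_planes)
```

The dictionary is keyed by residue index pairs, so a loop of n residues yields n(n-1)/2 angles, covering non-adjacent pairs as well as adjacent ones.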
[0088] In some example embodiments, the input protein molecule may be a seed protein molecule such that one or more output protein molecules may be generated based at least on the canonical conformation signature of the input protein molecule. For example, in some cases, a molecule design computation model may be applied to generate an output protein molecule by at least modifying the sequence of the input protein molecule. In some cases, the input protein sequence may be modified by changing the identity of one or more amino acid residues including, in instances where the input protein sequence is rendered in a fixed-length representation, by replacing an amino acid residue with a gap character in the sequence to delete the amino acid residue or replacing a gap character in the sequence with an amino acid residue to insert the amino acid residue. In some cases, the generating of the output protein molecule may be guided by the canonical conformation signature of the input protein molecule and the canonical conformation signature of the output protein molecule in order to preserve the canonical conformation of the input protein molecule. For instance, if the input protein molecule exhibits certain desirable properties, such as binding affinity towards a target molecule (e.g., a viral antigen, a tumor antigen, and / or the like), generating the output protein molecule to exhibit the same (or similar) canonical conformation may increase the likelihood of the output protein molecule having a different underlying sequence of amino acid residues but the same desirable characteristics as the input protein molecule. Accordingly, in some cases, the output protein molecule may be generated to reduce (or minimize) a difference between the canonical conformation signature of the output protein molecule and that of the input protein molecule. 
As described in more detail below, the difference between the canonical conformation signature of the output protein molecule and that of the input protein molecule may correspond to a difference between the respective probability distributions across the set of possible canonical conformations.
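The fixed-length representation mentioned above, in which insertions and deletions become swaps with a gap character, can be sketched as follows. The gap symbol `-`, the helper names, and the example sequence are assumptions for illustration only.

```python
GAP = "-"  # assumed gap character; the disclosure does not fix a symbol

def substitute(aligned, pos, residue):
    """Change the identity of the residue at a position."""
    if aligned[pos] == GAP or residue == GAP:
        raise ValueError("substitution swaps one residue for another")
    return aligned[:pos] + residue + aligned[pos + 1:]

def delete(aligned, pos):
    """Delete a residue by replacing it with the gap character."""
    if aligned[pos] == GAP:
        raise ValueError("position is already a gap")
    return aligned[:pos] + GAP + aligned[pos + 1:]

def insert(aligned, pos, residue):
    """Insert a residue by replacing a gap character with it."""
    if aligned[pos] != GAP:
        raise ValueError("insertion requires a gap at this position")
    return aligned[:pos] + residue + aligned[pos + 1:]

def ungapped(aligned):
    """Recover the actual amino acid sequence."""
    return aligned.replace(GAP, "")

seed = "ARDY--FDY"                 # hypothetical loop padded to nine positions
shorter = delete(seed, 3)          # "ARD---FDY"
longer = insert(seed, 4, "G")      # "ARDYG-FDY"
mutated = substitute(seed, 0, "V") # "VRDY--FDY"
```

Every edit keeps the representation at the same length, so a model operating on fixed-size inputs can propose length changes without any special handling.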
[0089] In some example embodiments, the generation of novel protein sequences may be guided by one or more classification tokens categorizing, into a limited quantity of structural classes (or categories), the three-dimensional structure of an input molecule exhibiting one or more properties of interest (e.g., binding affinity, specificity, biological activity, developability, and / or the like). For example, in some cases, the input molecule may include a sequence of amino acid residues. In some cases, a tokenization model may be trained to generate a multimodal representation of at least a portion of the input molecule such as, for example, one or more complementarity determining regions (CDRs) in instances where the input molecule is an antibody. In some cases, the multimodal representation of the input molecule may include, for each constituent amino acid residue in at least the portion of the input molecule, a multimodal token encoding both the identity and structure (e.g., one or more backbone dihedral (or torsion) angles such as phi (φ), psi (ψ), and omega (ω)) of the amino acid residue. In some cases, the tokenization model may be trained to generate the multimodal representation of the input molecule to include at least one classification (CLS) token encoding the overall three-dimensional structure of at least the portion of the input molecule. For instance, in some cases, the tokenization model may generate a classification (CLS) token for one or more different portions of the input molecule, such as a classification (CLS) token for each complementarity determining region (CDR) in instances where the input molecule is an antibody. In some cases, the classification (CLS) token may assign, based at least on the identity and the constituent backbone dihedral (or torsion) angles of each amino acid residue in at least the portion of the input molecule, the three-dimensional structure of the input molecule to one or more structural classes (or categories).
[0090] In some example embodiments, the generation of novel protein sequences, for example, by a molecule design computation model, may be guided by the classification (CLS) token associated with the input molecule. For example, in some cases, one or more output protein sequences may be generated based on the respective classification (CLS) tokens of the input molecule and the output molecule. In some cases, the output molecule may be generated to be assigned the same classification token as the input molecule such that the three-dimensional structure of the output molecule exhibits sufficient similarities to the three-dimensional structure of the input molecule. In some cases, in addition to the classification (CLS) tokens of the input molecule and the output molecule, the generation of the output molecule may further incorporate the sequence of multimodal tokens encoding the identity and structure (e.g., one or more backbone dihedral (or torsion) angles such as phi (φ), psi (ψ), and omega (ω)) of the individual amino acid residues forming the input molecule and the output molecule. In some cases, one or more properties of the output molecule may be determined based on the structural representation of the output molecule including, for example, the classification (CLS) tokens, the multimodal tokens, and / or the like.
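One simple way to picture guidance by classification (CLS) tokens is as a filter that keeps only candidates whose CLS tokens match those of the input molecule. The sketch below assumes each molecule has already been tokenized into a mapping from region name to an integer CLS token; the region names, token values, and function name are hypothetical, and an actual molecule design computation model may apply the constraint during generation rather than afterwards.

```python
def same_structural_class(input_cls, output_cls, regions=None):
    """True if the output shares the input's CLS token in each selected region."""
    selected = list(input_cls) if regions is None else regions
    return all(output_cls.get(region) == input_cls[region] for region in selected)

# Hypothetical CLS tokens for the heavy-chain CDRs of an input antibody.
input_tokens = {"H1": 12, "H2": 3, "H3": 41}

# Hypothetical CLS tokens for two candidate output molecules.
candidates = {
    "candidate_1": {"H1": 12, "H2": 3, "H3": 41},
    "candidate_2": {"H1": 12, "H2": 7, "H3": 41},
}

kept_all_regions = [name for name, tokens in candidates.items()
                    if same_structural_class(input_tokens, tokens)]
kept_h3_only = [name for name, tokens in candidates.items()
                if same_structural_class(input_tokens, tokens, regions=["H3"])]
```

Restricting the comparison to selected regions, as in `kept_h3_only`, leaves the remaining regions free to vary.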
[0091] FIGS. 1A-B depict system diagrams illustrating examples of a molecule design system 100, in accordance with some example embodiments. Referring to FIGS. 1A-B, the molecule design system 100 may include a structural representation computation engine 110, a molecule design engine 120, a property analysis engine 130, and a client device 140 including a user interface 145. As shown in FIGS. 1A-B, the structural representation computation engine 110, the molecule design engine 120, the property analysis engine 130, and the client device 140 may be communicatively coupled via a network 170. The client device 140 may be a processor-based device including, for example, a workstation, a desktop computer, a laptop computer, a smartphone, a tablet computer, a wearable apparatus, and / or the like. The network 170 may be a wired network and / or a wireless network including, for example, a local area network (LAN), a virtual local area network (VLAN), a wide area network (WAN), a public land mobile network (PLMN), the Internet, and / or the like. As described in more detail below, in some cases, the structural representation computation engine 110 may determine, for an input molecule 150, a structural representation of the input molecule 150. In some cases, the molecule design engine 120 may generate an output molecule 165, for example, by modifying a seed molecule 160 while being guided by the structural representation of the input molecule 150. Furthermore, in some cases, the property analysis engine 130 may determine, based at least on the canonical conformation signature 155 of the input molecule 150, a property prediction 133 for the input molecule 150.
[0092] Referring first to FIG. 1A, in the example of the molecule design system 100 shown therein, the structural representation computation engine 110 may include a canonical conformation computation model 115. In some cases, the canonical conformation computation model 115 may include one or more machine learning models that have been trained to generate, for an input molecule 150, a canonical conformation signature 155 of the input molecule 150. In some cases, the input molecule 150 may be a protein molecule or a portion of the protein molecule. For example, in some cases, the input molecule 150 may be an antigen receptor, such as an antibody or a T-cell receptor (TCR), or a portion of the antigen receptor, such as a complementarity determining region (CDR) loop. In some cases, the canonical conformation signature 155 of the input molecule 150 may include a probability distribution indicating, for each canonical conformation class in a set of possible canonical conformation classes, a probability of the input molecule 150 having the corresponding canonical conformation. As described in more detail below, the set of possible canonical conformation classes may be determined by at least clustering, into one or more clusters, known protein molecules or a portion of known protein molecules (e.g., complementarity determining region (CDR) loops) based on the dihedral representation of each known protein molecule or the portion of the known protein molecules. Each resulting cluster may correspond to a canonical conformation class. Moreover, each canonical conformation class may therefore contain two or more known protein molecules or portions of known protein molecules exhibiting sufficiently similar conformations, as represented by the corresponding dihedral representations.
[0093] In some example embodiments, the molecule design engine 120 may include a molecule design computation model 125. In some cases, the molecule design computation model 125 may include one or more machine learning models that have been trained to generate, based at least on an input molecule such as the seed molecule 160 shown in FIG. 1, an output molecule 165. In instances where the seed molecule 160 is a protein molecule (e.g., an antigen receptor such as an antibody or a T-cell receptor (TCR)) or a portion of the protein molecule (e.g., a complementarity determining region (CDR) loop), the molecule design computation model 125 may modify the sequence of amino acid residues forming the seed molecule 160 (e.g., the primary structure of the seed molecule 160) to generate the output molecule 165. For example, in some cases, the seed molecule 160 may be an antibody identified through an animal immunization as having binding affinity towards a target molecule (e.g., a viral antigen, a tumor antigen, and / or the like) but insufficient human-ness or significant liabilities, such as poor expression, immunogenicity, in vivo instability, and / or the like. In some cases, the molecule design computation model 125 may be a diffusion model that modifies (or denoises) the seed molecule 160 incrementally over multiple iterations to generate the output molecule 165. For instance, the molecule design computation model 125 may modify (or denoise) the seed molecule 160 to generate one or more intermediate molecules before further modifying (or denoising) each intermediate molecule to generate the output molecule 165.
[0094] In some cases, the modifying (or denoising) may be guided by the canonical conformation signature of the seed molecule 160 in order to preserve the canonical conformation of the seed molecule 160. For example, in some cases, the canonical conformation signature of the seed molecule 160 and that of an intermediate molecule may be determined by applying the canonical conformation computation model 115. In some cases, the molecule design computation model 125 may modify the seed molecule 160 such that the canonical conformation signature of the intermediate molecule exhibits a threshold similarity to the canonical conformation signature of the seed molecule 160. Furthermore, in some cases, the molecule design computation model 125 may further modify the intermediate molecule such that the canonical conformation signature of the output molecule 165 resulting therefrom exhibits the threshold similarity to the canonical conformation signature of the seed molecule 160 or is more similar to the canonical conformation signature of the seed molecule 160 than that of the intermediate molecule. Doing so may increase the likelihood of the output molecule 165 exhibiting one or more of the same desirable properties present in the seed molecule 160, such as binding affinity towards a target molecule (e.g., a viral antigen, a tumor antigen, and / or the like), despite having a different underlying sequence of amino acid residues.
[0095] In some cases, the molecule design computation model 125 may be applied to generate, based at least on the input molecule 150, the output molecule 165. For example, in some cases, the molecule design computation model 125 may generate the output molecule 165 by making incremental modifications to the amino acid residue sequence of the input molecule 150. In some cases, the molecule design computation model 125 may generate one or more intermediate molecules, each of which includes an incremental modification relative to the input molecule 150, before generating the output molecule 165. In some cases, the generating of the output molecule 165 may be guided by the canonical conformation signature of the input molecule 150 in order to preserve the conformation of the input molecule 150, as specified by the canonical conformation signature. For instance, in some cases, the canonical conformation computation model 115 may be applied to determine the canonical conformation signature of the output molecule 165 as well as any intermediate molecules. In some cases, an intermediate molecule may be identified as the output molecule 165 or progress to a next iteration of modifications if the canonical conformation signature of the intermediate molecule exhibits a threshold similarity to the canonical conformation signature of the input molecule 150. Alternatively and / or additionally, the intermediate molecule may be identified as the output molecule 165 or progress to the next iteration of modifications if the canonical conformation signature of the intermediate molecule is more similar to the canonical conformation signature of the input molecule 150 than the canonical conformation signature of an intermediate molecule from a previous iteration.
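The acceptance rule described above can be sketched as a simple propose-and-check loop. This is a toy illustration, not the diffusion-based model of the disclosure: `toy_signature` is a hypothetical stand-in for a trained canonical conformation computation model, total variation distance is an assumed similarity measure, and random point substitution is an assumed way of proposing incremental modifications.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_signature(seq):
    """Hypothetical stand-in for a trained signature model: a smoothed
    distribution over three residue groups (hydrophobic, charged, other)."""
    counts = [1.0, 1.0, 1.0]  # add-one smoothing
    for aa in seq:
        if aa in "AVLIMFWY":
            counts[0] += 1
        elif aa in "DEKRH":
            counts[1] += 1
        else:
            counts[2] += 1
    total = sum(counts)
    return [c / total for c in counts]

def total_variation(p, q):
    """Assumed distance between two probability distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def propose_point_mutation(seq, rng):
    """Assumed proposal step: substitute one randomly chosen position."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]

def guided_design(seed, signature_fn, distance_fn, propose_fn,
                  threshold, n_iter, rng):
    """Keep an intermediate molecule only if its signature is within the
    threshold of the seed's, or closer to it than the previous iterate's."""
    target = signature_fn(seed)
    current, current_dist = seed, 0.0
    for _ in range(n_iter):
        candidate = propose_fn(current, rng)
        dist = distance_fn(target, signature_fn(candidate))
        if dist <= threshold or dist < current_dist:
            current, current_dist = candidate, dist
    return current, current_dist

seed_sequence = "ARDYYGSSFDY"  # hypothetical CDR loop sequence
designed, distance = guided_design(
    seed_sequence, toy_signature, total_variation, propose_point_mutation,
    threshold=0.1, n_iter=50, rng=random.Random(0),
)
```

Because the loop starts from the seed itself, every accepted iterate stays within the threshold; the second condition becomes relevant when generation starts from a molecule whose signature lies outside it.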
[0096] In some example embodiments, the property analysis engine 130 may include a property computation model 132. In some cases, the property computation model 132 may include one or more machine learning models that have been trained to determine, based at least on the canonical conformation signature 131 of the output molecule 165, a property prediction 133. In some cases, the property prediction 133 may include one or more properties of interest, such as binding affinity, specificity, biological activity, developability, and / or the like. For example, in some cases, the binding affinity (or another property of interest) of the output molecule 165 may be scored based at least on the difference (e.g., quantified by Kullback-Leibler (KL) divergence, Wasserstein metric, and / or the like) between the canonical conformation signature 131 of the output molecule 165 and that of one or more known binders of a target antigen, such as the canonical conformation signature 155 of the input molecule 150. In some cases, a smaller difference between the canonical conformation signature 131 of the output molecule 165 and, for example, the canonical conformation signature 155 of the input molecule 150 may indicate that the three-dimensional structure of the output molecule 165 is sufficiently similar to the canonical conformation of the input molecule 150. In instances where the canonical conformation signature 155 of the input molecule 150 specifies the canonical conformation of one or more complementarity determining regions (CDRs) of the input molecule 150, that the canonical conformation signature 131 of the output molecule 165 is sufficiently similar to the canonical conformation signature 155 of the input molecule 150 may indicate that the output molecule 165 is also a likely binder of the same target antigen as the input molecule 150.
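The scoring described above can be sketched with the Kullback-Leibler (KL) divergence between signatures. In this minimal illustration, the direction of the divergence (known binder relative to candidate), the use of the minimum over several known binders, the smoothing constant, and the example distributions are all assumptions; a Wasserstein metric could be substituted for the divergence.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same ordered classes.
    The small eps guards against division by zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def binder_score(candidate_signature, known_binder_signatures):
    """Lower is better: divergence from the closest known binder."""
    return min(kl_divergence(reference, candidate_signature)
               for reference in known_binder_signatures)

# Hypothetical signatures over three canonical conformation classes.
known_binders = [[0.7, 0.2, 0.1]]
candidate_signatures = {
    "close": [0.65, 0.25, 0.10],
    "far": [0.10, 0.20, 0.70],
}

scores = {name: binder_score(sig, known_binders)
          for name, sig in candidate_signatures.items()}
ranked = sorted(scores, key=scores.get)  # most similar candidate first
```

A candidate whose signature equals that of a known binder scores zero, and the score grows as probability mass shifts to other canonical conformation classes.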
[0097] Referring now to FIG. 1B and the variation of the molecule design system 100 depicted therein, in some example embodiments, the structural representation computation engine 110 may include a tokenization model 116. In some cases, the tokenization model 116 may include one or more machine learning models that have been trained to generate a multimodal representation 156 of the input molecule 150. In some cases, the multimodal representation 156 of the input molecule 150 may include a sequence of tokens encoding the sequence and structure of the input molecule 150. For example, in some cases, the multimodal representation 156 of the input molecule 150 may include, for each of an n quantity of amino acid residues in the input molecule 150, a corresponding multimodal token (e.g., MM1, MM2, ..., MMn in FIG. 1B). In some cases, the multimodal token of an amino acid residue may encode the identity of the amino acid residue and one or more backbone dihedral (or torsion) angles present in the amino acid residue (e.g., phi (φ), psi (ψ), and omega (ω)). In some cases, the one or more backbone dihedral (or torsion) angles may undergo preprocessing, including geometric projection onto a unit circle, prior to being encoded.
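The preprocessing step mentioned above, projecting a backbone angle onto a unit circle, can be sketched as mapping each angle to a (sine, cosine) pair. The container type, field names, and example angles below are hypothetical, and the trained tokenization model that would consume these features is not shown.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ResidueFeatures:
    """Hypothetical per-residue input to a tokenization model."""
    residue: str           # one-letter amino acid identity
    angle_features: tuple  # (sin, cos) pairs for phi, psi, omega

def project_to_unit_circle(angle_degrees):
    """Represent an angle as a point on the unit circle."""
    radians = math.radians(angle_degrees)
    return (math.sin(radians), math.cos(radians))

def make_residue_features(residue, phi, psi, omega):
    features = ()
    for angle in (phi, psi, omega):
        features += project_to_unit_circle(angle)
    return ResidueFeatures(residue, features)

# Hypothetical backbone angles, in degrees, for a single tyrosine residue.
tyrosine = make_residue_features("Y", phi=-60.0, psi=-45.0, omega=180.0)

# Angles of -180 and 180 degrees describe the same geometry, and the
# projection maps them to (numerically) the same point.
wrapped_low = project_to_unit_circle(-180.0)
wrapped_high = project_to_unit_circle(180.0)
```

The projection removes the artificial discontinuity at the ±180 degree boundary, which matters for omega because trans peptide bonds sit close to that boundary.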
[0098] In some cases, the multimodal representation 156 of the input molecule 150 may include at least one classification (CLS) token. For instance, in some cases, the tokenization model 116 may generate a classification (CLS) token for each complementarity determining region (CDR) of the input molecule 150. In some cases, each classification (CLS) token may encode the identity and three-dimensional structure of the individual amino acid residues associated with the classification (CLS) token. The classification (CLS) token of a complementarity determining region (CDR) of the input molecule 150, for example, may encode the identity and structure of every amino acid residue in that complementarity determining region (CDR).
[0099] In some example embodiments, the generation of novel protein sequences may be guided by the multimodal representation 156 of the input molecule 150. For example, in some cases, the molecule design computation model 125 may be applied to generate the output molecule 165 by at least modifying the seed molecule 160. In some cases, the seed molecule 160 may be modified such that the output molecule 165 resulting therefrom is assigned the same classification (CLS) token as the input molecule 150, thereby preserving the three-dimensional structure of at least a portion of the input molecule 150 (e.g., one or more complementarity determining regions (CDRs)) and the properties of interest associated with the three-dimensional structure of at least the portion of the input molecule 150. In some cases, the generation of the output molecule 165 may further incorporate the multimodal tokens (e.g., MM1, MM2, ..., MMn in FIG. 1B) of the amino acid residues forming at least the portion of the input molecule 150 (e.g., the one or more complementarity determining regions (CDRs)). In some cases, guiding the generation of the output molecule 165 based on the multimodal representation 156 of the input molecule 150 may enable the output molecule 165 to be generated with different levels of sequence diversity while maintaining a sufficiently similar structure to the input molecule 150, thereby increasing the likelihood of the output molecule 165 being novel but exhibiting the same properties of interest as the input molecule 150.
[000100] In some example embodiments, the property computation model 132 may be trained to determine, based at least on the multimodal representation 131 of the output molecule 165, the property prediction 133. As noted, in some cases, the property prediction 133 may include one or more properties of interest, such as binding affinity, specificity, biological activity, developability, and / or the like.
For example, in some cases, the binding affinity (or another property of interest) of the output molecule 165 may be determined based at least on the classification (CLS) token associated with one or more portions of the output molecule 165, such as one or more complementarity determining regions (CDRs) of the output molecule 165. Furthermore, in some cases, the binding affinity (or another property of interest) of the output molecule 165 may be determined based at least on the multimodal tokens associated with the amino acid residues forming the output molecule 165.
[000101] FIG. 2A depicts a flowchart illustrating an example of a process 200 for canonical conformation guided molecule generation, in accordance with some example embodiments. Referring to FIGS. 1A and 2A, in some example embodiments, the process 200 may be performed to generate, for example, the output molecule 165 by at least applying the molecule design computation model 125 to modify the seed molecule 160. As described in more detail below, the canonical conformation computation model 115 may be applied to generate a canonical conformation signature for the seed molecule 160, the output molecule 165, and, in instances where the output molecule 165 is generated incrementally over multiple iterations, one or more intermediate molecules. Moreover, the generating of the output molecule 165 may be guided by the canonical conformation signatures generated by the canonical conformation computation model 115 to ensure that the canonical conformation of the output molecule 165 is sufficiently similar to that of the seed molecule 160.
[000102] At 202, an input molecule is identified. In some example embodiments, the input molecule may be a protein molecule having a sequence of amino acid residues. In some cases, the input molecule may be a subsequence of a protein molecule, such as one or more complementarity determining regions (CDRs) of an antibody.
In some cases, the sequence of amino acid residues forming the input molecule may be identified or, alternatively, received. In some cases, the input molecule may be an antigen receptor, such as an antibody or a T-cell receptor (TCR). In some cases, the input molecule may include one or more portions of the protein molecule, such as a complementarity determining region (CDR) loop of an antigen receptor. In some cases, the input molecule may include the variable domain of the heavy chain and / or the variable domain of the light chain of an antibody. In some cases, while the input molecule includes a single sequence of amino acid residues forming the protein molecule, one or more tokens may be inserted to separate different portions of the protein molecule. For instance, in some cases, one or more tokens may separate the heavy chain of the protein molecule from the light chain of the protein molecule by at least indicating a start (or end) of the sequence of amino acid residues forming the heavy chain and / or a start (or end) of the sequence of amino acid residues forming the light chain.
[000103] In some example embodiments, the input molecule may exhibit one or more desirable properties. For example, in some cases, the input molecule may be a seed molecule, such as an antibody or a complementarity determining region (CDR) loop of the antibody identified through an animal immunization campaign as having binding affinity towards a target molecule (e.g., a viral antigen, a tumor antigen, and / or the like). In some cases, the input molecule may undergo modification to further improve the one or more desirable properties or, in some cases, reduce (or eliminate) one or more undesirable properties.
As described in more detail below, the modifying of the input molecule may be guided by the canonical conformation of the input molecule to ensure that the modifications to the sequence of amino acid residues forming the input molecule preserve the canonical conformation of the input molecule.
[000104] At 204, a canonical conformation computation model is used to generate a canonical conformation signature indicative of a conformation of the input molecule. In some example embodiments, the canonical conformation computation model may be trained to determine, based at least on the sequence of amino acid residues forming the input molecule, the canonical conformation signature of the input molecule. In some cases, the canonical conformation computation model may be trained to determine, for each canonical conformation class in a set of possible canonical conformation classes, a probability of the input molecule having the corresponding canonical conformation. Accordingly, in some cases, the canonical conformation signature of the input molecule may include a probability distribution across the set of possible canonical conformation classes. For example, the canonical conformation signature of the input molecule may include a first probability of the input molecule having a first canonical conformation, a second probability of the input molecule having a second canonical conformation, and/or the like. In instances where the input molecule includes the variable domain of the heavy chain as well as the variable domain of the light chain of an antibody, the first probability may be that of the input molecule exhibiting one of the possible heavy chain canonical conformations while the second probability may be that of the input molecule exhibiting one of the possible light chain canonical conformations.
As described in more detail below, the set of possible canonical conformation classes may be determined based on the conformations of known protein molecules (e.g., antigen receptors such as antibodies or T-cell receptors (TCRs)) or portions of known protein molecules (e.g., complementarity determining region (CDR) loops). For instance, in some cases, the set of possible canonical conformation classes may be determined by at least clustering the dihedral representations of the conformations of known protein molecules or portions of known protein molecules.
[000105] At 206, an output molecule is generated by at least modifying the amino acid residue sequence of the input molecule while guided by the canonical conformation signature of the input molecule. In some example embodiments, the molecule design computation model may generate the output molecule by at least modifying the sequence of amino acid residues forming the input molecule. For example, in some cases, the molecule design computation model may modify the amino acid residue sequence of the input molecule by inserting, deleting, or changing the identity of one or more amino acid residues. In instances where the amino acid residue sequence of the input molecule is rendered in a fixed-length representation, the insertion and deletion of an amino acid residue may be accomplished by the amino acid residue replacing or being replaced by a gap character (to indicate the absence of an amino acid residue at a corresponding position in the amino acid residue sequence). In some cases, the molecule design computation model may generate the output molecule over multiple successive iterations.
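The fixed-length representation described in paragraph [000105] can be illustrated with a minimal Python sketch in which insertions and deletions are expressed as swaps with a gap character. The gap symbol `-`, the function names, and the example sequence are assumptions chosen for illustration and are not taken from the disclosure.

```python
GAP = "-"  # assumed gap symbol; the disclosure does not name a specific character


def substitute(seq: str, pos: int, residue: str) -> str:
    """Change the identity of the residue at a fixed position."""
    if seq[pos] == GAP:
        raise ValueError("position holds a gap; use insert() instead")
    return seq[:pos] + residue + seq[pos + 1:]


def delete(seq: str, pos: int) -> str:
    """Deletion: the residue at `pos` is replaced by the gap character."""
    if seq[pos] == GAP:
        raise ValueError("position is already a gap")
    return seq[:pos] + GAP + seq[pos + 1:]


def insert(seq: str, pos: int, residue: str) -> str:
    """Insertion: a residue replaces the gap character at `pos`."""
    if seq[pos] != GAP:
        raise ValueError("insertion requires a gap at the position")
    return seq[:pos] + residue + seq[pos + 1:]


def ungapped(seq: str) -> str:
    """Recover the plain amino acid sequence by dropping gap characters."""
    return seq.replace(GAP, "")


seed = "ARDY--FDY"                 # hypothetical loop with two gap positions
step1 = insert(seed, 4, "G")       # a residue fills a gap
step2 = delete(step1, 1)           # a residue becomes a gap
step3 = substitute(step2, 0, "V")  # a residue changes identity
print(step3, ungapped(step3))
```

Because every operation preserves the total length, each position keeps a stable index across iterations, which is what allows a model operating on fixed-length inputs to express length changes.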
For instance, in some cases, the molecule design computation model may be a diffusion model trained to modify (or denoise) the input molecule over multiple successive iterations.
[000106] In some example embodiments, the generating of the output molecule may be guided by the canonical conformation signature of the input molecule in order to preserve the canonical conformation of the input molecule. For example, in some cases, the canonical conformation of the input molecule may be associated with one or more desirable properties, such as binding affinity to a target molecule. Accordingly, in instances where the output molecule is generated through incremental modifications to the input molecule, the modification of the input molecule may progress in accordance with certain criteria with respect to the canonical conformation signatures of the resulting molecules. For instance, in some cases, with each iteration of modifications, the canonical conformation computation model may be applied to determine a canonical conformation signature for the resulting intermediate molecule. An intermediate molecule may be identified as an output molecule or progress to a next iteration of modifications if the canonical conformation signature of the intermediate molecule exhibits a threshold similarity to the canonical conformation signature of the input molecule. Alternatively and/or additionally, the intermediate molecule may be identified as the output molecule or progress to the next iteration of modifications if the canonical conformation signature of the intermediate molecule is more similar to the canonical conformation signature of the input molecule than the canonical conformation signature of an intermediate molecule from a previous iteration.
In instances where the input molecule is modified to alter one or more other properties present in the input molecule, an intermediate molecule generated by modifying the input molecule may be identified as the output molecule or progress to the next iteration of modifications if the one or more other properties of the intermediate molecule are determined to satisfy one or more corresponding criteria.
[000107] It should be appreciated that the similarity between two canonical conformation signatures may be quantified by the similarity between the two corresponding probability distributions. For example, in some cases, the similarity between the canonical conformation signature of an intermediate molecule and the canonical conformation signature of the input molecule may be quantified by a similarity metric measuring a difference between two probability distributions, such as Kullback-Leibler (KL) divergence, Wasserstein metric, and/or the like. Accordingly, an intermediate molecule exhibiting an incremental modification to the input molecule may be identified as the output molecule or progress to the next iteration of modifications if the similarity metric of the intermediate molecule satisfies one or more threshold values.
[000108] In some example embodiments, the molecule design computation model may generate the output molecule by at least performing a discrete diffusion denoising task while guided by the canonical conformation signature of the input molecule. In some cases, this guided diffusion with explicit steering may be reparametrized as a search over continuous hidden states $h$ that define categorical distributions over tokens when transformed through the final linear token classification head and softmax. To further illustrate, let $H$ denote the random variable representing the hidden state from a shared encoder $g(\cdot)$ that maps protein sequences to hidden states.
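The signature comparison of paragraphs [000106] and [000107] can be sketched as follows. This is a minimal illustration that assumes Kullback-Leibler divergence as the metric, measured from the candidate signature to the seed signature; the direction of the divergence, the threshold value of 0.1, and all function names are assumptions rather than details given in the disclosure.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def may_progress(candidate_sig, seed_sig, previous_sig=None, threshold=0.1):
    """Decide whether an intermediate molecule may progress or be output.

    Criterion A: the candidate signature is within `threshold` of the seed.
    Criterion B: the candidate is closer to the seed than the previous iterate.
    The threshold is a hypothetical value chosen for this example.
    """
    d_new = kl_divergence(candidate_sig, seed_sig)
    if d_new <= threshold:
        return True
    if previous_sig is not None:
        return d_new < kl_divergence(previous_sig, seed_sig)
    return False


# Hypothetical signatures over three canonical conformation classes.
seed_sig = [0.85, 0.10, 0.05]
close_sig = [0.80, 0.15, 0.05]
drifted_sig = [0.10, 0.80, 0.10]
worse_sig = [0.05, 0.90, 0.05]

print(may_progress(close_sig, seed_sig))               # passes criterion A
print(may_progress(drifted_sig, seed_sig))             # fails both criteria
print(may_progress(drifted_sig, seed_sig, worse_sig))  # passes criterion B
```

A Wasserstein metric could be substituted for `kl_divergence` without changing the acceptance logic, since only a scalar dissimilarity is consumed.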
As described in more detail below, the shared encoder $g(\cdot)$ may be applied to generate a representation of the input molecule for both the generative task as well as the determination of the corresponding canonical conformation signature.
[000109] In some cases, the molecule design computation model may implement diffusion optimized sampling (NOS) when generating novel protein sequences. In some cases, diffusion optimized sampling (NOS) may be performed with the value function shown as Equation (3) below, which is adapted for preserving the canonical conformation of complementarity determining regions (CDRs) in antibodies. In some cases, guidance from this value function may ensure that the canonical conformation signatures (or the canonical conformation class distributions) of the output molecules are sufficiently similar to those of the input molecule exhibiting one or more properties of interest.
(3)
wherein $p(C_{ij} \mid H = h')$ and $p(C_{ij} \mid H = g(x))$ denote the canonical conformation signatures (or the canonical conformation class distributions) of complementarity determining region $j$ of chain $i$ computed from the guided hidden states and seed hidden states, respectively.
[000110] Table 1 below depicts pseudocode implementing an example of the algorithm for implementing generation with diffusion optimized sampling (NOS).
[000111] Table 1
Algorithm: Generation with Diffusion Optimized Sampling (NOS)
Input: Seed sequence $x$, guidance regularization strength $\lambda$, denoising steps $T$, guidance steps $S$
Output: Generated design $x_0'$
Predict seed CDR canonical classes: $\{p(C_{ij} \mid H = g(x))\}$;
Define infilling region $\mathcal{I}$ (e.g., selected CDR positions);
Initialize design $x_0' = x$;
for denoising step $t = T - 1$ to $0$ do
    Select subset $\mathcal{I}_t \subseteq \mathcal{I}$ of infilling region to mask;
    Apply masking: $x_t' \leftarrow \mathrm{mask}(x_0', \mathcal{I}_t)$;
    Initialize guided states: $h_t' \leftarrow h_t$;
    Initialize updated proposal: $x_0'' \leftarrow x_0'$;
    for $s = 1$ to $S$ do
        Compute guidance signal: $v(h_t') \leftarrow$ Eq. 3;
        Compute gradient: $\nabla_{h_t'}\left[(1 - \lambda)\, v(h_t') + \lambda\, \mathrm{KL}\left(p(X \mid h_t') \,\|\, p(X \mid h_t)\right)\right]$;
        Update activations: $h_t' \leftarrow h_t' + \eta\, \nabla_{h_t'}[\cdot]$;
    Resample masked tokens: $x_0''[i] \sim \mathrm{softmax}(W h_t'[i] + b),\ \forall i \in \mathcal{I}_t$;
    Compute $v(x_0'')$ by re-encoding: $v(x_0'') \leftarrow v(g(x_0''); x)$;
    if $v(x_0') < v(x_0'')$ then
        Accept update: $x_0' \leftarrow x_0''$
return $x_0'$
[000112] FIG. 2B depicts a flowchart illustrating an example of a process 250 for training a canonical conformation computation model, in accordance with some example embodiments. Referring to FIGS. 1A and 2A-B, in some example embodiments, the process 250 may be performed to train, for example, the canonical conformation computation model 115 to generate, for the input molecule 150, the output molecule 165 by at least applying the molecule design computation model 125 to modify the seed molecule 160. As described in more detail below, the generating of the output molecule 165 may be guided by the canonical conformation signature of the output molecule 165 to ensure that the canonical conformation of the output molecule 165 is sufficiently similar to that of the seed molecule 160.
[000113] At 252, a set of canonical conformation classes is determined. In some example embodiments, the set of canonical conformation classes may be determined based on the conformations of known protein molecules or portions of known protein molecules. For example, in some cases, the set of canonical conformation classes may be determined based on the conformations of known antigen receptors (e.g., antibodies, T-cell receptors (TCRs), and/or the like) or portions of known antigen receptors (e.g., complementarity determining region (CDR) loops). For antigen receptors, the set of canonical conformation classes may include canonical conformation classes that are specific to the different chains within the antigen receptor.
For example, for antibodies, the set of canonical conformation classes may include some canonical conformation classes that are associated with the heavy chain and some that are associated with the light chain. For T-cell receptors (TCRs), the set of canonical conformation classes may include some conformation classes that are associated with the alpha chain and some that are associated with the beta chain. In some cases, the set of canonical conformation classes may be determined based on a dihedral representation of the conformations of known protein molecules or portions of known protein molecules. For instance, in some cases, the set of canonical conformation classes may be determined by clustering the dihedral representations of the known protein molecules or portions of known protein molecules. As described in more detail below, the dihedral representation of a protein molecule or a portion of the protein molecule may include the dihedral angle between each pair of amino acid residues present in the protein molecule or the portion of the protein molecule.
[000114] At 254, a canonical conformation computation model is trained to determine, based at least on a sequence of amino acid residues forming an input molecule, a canonical conformation signature. In some example embodiments, the canonical conformation signature may include, for each canonical conformation class in the set of canonical conformation classes, a probability of the input molecule exhibiting a canonical conformation associated with the canonical conformation class. In some example embodiments, the canonical conformation computation model may include one or more machine learning models, such as artificial neural networks, that have been trained to classify the input molecule based on the amino acid residue sequence of the input molecule.
In some cases, the canonical conformation computation model may be trained using a training dataset that includes the amino acid residue sequence of one or more known protein molecules (or portions of known protein molecules) and the corresponding conformations in order to learn the relationship between the amino acid residue sequence of a protein molecule and the conformation (or three-dimensional structure) of the protein molecule. For example, in some cases, the training of the canonical conformation computation model may include adjusting one or more parameters (e.g., weights, biases, and/or the like) of the canonical conformation computation model such that the canonical conformation computation model generates, for each known protein molecule or portion of known protein molecule in the training dataset, a canonical conformation signature that is consistent with the corresponding ground-truth canonical conformation.
[000115] In some cases, the canonical conformation computation model may be trained to classify the input molecule by at least determining, for each canonical conformation class in the set of canonical conformation classes, a probability of the input molecule having the canonical conformation of the canonical conformation class. As noted, each canonical conformation class in the set of canonical conformation classes may be associated with one of a limited number of canonical conformations (or three-dimensional structures) associated with a certain class of protein molecules, each of which is defined by the quantity of amino acid residues and the identities of the amino acid residues present at certain key sites. In other words, there are certain commonalities present in the conformation (or three-dimensional structure) of a certain class of protein molecules, which can be leveraged for the design of new protein molecules from the same class of protein molecules.
For example, in some cases, the canonical conformation computation model may be trained to generate, for the input molecule, a canonical conformation signature that includes a first probability of the input molecule having a first canonical conformation and a second probability of the input molecule having a second canonical conformation. The resulting canonical conformation signature may describe the conformation (or three-dimensional structure) of the input molecule as a set of probabilities across the set of canonical conformations. For instance, in some cases, the canonical conformation signature of the input molecule may indicate that the input molecule is more likely to exhibit the first canonical conformation than the second canonical conformation. Accordingly, as described in more detail below, when the input molecule undergoes modifications to generate one or more new protein molecules, the canonical conformation signature of the input molecule may indicate the extent to which these modifications alter the canonical conformation of the input molecule. Where the design objective includes applying modifications that still preserve the canonical conformation of the input molecule, the canonical conformation signature of the input molecule may be compared to that of a modified molecule to determine whether the modified molecule exhibits a similar (or same) canonical conformation as the input molecule.
[000116] At 256, the trained canonical conformation computation model is applied to guide the generation of one or more protein molecules. In some example embodiments, the trained canonical conformation computation model may be applied to generate canonical conformation signatures to guide the generation of the one or more protein molecules. For example, in some cases, a molecule design computation model may be applied to generate an output molecule by modifying a seed molecule.
In some cases, the molecule design computation model may generate the output molecule by making incremental modifications to the seed molecule over multiple successive iterations, thereby generating one or more intermediate molecules. In some cases, the trained canonical conformation computation model may be applied to generate a canonical conformation signature for each of the seed molecule, the output molecule, and the one or more intermediate molecules. In some cases, the generating of the output molecule may be guided by the canonical conformation of the seed molecule in order to preserve the canonical conformation of the seed molecule. For instance, in some cases, each successive intermediate molecule may be generated such that the corresponding canonical conformation signature exhibits a threshold similarity (e.g., Kullback-Leibler (KL) divergence) to the canonical conformation signature of the seed molecule. Alternatively and/or additionally, each successive intermediate molecule may be generated such that the corresponding canonical conformation signature is more similar to the canonical conformation signature of the seed molecule than the canonical conformation signature of an intermediate molecule from a previous iteration.
[000117] FIG. 3A depicts a flowchart illustrating an example of a process 300 for token guided molecule generation, in accordance with some example embodiments. Referring to FIGS. 1B and 3A, in some example embodiments, the process 300 may be performed to train and apply, for example, the tokenization model 116 to generate the multimodal representation 156 of the input molecule 150. In some cases, the molecule design computation model 125 may be applied to generate, for example, the output molecule 165 by at least modifying the seed molecule 160 while guided by the multimodal representation 156 of the input molecule 150.
In some cases, the input molecule 150 may serve as a reference molecule and may be the same molecule as the seed molecule 160 or a different molecule altogether. In some cases, the input molecule 150 may exhibit one or more properties of interest, such as binding affinity, specificity, biological activity, developability, and/or the like. In some cases, the tokenization model 116 may be trained to generate the multimodal representation 156 of the input molecule 150 to encode the sequence and three-dimensional structure of the input molecule 150 or one or more specific portions (or regions) thereof. In some cases, guidance from the multimodal representation 156 of the input molecule 150 may ensure that the resulting output molecule 165 exhibits a similar three-dimensional structure as the input molecule 150, thereby increasing the likelihood of the output molecule 165 also exhibiting the same properties of interest as the input molecule 150.
[000118] At 302, a tokenization model is trained to generate multimodal representations. In some example embodiments, the tokenization model may be trained to generate multimodal representations that encode the sequence and three-dimensional structure of protein molecules. In some cases, the tokenization model may be trained to generate, for an input molecule, a multimodal representation that includes a multimodal token for each constituent amino acid residue in at least a portion (or region) of the input molecule. In some cases, the multimodal token of an amino acid residue may encode the identity of the amino acid residue (e.g., one of twenty canonical amino acid residues) and one or more backbone dihedral angles (e.g., phi ($\phi$), psi ($\psi$), omega ($\omega$), and/or the like).
In some cases, the multimodal representation of the input molecule may further include a classification (CLS) token encoding the identities and the three-dimensional structures of the amino acid residues forming at least the portion (or region) of the input molecule. In some cases, the tokenization model 116 may be trained on multiple objectives including, for example, masked reconstruction of dihedral angles, masked reconstruction of amino acid residue identities, contrastive learning of protein backbones, codebook learning, and/or the like.
[000119] At 304, the tokenization model is applied to generate a multimodal representation of an input molecule. In some example embodiments, the multimodal representation of the input molecule may include, for each amino acid residue forming at least a portion (or region) of the input molecule, a multimodal token encoding the identity of the amino acid residue and its backbone dihedral angles (e.g., phi ($\phi$), psi ($\psi$), omega ($\omega$), and/or the like). In some cases, the multimodal representation of the input molecule may further include at least one classification (CLS) token. For example, in some cases, the multimodal representation of the input molecule may include, for each complementarity determining region (CDR) of the input molecule, a classification (CLS) token encoding the identities and three-dimensional structures of the constituent amino acid residues.
[000120] To further illustrate, in some cases, the input molecule may include $n$ amino acid residues. In some cases, the input molecule may include a portion (or region) of a protein molecule, such as the complementarity determining region (CDR) of an antibody. In some cases, the input molecule may be defined by its sequence of amino acid residues $a = \{a_1, \ldots, a_n\}$, wherein $\forall i,\ a_i \in V = \{\mathrm{Ala}, \mathrm{Arg}, \ldots, \mathrm{Tyr}, \mathrm{Val}\}$ are the canonical amino acid residues.
In some cases, the input molecule may be further defined by the backbone dihedral angles $(\phi, \psi, \omega) \in (-\pi, \pi]^{n}$ of the constituent amino acid residues. In some cases, the tokenization model, denoted as $f(\cdot)$, may be trained to generate multimodal representations such that $f(a, \phi, \psi, \omega) = t$, wherein $t \in \mathbb{R}^{d}$ and $d$ denotes the dimension of the individual tokens forming the multimodal representation. In some cases, the token $t$ may be used to identify two protein molecules (e.g., two complementarity determining regions (CDRs)) sharing a similar three-dimensional structure. In some cases, the input of the tokenization model $f(\cdot)$ may include $n$ amino acid residues of the input molecule, each of which has dihedral angles $(\phi, \psi, \omega) \in (-\pi, \pi]^{n \times 3}$ and amino acid residue identities $a$. In some cases, the dihedral angles $(\phi, \psi, \omega)$ may be converted into coordinates on the unit circle $(\cos\phi, \cos\psi, \cos\omega, \sin\phi, \sin\psi, \sin\omega) \in [-1, 1]^{n \times 6}$ before being projected with a linear layer $D = (\cos\phi, \cos\psi, \cos\omega, \sin\phi, \sin\psi, \sin\omega)\, W_{\mathrm{dihedral}}^{\top} + b_{\mathrm{dihedral}}$, wherein $D \in \mathbb{R}^{n \times d}$, $W_{\mathrm{dihedral}} \in \mathbb{R}^{d \times 6}$, and $b_{\mathrm{dihedral}} \in \mathbb{R}^{d}$. In some cases, the amino acid residue identities $a$ may be encoded with learnable embeddings into $A \in \mathbb{R}^{n \times d}$, for example, for each of the twenty canonical amino acid residue types. In some cases, the amino acid residue identity and dihedral angle embeddings may be summed to produce, for the input molecule, a multimodal embedding $X = D + A \in \mathbb{R}^{n \times d}$. In some cases, the classification (CLS) token $t$ encoding the overall structure of the input molecule (or at least the portion or region thereof), for example, across the multimodal embeddings $X = (x_1, \ldots, x_n)$ of the constituent amino acid residues may be determined using a transformer-based machine learning model.
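The construction of the multimodal embedding in paragraph [000120] can be sketched in plain Python as shown below. The sketch covers only the unit-circle conversion, the linear projection, and the sum with the amino acid embeddings; the embedding dimension, the random initialization, the one-letter residue alphabet, and all function names are assumptions made for illustration, and the transformer that would produce the classification (CLS) token is omitted.

```python
import math
import random


def unit_circle_features(phi, psi, omega):
    """Map three dihedral angles (radians) to six unit-circle coordinates."""
    return [math.cos(phi), math.cos(psi), math.cos(omega),
            math.sin(phi), math.sin(psi), math.sin(omega)]


def linear(features, weight, bias):
    """Apply a d x 6 weight matrix and a length-d bias to a length-6 vector."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weight, bias)]


def embed(residues, angles, aa_table, weight, bias):
    """Return one length-d vector per residue: dihedral part plus identity part."""
    embeddings = []
    for aa, (phi, psi, omega) in zip(residues, angles):
        d_vec = linear(unit_circle_features(phi, psi, omega), weight, bias)
        a_vec = aa_table[aa]
        embeddings.append([dv + av for dv, av in zip(d_vec, a_vec)])
    return embeddings


rng = random.Random(0)          # fixed seed so the example is repeatable
d = 4                           # hypothetical token dimension
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
aa_table = {aa: [rng.gauss(0, 1) for _ in range(d)] for aa in AMINO_ACIDS}
weight = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(d)]
bias = [0.0] * d

loop = "GYT"                    # hypothetical three-residue fragment
loop_angles = [(-1.0, 2.5, math.pi), (-2.0, 2.0, 3.1), (-1.2, -0.7, -3.1)]
X = embed(loop, loop_angles, aa_table, weight, bias)
print(len(X), len(X[0]))
```

Encoding each angle as a cosine and sine pair avoids the discontinuity at the boundary of $(-\pi, \pi]$: angles just below $\pi$ and just above $-\pi$ describe nearly the same geometry and receive nearly identical features.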
In some cases, the classification (CLS) token $t$ may be added to one end of the sequence of multimodal embeddings $X$, such as the start (or the end) of the sequence, to provide a meaningful overall representation of the three-dimensional structure of the input molecule. In some cases, where the input molecule includes multiple complementarity determining regions (CDRs) of an antibody, the multimodal representation of the input molecule may include, for each complementarity determining region, a corresponding classification (CLS) token $t$.
[000121] At 306, one or more molecule design tasks are performed based on the multimodal representation of the input molecule. In some example embodiments, the one or more molecule design tasks may include guided generation of novel protein sequences in which a molecule design computation model is applied to generate an output molecule while guided by the multimodal representation of an input molecule. For example, in some cases, the molecule design computation model may be applied to generate the output molecule by at least modifying a seed molecule. In some cases, the seed molecule may be the same molecule as, or a different molecule from, the input molecule. In some cases, the modifying of the seed molecule may be guided by the multimodal representation of the input molecule in order to preserve, in the resulting output molecule, the three-dimensional structure of the input molecule. For instance, in some cases, the seed molecule may be modified such that the resulting output molecule is associated with the same (or sufficiently similar) multimodal representation as the input molecule.
In some cases, the modifying of the seed molecule may be guided by the classification (CLS) token of the input molecule such that the resulting output molecule is associated with the same (or sufficiently similar) classification (CLS) token as the input molecule.
[000122] In some cases, the one or more molecule design tasks may include identifying, based at least on the multimodal representation of the input molecule, one or more molecules having a similar three-dimensional structure (e.g., backbone structure). For example, in some cases, the one or more molecule design tasks may include paratope retrieval in which complementarity determining regions (CDRs) having a similar three-dimensional structure (e.g., backbone structure) as a reference molecule are identified based on the multimodal representation of the input molecule.
[000123] In some cases, the one or more molecule design tasks may include determining, based at least on the multimodal representation of an input molecule, one or more properties of the input molecule. In some cases, the one or more properties may include binding affinity, specificity, biological activity, developability, and/or the like. For example, in some cases, a property computation model may be applied to determine, based at least on the classification (CLS) token included in the multimodal representation of the input molecule, one or more properties of the input molecule. Alternatively, in some cases, the property computation model may be applied to determine, based at least on the multimodal token of each amino acid residue in the input molecule, the one or more properties of the input molecule.
[000124] In some cases, the one or more molecule design tasks may include identifying, based at least on the multimodal representation of an input molecule, a cluster (or group) of other molecules having a similar three-dimensional structure as the input molecule.
For example, in some cases, the input molecule may be assigned to a cluster (or group) of similar complementarity determining regions (CDRs) based on the classification (CLS) token in its multimodal representation.
[000125] FIG. 3B depicts a flowchart illustrating an example of a process 350 for training a tokenization model, in accordance with some example embodiments. Referring to FIGS. 1B and 3A-B, in some cases, the process 350 may implement operation 302 of the process 300 shown in FIG. 3A. In some example embodiments, the process 350 may be performed to train, for example, the tokenization model 116 to generate the multimodal representation 156 of the input molecule 150. In some cases, the multimodal representation 156 of the input molecule 150 may be used to guide the generation of the output molecule 165, for example, by the molecule design computation model 125. As described in more detail below, in some cases, the tokenization model 116 may be trained on multiple objectives including, for example, masked reconstruction of dihedral angles, masked reconstruction of amino acid residue identities, contrastive learning of protein backbones, codebook learning, and/or the like.
[000126] At 352, a tokenization model is trained to reconstruct masked dihedral angles. In some example embodiments, the masking of dihedral angles may include one or more of (i) masking uniformly at random the dihedral angles and amino acid residue identities of a first portion (e.g., 20%) of training samples, (ii) masking uniformly at random the dihedral angles but not the amino acid residues of a second portion (e.g., 20%) of training samples, and (iii) completely masking the dihedral angles while preserving the amino acid residue identities of a third portion (e.g., 10%) of training samples.
In some cases, for dihedral angle $\theta_i$ of the $i$-th amino acid residue in a training sample, the predicted unit circle coordinates $(\hat{x}_i, \hat{y}_i)$ may be given by passing the hidden representation of the amino acid residue to a multilayer perceptron (MLP). In some cases, the reconstruction loss $\mathcal{L}_{\text{dihedral recon},i}$ may be given by the mean squared error between the ground-truth dihedral angles $(\cos\theta_i, \sin\theta_i)$ of the training sample and the predicted dihedral angles $(\cos\hat{\theta}_i, \sin\hat{\theta}_i)$, wherein $\hat{\theta}_i = \arctan(\hat{y}_i / \hat{x}_i)$. In some cases, a penalty term $\mathcal{L}_{\text{dihedral reg}}$ may be added to regularize the tokenization model for the reconstruction of coordinates on the unit circle.
[000127] At 354, the tokenization model is trained to reconstruct masked amino acid residue identities. In some example embodiments, the masking of amino acid residue identities may include one or more of (i) masking uniformly at random the dihedral angles and amino acid residue identities of a first portion (e.g., 20%) of training samples, (ii) masking uniformly at random the amino acid residues but not the dihedral angles of a second portion (e.g., 20%) of training samples, and (iii) completely masking the amino acid residue identities while preserving the dihedral angles of a third portion (e.g., 10%) of training samples. In some cases, the logits of a masked amino acid residue may be given by passing the hidden representation of the amino acid residue to a multilayer perceptron (MLP). In some cases, the amino acid residue masking loss $\mathcal{L}_{\mathrm{AA},i}$ for amino acid residue $i$ in a training sample (e.g., a complementarity determining region (CDR)) may be given as the cross-entropy loss of the predicted amino acid residue identity and the ground-truth amino acid residue identity.
[000128] At 356, the tokenization model is trained to generate similar tokens for similar three-dimensional structures.
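The dihedral reconstruction objective of operation 352 can be sketched for a single angle as follows. The mean squared error follows the description in paragraph [000126]; the specific form of the penalty term (squared deviation of the predicted radius from 1) is an assumption, since the disclosure does not state it, and `math.atan2` is used here as a quadrant-aware stand-in for the arctangent of the coordinate ratio.

```python
import math


def dihedral_recon_loss(theta_true, x_pred, y_pred):
    """Mean squared error between (cos, sin) of the true angle and the prediction."""
    dx = math.cos(theta_true) - x_pred
    dy = math.sin(theta_true) - y_pred
    return 0.5 * (dx * dx + dy * dy)


def unit_circle_penalty(x_pred, y_pred):
    """Assumed regularizer: squared deviation of the predicted radius from 1."""
    return (math.hypot(x_pred, y_pred) - 1.0) ** 2


def predicted_angle(x_pred, y_pred):
    """Recover the predicted angle from predicted unit-circle coordinates."""
    return math.atan2(y_pred, x_pred)


theta = 2.5                              # hypothetical ground-truth angle in radians
x_hat, y_hat = 0.9 * math.cos(2.4), 0.9 * math.sin(2.4)  # hypothetical MLP output

print(dihedral_recon_loss(theta, x_hat, y_hat))
print(unit_circle_penalty(x_hat, y_hat))
print(predicted_angle(x_hat, y_hat))
```

Using `atan2` rather than a plain arctangent of the ratio keeps the recovered angle in the correct quadrant when the predicted x coordinate is negative or zero.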
In some example embodiments, the tokenization model may be trained through contrastive learning to generate similar tokens for similar three-dimensional structures. In some cases, in order for the tokenization model to learn a token $t$ such that similar tokens share similar three-dimensional structures (e.g., similar complementarity determining region (CDR) conformations), a contrastive loss function may be defined based on the similarity between two sample molecules (e.g., complementarity determining region (CDR) loops) $u$ and $v$ of lengths $n_u$ and $n_v$, respectively, with $n_u \le n_v$. In some cases, the sample molecules may have backbone dihedral angles $(\phi^u, \psi^u, \omega^u) \in (-\pi, \pi]^{n_u \times 3}$ and $(\phi^v, \psi^v, \omega^v) \in (-\pi, \pi]^{n_v \times 3}$. In some cases, the similarity between the three-dimensional structures of the sample molecules may be quantified by the dihedral distance $\mathcal{D}$ between the two structures, which is expressed as Equation (4) below.
$$\mathcal{D} = \frac{1}{n_u} \sum_{i=1}^{n_u} 2\left(1 - \cos\left(\theta_i^{u} - \theta_{P(i)}^{v}\right)\right) \quad (4)$$
wherein $P$ aligns the amino acid residues of sample molecule $v$ to the amino acid residues of sample molecule $u$. In instances where the two sample molecules have the same quantity of constituent amino acid residues (or $n_u = n_v$), the alignment between the two sample molecules may be a one-to-one mapping between the amino acid residues in each sample molecule. In instances where the two sample molecules have mismatched quantities of constituent amino acid residues (or $n_u \neq n_v$), the two structures may be aligned using dynamic time warping.
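For two loops of equal length, where the alignment is a one-to-one mapping, the dihedral distance of Equation (4) can be sketched as below. The sketch averages the per-angle term $2(1 - \cos(\cdot))$ over all residues and over the three backbone angles, which is an assumption about the normalization chosen so that the result lies in the range from 0 to 4; the function names and example angles are hypothetical.

```python
import math


def angle_term(a, b):
    """Per-angle contribution: 0 for identical angles, 4 for angles 180 degrees apart."""
    return 2.0 * (1.0 - math.cos(a - b))


def dihedral_distance(angles_u, angles_v):
    """Mean angle term over residues and over (phi, psi, omega).

    angles_u, angles_v: lists of (phi, psi, omega) tuples in radians,
    one tuple per residue, with equal lengths (one-to-one alignment).
    """
    if len(angles_u) != len(angles_v):
        raise ValueError("equal-length loops are required for one-to-one alignment")
    terms = [angle_term(a, b)
             for res_u, res_v in zip(angles_u, angles_v)
             for a, b in zip(res_u, res_v)]
    return sum(terms) / len(terms)


loop_u = [(-1.0, 2.5, 3.1), (-2.0, 2.0, 3.1), (-1.2, -0.7, 3.1)]
loop_v = [(-1.1, 2.4, 3.1), (-2.0, 2.1, 3.1), (-1.2, -0.6, 3.1)]
flipped = [(math.pi, math.pi, math.pi)] * 3
origin = [(0.0, 0.0, 0.0)] * 3

print(dihedral_distance(loop_u, loop_v))   # small value for similar backbones
print(dihedral_distance(flipped, origin))  # maximum value of 4.0
```

Because the term depends only on the cosine of the angle difference, angles near $\pi$ and near $-\pi$ are treated as close, which a naive absolute difference of angles would not do.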
It should be appreciated that due to the chirality of protein molecules in which the orientation of the backbone frame strongly influences atomic structure through sidechain positioning, the dihedral angle distance $\mathcal{D}$ may be used instead of root mean square distance (RMSD) to quantify the similarity between the three-dimensional structures of the two sample molecules at least because prominent examples exist where the backbone root mean square distance (RMSD) between two three-dimensional structures is low yet the dihedral angles could still be up to 180° apart with opposite-pointing sidechains.
[000129] Table 2 below depicts pseudocode implementing an example of the algorithm for aligning and determining the dihedral distance $\mathcal{D}$ between the two sample molecules with different quantities of amino acid residues, such as two complementarity determining regions (CDRs) of different lengths.
[000130] Table 2
Algorithm: Dihedral angle distance between pair of sample molecules of different lengths
Input: dihedral angles $\phi_1, \psi_1, \omega_1 \in (-\pi, \pi]^n$, $\phi_2, \psi_2, \omega_2 \in (-\pi, \pi]^m$; structural $C_\alpha$ coordinates $L_1 \in \mathbb{R}^{n \times 3}$, $L_2 \in \mathbb{R}^{m \times 3}$; stem $C_\alpha$ coordinates $S_1, S_2 \in \mathbb{R}^{n_{\text{stem}} \times 3}$; tolerance $k$ (maximum residue length difference)
Output: dihedral angle distance $\mathcal{D} \in [0, 4]$
1: if $|n - m| > k$ then
2:   return 4.0
3: end if
4: $\mu_1 \leftarrow \text{mean}(S_1)$; $\mu_2 \leftarrow \text{mean}(S_2)$
5: $S_1 \leftarrow S_1 - \mu_1$; $S_2 \leftarrow S_2 - \mu_2$
6: $(R, t, \text{RMSD}_{\text{stem}}) \leftarrow \text{Kabsch}(S_1, S_2)$
7: if $\text{RMSD}_{\text{stem}} > 1.0$ Å then
8:   return 4.0
9: end if
10: $L_1 \leftarrow (L_1 - \mu_1)R + \mu_2 + t$; $L_2 \leftarrow L_2$
11: $P \leftarrow \text{DTW}(L_1, L_2)$ ▷ warping path $P$ mapping residues of $L_1$ to $L_2$
12: $\phi_1 \leftarrow P(\phi_1)$; $\psi_1 \leftarrow P(\psi_1)$; $\omega_1 \leftarrow P(\omega_1)$
13: $\mathcal{D}_\phi \leftarrow \text{mean}(2(1 - \cos(\phi_1 - \phi_2)))$; $\mathcal{D}_\psi \leftarrow \text{mean}(2(1 - \cos(\psi_1 - \psi_2)))$; $\mathcal{D}_\omega \leftarrow \text{mean}(2(1 - \cos(\omega_1 - \omega_2)))$
14: $\mathcal{D} \leftarrow \text{mean}(\mathcal{D}_\phi, \mathcal{D}_\psi, \mathcal{D}_\omega)$
15: return $\mathcal{D}$
[000131] In some cases, the pair of sample molecules $u$ and $v$ may be identified as a positive pair ($Y_{uv} = 1$) if the two sample molecules are the same length and the dihedral angle distance $\mathcal{D}$ between the corresponding three-dimensional structures satisfies one or more thresholds (e.g., $\mathcal{D} < 0.1$).
In some cases, the pair of sample molecules $u$ and $v$ may be a negative pair ($Y_{uv} = 0$) if the sample molecules are different lengths or, where the two sample molecules have the same length, the dihedral angle distance $\mathcal{D}$ therebetween satisfies one or more different thresholds (e.g., $\mathcal{D} > 0.47$). In some cases, where none of the foregoing criteria are met, it may be possible for the pair of sample molecules $u$ and $v$ to be neither a positive pair ($Y_{uv} = 1$) nor a negative pair ($Y_{uv} = 0$). In some cases, the contrastive loss $\mathcal{L}_{\text{contrastive},uv}$ defined in Equation (5) may correspond to the mean binary cross-entropy over the pairs of sample molecules in the batch with both positive and negative labels.
$$\mathcal{L}_{\text{contrastive},uv} = \text{BCE}\left(\sigma\left(\frac{h_u^\top h_v}{\tau}\right), Y_{uv}\right) \quad (5)$$
wherein $h_u = t_u / \|t_u\|$, $t_u$ is the classification (CLS) token embedding for the sample molecule $u$ in the batch, and $\tau$ is the temperature controlling the randomness of the output of the tokenization model. In some cases, contrastive learning may be applied instead of predicting the dihedral distance $\mathcal{D}$ as a pretraining task for similar structures to be close in latent space (or embedding space). In some cases, a margin between positive pairs and negative pairs may be enforced to prevent the tokenization model from overfitting its representations to the threshold used for the definition of canonical clusters.
[000132] At 358, the tokenization model is trained to determine a codebook of classification (CLS) tokens encoding an amino acid residue sequence and a three-dimensional structure of molecules. In some example embodiments, in addition to the multimodal tokens determined for each amino acid residue in a training sample, the tokenization model may be trained to assign each structure to one of a K quantity of discrete classification (CLS) tokens.
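The pair labeling rules and the binary cross-entropy objective of Equation (5) can be illustrated with a minimal sketch. The thresholds 0.1 and 0.47 are the example values given in the text; the function names, the L2 normalization of the classification (CLS) embeddings, the sigmoid over the temperature-scaled dot product, and the default temperature of 0.1 are assumptions made for illustration only.

```python
import math

POS_THRESHOLD = 0.1   # example positive-pair threshold from the text
NEG_THRESHOLD = 0.47  # example negative-pair threshold from the text


def pair_label(len_u, len_v, distance):
    """Return 1 for a positive pair, 0 for a negative pair, None if the pair is skipped."""
    if len_u != len_v:
        return 0  # different lengths are always treated as negative
    if distance < POS_THRESHOLD:
        return 1
    if distance > NEG_THRESHOLD:
        return 0
    return None  # in-between pairs carry no label and are left out of the loss


def normalize(vec):
    """Scale a vector to unit L2 norm (assumed form of h_u)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_bce(cls_u, cls_v, label, temperature=0.1):
    """Binary cross-entropy between sigmoid(h_u . h_v / temperature) and the pair label."""
    h_u, h_v = normalize(cls_u), normalize(cls_v)
    logit = sum(a * b for a, b in zip(h_u, h_v)) / temperature
    prob = 1.0 / (1.0 + math.exp(-logit))
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1.0 - prob + eps))
```

In a training loop, pairs for which `pair_label` returns `None` would simply be dropped before averaging the loss over the batch.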
In some cases, the assignment of a structure (e.g., a complementarity determining region (CDR) loop) to one of the K quantity of discrete classification (CLS) tokens may enable fast and computationally efficient structural comparisons for high-throughput queries. In some cases, in order for the tokenization model to learn the quantized tokens $\hat{t}$ for inclusion in the codebook $C \in \mathbb{R}^{K \times d}$, a codebook learning loss $\mathcal{L}_{\text{codebook},u}$ expressed as Equation (6) below may be applied on the classification (CLS) token of structure $u$.
$$\mathcal{L}_{\text{codebook},u} = \left\|\text{sg}[t_u] - \hat{t}_u\right\|_2^2 + \alpha \left\|t_u - \text{sg}[\hat{t}_u]\right\|_2^2 \quad (6)$$
wherein sg denotes the stop gradient operator and $\alpha$ is the weight on the second commitment loss term. While some example embodiments of the tokenization model may be trained to determine a codebook of discrete classification (CLS) tokens, it should be appreciated that in alternative embodiments, the tokenization model may be trained to determine continuous tokens for representing the three-dimensional structure of the molecules.
[000133] In some example embodiments, the tokenization model may be trained using the overall loss function $\mathcal{L}$ in Equation (7), which captures the respective losses for masked reconstruction of dihedral angles, masked reconstruction of amino acid residue identities, contrastive learning of protein backbones, and codebook learning.
$$\mathcal{L} = \mathcal{L}_{\text{dihedral recon.}} + \mathcal{L}_{AA} + \mathcal{L}_{\text{contrastive}} + \mathcal{L}_{\text{codebook}} + \mathcal{L}_{\text{dihedral reg.}} \quad (7)$$
[000134] FIG. 4 depicts a schematic diagram illustrating an example of the planes present in the amino acid residues of a portion of a protein molecule 400, in accordance with some example embodiments. Referring to FIG. 4, the portion of the protein molecule 400 shown includes a first amino acid residue $R_1$, a second amino acid residue $R_2$, and a third amino acid residue $R_3$.
In some cases, the dihedral representation of the conformation of the protein molecule 400 includes the dihedral angles $\varphi$ between the peptide planes of each pair of amino acid residues present in the protein molecule 400. For example, the matrix below is an example of the dihedral representation of the conformation of the protein molecule 400 in instances where the protein molecule 400 includes an n quantity of amino acid residues. Each entry in the i-th row and j-th column of the matrix is occupied by the dihedral angle $\varphi_{i,j}$ between the i-th amino acid residue $R_i$ and the j-th amino acid residue $R_j$ in the amino acid residue sequence of the protein molecule 400.
$$\begin{array}{c|ccccc} & R_1 & R_2 & R_3 & \cdots & R_n \\ \hline R_1 & - & \varphi_{1,2} & \varphi_{1,3} & \cdots & \varphi_{1,n} \\ R_2 & \varphi_{2,1} & - & \varphi_{2,3} & \cdots & \varphi_{2,n} \\ R_3 & \varphi_{3,1} & \varphi_{3,2} & - & \cdots & \varphi_{3,n} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ R_n & \varphi_{n,1} & \varphi_{n,2} & \varphi_{n,3} & \cdots & - \end{array}$$
[000135] To further illustrate the concept of dihedral angles, FIG. 4 shows a first peptide plane 415 of the first amino acid residue $R_1$ and a second peptide plane 425 of the second amino acid residue $R_2$. In some cases, the dihedral angle $\varphi_{1,2}$ between the first amino acid residue $R_1$ and the second amino acid residue $R_2$ is the dihedral angle formed by the first peptide plane 415 of the first amino acid residue $R_1$ and the second peptide plane 425 of the second amino acid residue $R_2$. Moreover, although the first amino acid residue $R_1$ and the third amino acid residue $R_3$ are not adjacent to one another, the dihedral angle $\varphi_{1,3}$ between the first amino acid residue $R_1$ and the third amino acid residue $R_3$ is the dihedral angle formed by the first peptide plane 415 of the first amino acid residue $R_1$ and a third peptide plane (not shown) of the third amino acid residue $R_3$.
[000136] FIG. 5A depicts a schematic diagram illustrating an example of a process 500 for sequence-structure coherence based generation of protein sequences, in accordance with some example embodiments. In the example shown in FIG.
5A, the input molecule 150 includes the variable domain of the heavy chain (VH) of an antibody as well as the variable domain of the light chain (VL) of the antibody. In some cases, the canonical conformation computation model 115 may be applied to generate a canonical conformation signature of the input molecule 150. For example, the canonical conformation computation model 115 may generate the canonical conformation signature by determining, based at least on the amino acid residue sequence of the input molecule 150, a probability of the input molecule 150 exhibiting each canonical conformation class in a set of canonical conformation classes. In the example shown in FIG. 5A, the set of canonical conformation classes includes canonical conformation classes associated with the heavy chain of an antibody (e.g., canonical conformation classes H1, H2, and H3) as well as canonical conformation classes associated with the light chain of an antibody (e.g., canonical conformation classes L1, L2, and L3). Accordingly, the canonical conformation signature of the input molecule 150 may include a probability distribution across the canonical conformation classes H1-H3 and L1-L3. As shown in FIG. 5A, the similarity between two canonical conformation signatures may be quantified by a similarity metric measuring a difference between two probability distributions (e.g., Kullback-Leibler (KL) divergence, Wasserstein metric, and/or the like).
[000137] FIG. 5B depicts a schematic diagram illustrating an example of a molecule design system 550 performing canonical conformation signature guided protein sequence generation, in accordance with some example embodiments. Referring to FIGS.
1 and 5B, in some cases, the molecule design system 550 may include an encoder 560 (e.g., a convolutional encoder) that is shared between a classifier 565 (e.g., a token classifier) and a canonical class classifier ensemble 570 (e.g., an ensemble of N canonical class classifiers). In some cases, the classifier 565 (e.g., the token classifier) may implement the molecule design computation model 125 shown in FIG. 1A while the canonical class classifier ensemble 570 may implement the canonical conformation computation model 115 shown in FIG. 1A. In the example shown in FIG. 5B, the encoder 560 may be trained to generate a representation of an input sequence 555 for both the classifier 565 (e.g., the token classifier) to generate novel protein sequences and for the canonical class classifier ensemble 570 to determine a corresponding canonical conformation signature for guiding the generation of the novel protein sequences. In some cases, the classifier 565 may generate the novel protein sequences by performing a discrete diffusion denoising task that includes determining the identity of the amino acid residue for each sequence position. In some cases, the output of the classifier 565 may include, for each sequence position, an amino acid logit specifying the probability of each residue type.
[000138] In some cases, the canonical class classifier ensemble 570 may include an N quantity of ensemble elements, each of which includes a branch-specific encoder 560 and different classifier heads for determining the canonical conformation classes of different portions (or regions) of the input sequence 555.
For example, in instances where the input sequence 555 corresponds to an antibody, each ensemble element may include six classifier heads for predicting the canonical conformation classes of the complementarity determining regions (CDRs) of the input sequence 555, such as the first, second, and third complementarity determining regions on the heavy chain and light chain of the input sequence 555 (e.g., CDR H1, CDR H2, CDR H3, CDR L1, CDR L2, and CDR L3). In some cases, implementing the canonical conformation computation model 115 shown in FIG. 1A with the ensemble design depicted in FIG. 5B may improve prediction robustness and provide epistemic uncertainty estimates for guidance under imperfect information. In some cases, the generation of novel protein sequences by the classifier 565 (e.g., the token classifier) may be guided by predictions from all N quantity of ensemble elements in the canonical class classifier ensemble 570. For instance, where the input sequence 555 exhibits one or more properties of interest, deviation from the canonical conformation of the input sequence 555 (or specific portions or regions therein) may be quantified based on a similarity metric (e.g., Kullback-Leibler (KL) divergence, Wasserstein metric, and/or the like) determined based on the output from all N quantity of ensemble elements. In some cases, guidance from the similarity metric may steer the generation of the novel protein sequences (e.g., the discrete diffusion denoising task performed by the classifier 565) towards preserving the canonical conformation of the input sequence 555.
[000139] In some example embodiments, a canonical conformation classifier, whether implementing the canonical conformation computation model 115 individually or as a part of an ensemble (e.g., the canonical class classifier ensemble 570 in FIG. 5B), may be trained with training samples curated through round-robin partition rebalancing (RPR).
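As an illustration of how a deviation between two canonical conformation signatures might be scored from the ensemble output described above, the sketch below averages the class probabilities of the N ensemble elements per CDR and sums a Kullback-Leibler divergence over the CDRs. The data layout (a dict from CDR name to a list of per-element probability vectors), the averaging step, the summation over CDRs, and the KL direction (seed relative to design) are all assumptions chosen for illustration; the text does not specify them.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        p_k * math.log((p_k + eps) / (q_k + eps))
        for p_k, q_k in zip(p, q)
        if p_k > 0.0
    )


def ensemble_mean(distributions):
    """Average class probabilities across ensemble elements."""
    n = len(distributions)
    num_classes = len(distributions[0])
    return [sum(d[k] for d in distributions) / n for k in range(num_classes)]


def signature_divergence(seed_signature, design_signature):
    """Sum of per-CDR KL divergences between ensemble-averaged signatures.

    Each signature maps a CDR name (e.g., "H1") to a list of probability
    vectors, one per ensemble element (hypothetical layout).
    """
    total = 0.0
    for cdr, seed_members in seed_signature.items():
        p = ensemble_mean(seed_members)
        q = ensemble_mean(design_signature[cdr])
        total += kl_divergence(p, q)
    return total
```

A score near zero would indicate that a design is predicted to keep the canonical classes of the seed, while larger scores would flag designs drifting toward other classes.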
In some cases, standard balanced sampling approaches may be inadequate for addressing extreme class imbalance, as is the case in complementarity determining region (CDR) canonical class datasets in which rare canonical classes may be underrepresented by orders of magnitude. In some cases, round-robin partition rebalancing (RPR) may include stateful minibatch construction that cycles through data partitions while maintaining per-partition sampling state. In some cases, round-robin partition rebalancing (RPR) may address the limitation that empirical risk minimization tends to be dominated by frequent canonical conformation classes by reducing (or minimizing) the ideal loss $\mathcal{L}_{\text{ideal}}$ expressed by Equation (1) below where each canonical conformation class $c$ contributes equally. In some cases, round-robin partition rebalancing (RPR) may ensure uniform representation of rare canonical classes while maintaining computational efficiency compared to conventional sampling techniques, such as stratified oversampling.
$$\mathcal{L}_{\text{ideal}}(\theta) = \frac{1}{M} \sum_{c \in C} \frac{1}{|P_c|} \sum_{(x_j, y_j) \in P_c} \ell(x_j, y_j; \theta) \quad (1)$$
wherein $P_c$ denotes the data partitions defined by categorical feature combinations and $M = |C|$ is the number of partitions.
[000140] Table 3 below depicts pseudocode implementing an example of the algorithm for implementing round-robin partition rebalancing (RPR).
[000141] Table 3
Algorithm: Round-Robin Partition Rebalancing (RPR)
Input: Dataset $\mathcal{D}$, partitions $\{P_c\}_{c \in C}$, minibatch size $B$
Output: Balanced minibatch $\mathcal{B}$
Initialize: Shuffle partition order $O_P$ and example indices within each partition; set partition pointer $k_P = 0$ and example pointers $k_{E,c} = 0$ for all $c \in C$
for $i = 0$ to $B - 1$ do
  $c \leftarrow O_P[k_P]$ // Select current partition
  Add example $P_c[k_{E,c}]$ to minibatch $\mathcal{B}$
  $k_{E,c} \leftarrow (k_{E,c} + 1) \bmod |P_c|$ // Advance example pointer
  if $k_{E,c} = 0$ then
    Shuffle examples in $P_c$
  $k_P \leftarrow (k_P + 1) \bmod M$ // Advance partition pointer
  if $k_P = 0$ then
    Shuffle partition order $O_P$
return $\mathcal{B}$
[000142] In some example embodiments, the canonical conformation
computation model 115 and the molecule design computation model 125 in FIG. 1A may undergo multitask training in which each gradient step randomly selects either the canonical conformation computation model 115 or the molecule design computation model 125 for optimization. In the example of the molecule design system 550 shown in FIG. 5B, multitask training may include, at each gradient step, optimizing either the diffusion head of the classifier 565 or one of the canonical conformation class classifier heads in one of the N ensemble elements of the canonical class classifier ensemble 570. In some cases, subsequent to sampling a task, a balanced minibatch may be drawn from the corresponding task dataset (e.g., since the training data for the molecule design computation model 125 may include unlabeled training samples), the task loss may be computed, and a gradient update may be performed on the weights of the active subnetwork of the current task. For canonical conformation classification tasks under input corruption, label smoothing may be applied with a corruption-dependent coefficient as expressed by Equation (2) below.
$$y_{\text{smooth}} = (1 - \alpha_t)\, y_{\text{true}} + \frac{\alpha_t}{K} \mathbf{1} \quad (2)$$
wherein $\alpha_t$ denotes the masking fraction, $K$ denotes the number of canonical conformation classes, and $y_{\text{true}}$ is the one-hot encoded ground-truth label.
[000143] FIG. 6A depicts a schematic diagram illustrating an example of the tokenization model 116 generating a multimodal representation 605 of a molecule, which in this case is the complementarity determining region (CDR) 600 of an antibody. As shown in FIG. 6A, in some cases, the tokenization model 116 may include one or more transformer blocks. In some cases, the complementarity determining region 600 may include an amino acid residue sequence 601 and one or more backbone dihedral angles 602 (e.g., $\phi$, $\psi$, $\omega$).
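Returning to the round-robin partition rebalancing (RPR) procedure of Table 3, the stateful minibatch construction can be sketched as a small sampler class. The class name `RoundRobinPartitionSampler`, the dict-of-lists input, and the fixed random seed are hypothetical choices for illustration; the pointer updates and reshuffling follow one reading of the Table 3 pseudocode, in which the partition pointer advances after every sampled example.

```python
import random


class RoundRobinPartitionSampler:
    """Stateful sampler that cycles through partitions, one example per visit."""

    def __init__(self, partitions, seed=0):
        # partitions: dict mapping a partition key to a non-empty list of examples
        self.rng = random.Random(seed)
        self.partitions = {c: list(examples) for c, examples in partitions.items()}
        if not self.partitions or any(not p for p in self.partitions.values()):
            raise ValueError("every partition must contain at least one example")
        for examples in self.partitions.values():
            self.rng.shuffle(examples)
        self.order = list(self.partitions)
        self.rng.shuffle(self.order)
        self.k_p = 0                                 # partition pointer
        self.k_e = {c: 0 for c in self.partitions}   # per-partition example pointers

    def next_batch(self, batch_size):
        batch = []
        for _ in range(batch_size):
            c = self.order[self.k_p]                 # select current partition
            examples = self.partitions[c]
            batch.append(examples[self.k_e[c]])
            self.k_e[c] = (self.k_e[c] + 1) % len(examples)  # advance example pointer
            if self.k_e[c] == 0:
                self.rng.shuffle(examples)           # partition exhausted: reshuffle it
            self.k_p = (self.k_p + 1) % len(self.order)      # advance partition pointer
            if self.k_p == 0:
                self.rng.shuffle(self.order)         # full cycle done: reshuffle order
        return batch
```

Because the pointers persist between calls, a rare partition with only two examples contributes as many samples per cycle as a partition with thousands, and consecutive minibatches continue the cycle rather than restarting it.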
In some cases, each amino acid residue in the complementarity determining region 600 may include a type of amino acid residue (e.g., one of the twenty canonical amino acid residues), which is combined with the corresponding backbone dihedral angles $\phi$, $\psi$, $\omega$ by elementwise addition. In some cases, the tokenization model 116 may encode the sequence 601 and the backbone dihedral angles 602 of the complementarity determining region 600. In some cases, the resulting multimodal representation 605 may include, for each amino acid residue in the complementarity determining region 600, a multimodal token encoding the identity and the backbone dihedral angles of the amino acid residue. In some cases, the multimodal representation 605 may further include a classification (CLS) token encoding the overall three-dimensional structure of the complementarity determining region 600.
[000144] FIG. 6B depicts a schematic diagram illustrating an example of a process for training the tokenization model 116 on a contrastive learning objective, in accordance with some example embodiments. Referring to FIG. 6B, in some cases, the tokenization model 116 may be trained on the contrastive objective to generate similar multimodal representations for similar three-dimensional structures and dissimilar multimodal representations for dissimilar three-dimensional structures. In the example shown in FIG. 6B, the tokenization model 116 may be applied to generate the multimodal representation 605 encoding the amino acid residue sequence 601 and the backbone dihedral angles 602 of the complementarity determining region (CDR) 600. Furthermore, in some cases, the tokenization model 116 may be applied to generate a multimodal representation 607 encoding the amino acid residue sequence 603 and the backbone dihedral angles 604 of a different complementarity determining region 610.
In some cases, the contrastive loss may be a binary cross entropy loss between the classification (CLS) token included in the multimodal representation 605 and the classification (CLS) token included in the multimodal representation 607. In some cases, where the complementarity determining region 600 and the complementarity determining region 610 are a positive pair exhibiting sufficient structural similarities (e.g., same length and dihedral angle distance $\mathcal{D}$ satisfying one or more thresholds), one or more parameters (e.g., weights, biases, and/or the like) of the tokenization model 116 may be adjusted to increase (or maximize) the similarity (e.g., cosine similarity) between the classification (CLS) tokens of the corresponding multimodal representations 605 and 607. In some cases, where the complementarity determining region 600 and the complementarity determining region 610 are a negative pair exhibiting sufficient structural dissimilarities (e.g., different lengths or dihedral angle distance $\mathcal{D}$ satisfying one or more different thresholds), one or more parameters (e.g., weights, biases, and/or the like) of the tokenization model 116 may be adjusted to decrease (or minimize) the similarity (e.g., cosine similarity) between the classification (CLS) tokens of the corresponding multimodal representations 605 and 607. In some cases, where the complementarity determining region 600 and the complementarity determining region 610 satisfy neither the criteria for a positive pair nor the criteria for a negative pair (e.g., $0.1 < \mathcal{D} < 0.47$), the complementarity determining region 600 and the complementarity determining region 610 may be disregarded, for example, by being excluded from the training of the tokenization model 116.
[000145] FIG.
6C depicts a schematic diagram illustrating an example of a process for training the tokenization model 116 on masked reconstruction of backbone dihedral angles and amino acid residue identities, in accordance with some example embodiments. As shown in FIG. 6C, in some cases, the tokenization model 116 may be trained to recover masked amino acid residue identities, masked dihedral angles, and combinations thereof. FIG. 6C depicts several masking approaches 650. For example, in some cases, the identities and backbone dihedral angles (e.g., $\phi$, $\psi$, $\omega$) of some but not all of the amino acid residues in a protein molecule serving as a training sample may be masked. In some cases, the identities of some but not all of the amino acid residues may be masked while none of the corresponding backbone dihedral angles are masked. In some cases, some but not all of the backbone dihedral angles may be masked while none of the amino acid residue identities are masked. In some cases, the identities of every amino acid residue may be masked but not any of the corresponding backbone dihedral angles. In some cases, the backbone dihedral angles of every amino acid residue may be masked but not the identity of any of the corresponding amino acid residues. In some cases, prior to being trained to reconstruct masked backbone dihedral angles and amino acid residue identities, the tokenization model 116 may be pretrained, for example, on the pretraining objective 655, to generate multimodal representations from which a sequence encoder is able to recover the corresponding amino acid residue sequence and a structure decoder is able to recover the corresponding backbone dihedral angles.
[000146] FIG. 6D depicts a schematic diagram illustrating an example of a process for training the tokenization model 116 on codebook learning, in accordance with some example embodiments. Referring to FIG.
6D, in some cases, the tokenization model 116 may be trained to generate, for the complementarity determining region (CDR) 600, the multimodal representation 605 to include a classification (CLS) token encoding the overall structure of the complementarity determining region 600. In some cases, the codebook learning may include the tokenization model 116 being trained to assign, to the complementarity determining region 600, a discrete classification (CLS) token. In some cases, the codebook may include a limited quantity of possible classification (CLS) tokens, each of which corresponds to a cluster of complementarity determining regions (CDRs) exhibiting sufficient structural similarity.
[000147] Experimental Examples for Canonical Conformation Signatures
[000148] A molecule design computation model, such as the molecule design computation model 125 shown in FIG. 1A, was implemented as a guided discrete diffusion model with a canonical conformation computation model (e.g., the canonical conformation computation model 115 shown in FIG. 1A) implemented using six complementarity determining region (CDR) specific classifiers. The molecule design computation model was jointly trained with the canonical conformation computation model. During generation, the molecule design computation model is guided to generate novel protein sequences that conform to the canonical conformation signature of a reference molecule (e.g., the input molecule 150, the seed molecule 160, and/or the like), which includes canonical class probabilities for individual complementarity determining regions (CDRs). The reference molecule exhibits one or more properties of interest, including binding affinity towards three different target antigens.
The molecule design computation model generated novel protein sequences with binding rates ranging from 15-60% for the three target antigens and demonstrated a 3-5× improvement in binding affinity over the reference molecule for two of the three target antigens. Experimental structure prediction performed for the binders of two target antigens further revealed that the novel protein sequences generated by the molecule design computation model, while exhibiting biophysically non-trivial mutations, still maintained the backbone complementarity determining region (CDR) canonical conformation of the reference molecule. The canonical conformation computation model also successfully identified, within the novel protein sequences, binders with an accuracy of 68%.
[000149] The molecule design computation model and the canonical conformation computation model were jointly trained on two datasets, the SabDab set of sequences and the pOAS. While SabDab represents the collection of sequences from the PDB crystallographic and cryo-EM data, pOAS is the set of publicly released sequences obtained from repertoire sequencing studies, without accompanying crystallographic structural data. The structures of the pOAS sequences are predicted using ABB2. The molecule design computation model and the canonical conformation computation model were jointly trained using canonical cluster labels from both the SabDab crystallographic dataset and the folded pOAS dataset to allow exposure to a large quantity of sequence diversity.
[000150] FIG. 7 depicts the canonical cluster landscape for SabDab (PDB) and the folded pOAS dataset. In FIG. 7, the panels show the proportion of the total dataset assigned to each cluster (y-axis) for six complementarity determining regions (CDRs). Canonical clusters (x-axis) are displayed beginning with the CDR identifier (e.g., H1), followed by the sequence length (e.g., 13), and finally the specific cluster within that length, for example, H1-13-1.
Both the PDB and pOAS datasets contain defined clusters (or classes) for all CDRs, as well as some undefined “noise” clusters (denoted by *). For example, CDR H3 contains the largest proportion of sequences within these noise clusters. Unlike defined clusters, noise clusters do not correspond to a consistent structural pattern. Instead, noise clusters contain CDR sequences that do not fit into any of the defined structural clusters for that length. Overall, PDB and pOAS canonical cluster landscapes exhibit similar trends. However, the pOAS does consistently show a higher proportion of sequences assigned to the noise cluster for H3, particularly for longer CDR lengths. The PDB dataset and the joint PDB and pOAS dataset were used to train two independent model instances to predict the canonical conformations of the six CDRs of an antibody sequence.
[000151] Table 4 shows the test accuracies of the model trained on (i) the PDB dataset and on (ii) the combined PDB and pOAS dataset. The performance of the model critically improves for H3 and L3 CDRs with the addition of data from the pOAS dataset, notwithstanding the increase in the number of canonical clusters within the pOAS dataset. However, for both model instances, the accuracy of H3 and L3 predictions remains significantly lower compared to other CDRs, likely due to the greater prevalence of noise clusters in H3 and L3 rendering prediction more challenging. The increase in the number of noise clusters in the pOAS data is reflected in the slight decrease of accuracy for H1 for the combined PDB and pOAS model on the PDB test set (the total number of clusters increases from 14 to 22 for H1 by addition of 7 noise clusters). Both the PDB model and the joint PDB and pOAS model perform better on test sets drawn from their own respective datasets.
Notably, the joint PDB and pOAS model achieves comparable (or higher) performance to the PDB model across both test sets, and significantly outperforms it on H3 and L3 predictions for both test sets.
[000152] Table 4
CDR | PDB model, PDB test set | PDB model, pOAS test set | PDB + pOAS model, PDB test set | PDB + pOAS model, pOAS test set
H1 | 0.78 | 0.69 | 0.76 | 0.89
H2 | 0.81 | 0.59 | 0.83 | 0.96
H3 | 0.46 | 0.16 | 0.74 | 0.81
L1 | 0.76 | 0.72 | 0.82 | 0.95
L2 | 0.83 | 0.74 | 0.80 | 0.96
L3 | 0.63 | 0.27 | 0.72 | 0.85
[000153] Potential biases arising due to the choice of folding method were investigated by comparing whether the canonical conformation computation model trained using sequences and canonical class labels from ABB2 folded structures exhibited an over-representation of certain canonical clusters when compared to labels derived from a different folding method. This evaluation included benchmarking the structurally annotated canonical classes from ABB2 as well as ESMFold folded structures to predictions made using canonical conformation signatures for a randomized test set of 2000 antibody sequences from pOAS (held out from model training). Canonical classes of the ABB2 and ESMFold folded structures were determined using the dihedral angle based clustering method. The results of the comparison are shown in FIG. 8. The frequency of antibody CDRs assigned to each canonical class by the three methods was compared (in FIG. 8(a)) and, given the low frequency of some class predictions, this data was also represented as a percentage to enable better visualization of differences within low-represented classes (in FIG. 8(b)). More specifically, FIG. 8(a) shows the frequency of assignment of CDR sequences to a cluster, out of 2000 total sequences. Each color represents a distinct cluster ID within a given CDR length (depicted on the x-axis). The three bars within each CDR length represent the assignments from ABB2, ESMFold, and the models, respectively, from left to right. FIG.
8(b) shows the percentage allocation of clusters across CDR lengths to highlight variations within underrepresented classes.
[000154] Overall, fairly high levels of agreement between the methods were observed. Many CDR lengths contained only one canonical class prediction by all three approaches; for example, H2-9 (all H2-9-2), L1-16 (all L1-16-1) and L1-17 (all L1-17-1). All CDR H3 longer than 14 amino acids were assigned entirely to noise clusters as no canonical classes exist for H3 with such lengths within this dataset, as is the case for light chain CDRs L2-13, L3-6, L3-7, L3-12 and L3-14. L3-9 had the most variation in canonical conformations, with all four canonical classes predicted by each method, in addition to noise clusters. Overall, the predictions made using canonical conformation signatures aligned closely with ABB2 structural labels over the set of 2000 sequences, especially for H1, H2, L1, and L2 (> 95% agreement), while H3 and L3 showed slightly lower agreement between canonical conformation signatures and ABB2 labels (86% and 90% agreement, respectively). Canonical conformation signatures showed slightly lower agreement with ESMFold structural labels, with drops in percentage agreement for H2 (84%), H3 (87%), L1 (75%), and L3 (78%), while H1 (93%) and L2 (94%) showed strong agreement between labels.
[000155] Differences within rare clusters that appear in a limited number of sequences are highlighted by analyzing the percentage allocation across clusters (FIG. 8(b)), which improved the visibility of variations within underrepresented classes. Agreement between structural assignments made using canonical conformation signatures and dihedral-angle based clustering of ABB2 folded structures for these classes was generally expected given the canonical conformation computation model was trained on ABB2 folded structures. Differences were observed primarily across CDR H3, with varying levels of noise cluster assignments.
This was particularly evident for H3-7 where 100% of ABB2 structures were classified as H3-7-1 but 50% were predicted as noise with canonical conformation signatures (FIG. 8(b)). By contrast, differences between ABB2 (or canonical conformation signatures) and ESMFold were notable for lengths H1-14, H2-10, H2-11, H2-12, L1-13 and L1-14. For H1-14, 100% of ESMFold models adopt canonical class H1-14-2 rather than H1-14-1, which was adopted by canonical conformation signatures and ABB2. For H2-11 and H2-12 CDRs, only ESMFold structures were classed into noise clusters. Across the light chain CDRs, ESMFold structures were more frequently assigned to noise clusters than the other two methods (e.g., L1-11, L1-14, L3-8, L3-10, L3-13). Interestingly, ESMFold structures are less likely to adopt the L3-10-cis7-cis8 canonical class but instead favor other L3-10 canonical forms. It is worth noting that many of these clusters showing large differences across the three methods are observed within a smaller set of sequences, often fewer than 10, and are likely underrepresented within the pOAS data.
[000156] The molecule design model generates antibody CDR sequences from a seed molecule (e.g., a known binder) as the starting point. The binder status of the seed molecule ensures that a functional binding mode exists, and the novel protein sequences generated therefrom are mutated to be diverse while still maintaining the original CDR conformation of the seed. When the seed molecule contains canonical CDRs, mutations can be applied to all CDR regions to introduce diversity. In contrast, for seed molecules with non-canonical CDRs (CDRs assigned to noise clusters), mutations are selectively restricted to canonical CDR regions, leaving non-canonical CDR sections intact to preserve the “noisy” binding mode of the original molecule.
[000157] To further illustrate, antibody sequences were generated for four seed molecules with canonical CDR conformations.
Each of the six CDRs was masked (in independent and pairwise combinations), and sequences were generated to introduce diversity within the masked regions. The resulting edit distances of CDR designs, shown in FIG. 9(c) (average edit distances reported in Table 5 below), depend on the CDR sequence and its canonical cluster. While the edit distances in FIG. 9(c) (and Table 5) relate to individual CDRs, they indicate that mutations spanning multiple CDRs can generate substantial diversity, often pushing total edit distances well beyond 25-30. During canonical conformation guided generation, these edits can be limited to select CDR regions, or distributed widely across all CDR regions.
[000158] Table 5
PDB    H1          H2         H3          L1          L2         L3
1a7p   7.14        5.6        6.24        3.95        5.78       5.85
       (H1-13-1)   (H2-9-1)   (H3-10-1)   (L1-11-2)   (L2-8-4)   (L3-9-4)
1dqm   6.37        3.02       5.43        3.44        5.94       5.02
       (H1-13-1)   (H2-9-1)   (H3-10-1)   (L1-11-2)   (L2-8-4)   (L3-9-cis7-1)
1jfq   5.88        8.27       7.45        5.27        4.48       4.98
       (H1-13-1)   (H2-9-1)   (H3-10-1)   (L1-11-2)   (L2-8-4)   (L3-9-cis7-1)
3v0w   4.09        8.19       7.83        6.70        3.61       4.86
       (H1-13-1)   (H2-9-1)   (H3-10-1)   (L1-11-2)   (L2-8-4)   (L3-9-cis7-1)
[000159] FIG. 9(a) and 9(b) illustrate the CDR and sequence logo plots, respectively, for sequences generated from starting seed 1a7p. The generated CDR sequences in FIG. 9(a) (top) strongly resemble their canonical cluster motif (bottom), with diversity being constrained by the patterns present in the starting seed CDR sequence. For example, although both 1a7p and 3v0w belong to the H1-13-1 canonical cluster, 1a7p exhibits greater mutational diversity on CDR H1 and a higher average edit distance (7.14) compared to 3v0w (4.09) (Table 5).
[000160] Some canonical conformation signature guided designs feature a “locked-in” effect, where a single residue, or a group of residues, dominates and repeats through all designs.
This phenomenon is expected, as canonical conformation signature guided sequence design inherits dominant patterns from the canonical cluster; however, it also reduces the diversity of the generated designs. To minimize the locked-in effect, masking can be strategically applied, particularly at positions that contain dominating motifs. Random sampling (all CDR positions masked), on the other hand, is more likely to produce designs that consistently contain repeating “locked-in” patterns in every sequence.
[000161] FIG. 9(b) illustrates the sequence logo for 1a7p designs created with canonical conformation signature guidance. On H1, dominant “locked-in” mutations included G to W at Aho position 40 (corresponding to position 11 on the CDR logo plot), G to S at Aho position 33 (position 9 on the CDR logo plot), V to I or M at Aho position 41 (position 12 on the CDR logo plot), and N to S at Aho position 42 (position 13 on the CDR logo plot). These mutations follow the position-specific residue probabilities of the H1-13-1 canonical cluster, hence the tendency of designs to mutate to these residues.
[000162] The performance of canonical conformation signatures as a scoring method was tested on a set of designs generated for EGFR, IL6, and OSM targets. In total, 1609 designs, comprising 958 binders and 651 non-binders/non-expressors, were generated using machine learning methods and datasets and experimentally tested for binding to the above targets (no designs generated with guidance from canonical conformation signatures were included in this analysis). FIG. 10(a) visualizes canonical conformation signature KL divergences of seed molecules to designs for these sequences. Here, the KL divergence quantifies the difference between the canonical conformation signature predicted canonical class probabilities of each design and those of its seed.
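The seed-to-design comparison described above can be illustrated with a minimal sketch. It assumes the divergence is taken as KL(seed || design) over the predicted canonical class probabilities of a single CDR; the direction of the divergence, the function name `kl_divergence`, and the probability vectors are illustrative assumptions rather than details given in the description.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions over the same classes.

    Terms with p_i == 0 contribute nothing; q_i is floored at eps so that a
    class the design assigns zero probability does not cause a division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same classes")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


# Hypothetical class probabilities for one CDR
# (three canonical classes plus a noise class).
seed = [0.90, 0.05, 0.03, 0.02]
design_close = [0.88, 0.06, 0.04, 0.02]  # keeps the seed's dominant class
design_far = [0.10, 0.70, 0.15, 0.05]    # shifts to a different class

kl_close = kl_divergence(seed, design_close)
kl_far = kl_divergence(seed, design_far)
print(f"close design: {kl_close:.4f}, distant design: {kl_far:.4f}")
```

A design that preserves the seed's dominant canonical class yields a divergence near zero, while a design that shifts probability mass to another class yields a much larger value.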
A divergence close to zero suggests strong alignment between seed and design canonical class probabilities, indicating similar complementarity determining region (CDR) backbone conformations. As seen in FIG. 10(a), binders exhibit a pronounced peak at zero KL divergence, whereas non-binders display a broader KL divergence distribution with lower and wider peaks, indicating significant differences from the seed. Across the six CDRs, the differences in peaks are most pronounced in L1, L3, and H1, indicating that binders show prominent variations from non-binders in these regions, while L2 exhibits the least difference between binders and non-binders. For this set of designs, canonical conformation signature based binding scores are computed by checking whether the canonical conformation signature predicted canonical class matches that of the seed. Designs that matched were labeled “predicted binders,” while those that did not were labeled “predicted non-binders.” Table 6 shows the contingency table of predicted vs. true binders (and non-binders) for this experiment. Canonical conformation signatures achieved 68% accuracy in predicting binders using this approach, with 0.72 precision and 0.76 recall, meaning that roughly 72% of predicted binders were true binders and roughly 76% of true binders were identified. Furthermore, a chi-squared test of independence showed a significant association between canonical conformation signature predictions and experimental binding outcomes (χ² = 170.23, p < 0.05), indicating strong agreement.
[000163] In a separate experiment, the performance of canonical conformation signature guidance in a naive binder generation task was evaluated by generating designs for three targets: IL6, EGFR, and anonymous target A, based on lead antibody sequences as seed molecules. For IL6 and EGFR, a single seed sequence (IL6-1409 and EGFR-1687, respectively) was tested in each case, whereas target A was tested using two distinct seeds. FIG.
10(b) shows results for binder generation and the corresponding SPR-based binding affinity of designs. Here, binding affinity is defined as the negative log-transformed equilibrium dissociation constant: $pK_D = -\log_{10}(k_d / k_a)$, with $k_a$ and $k_d$ denoting the association and dissociation rate constants, respectively. Each design set includes antibody designs derived from an antibody “lead,” a known antigen binder with a given pKD, and binding affinity differences, ΔpKD, are computed relative to this lead (also termed “seed”). For the three antigen targets, canonical conformation signature guidance successfully generated binders in the first generation without any auxiliary target-specific data. In the case of EGFR and target A, canonical conformation signature guidance produced two binders having significant improvement in binding affinity over the seed during the first round of generation. Success for target A was notably seed-dependent, as the second seed for target A failed to produce any binders. The overall binding rate across all targets was 22%; designs based on the successful seed for target A yielded a 19% binding rate, while IL6 and EGFR yielded 57% and 15% binding rates, respectively. Importantly, none of these designs required any data beyond the seed sequence for binder generation. In the absence of random mutagenesis data as the ideal baseline, repertoire-based picks were used as a comparative benchmark for ΔpKD. Repertoire picks are sequences from the seed clonotype that are presumed to bind the target antigen and were selected using internal scoring methods based on trained protein language models. Repertoire picks were sampled across a range of scores to reduce potential bias or inaccuracies in the scoring functions. Notably, the top binder generated with canonical conformation signature guidance outperformed all the repertoire picks for EGFR and had a > 5× improvement in binding affinity over the seed (ΔpKD = 0.75) (FIG. 10(b)).
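As a worked illustration of the affinity arithmetic in this paragraph, the sketch below converts rate constants to pKD and a ΔpKD to a fold change in the equilibrium dissociation constant. It assumes the standard relation K_D = k_d / k_a with k_a the association rate constant and k_d the dissociation rate constant; the function names and the example rate constants are hypothetical and are not taken from the description.

```python
import math


def pkd(k_a, k_d):
    """pKD = -log10(K_D), with K_D = k_d / k_a.

    k_a: association rate constant (1/(M*s)); k_d: dissociation rate constant (1/s).
    """
    if k_a <= 0 or k_d <= 0:
        raise ValueError("rate constants must be positive")
    return -math.log10(k_d / k_a)


def fold_improvement(delta_pkd):
    """Fold change in K_D implied by a pKD difference relative to the seed."""
    return 10.0 ** delta_pkd


# Hypothetical seed: k_a = 1e5 1/(M*s), k_d = 1e-3 1/s, so K_D = 1e-8 M and pKD = 8.
seed_pkd = pkd(k_a=1.0e5, k_d=1.0e-3)

# ΔpKD values reported in this paragraph for the top EGFR and target A binders.
egfr_fold = fold_improvement(0.75)
target_a_fold = fold_improvement(0.48)
print(f"seed pKD = {seed_pkd:.2f}")
print(f"EGFR: {egfr_fold:.2f}x, target A: {target_a_fold:.2f}x")
```

Because pKD is a base-10 logarithm, a ΔpKD of 0.75 corresponds to a fold change of 10^0.75, which is consistent with the "> 5×" figure stated for EGFR, and a ΔpKD of 0.48 is consistent with the "> 3×" figure stated for target A.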
For target A, the top binder generated with canonical conformation signature guidance had a > 3× improvement in binding affinity over the corresponding seed molecule (ΔpKD = 0.48), and was approximately 2× better than the top repertoire-picked binder sequence. For IL6, canonical conformation signature guidance achieved a high binding rate but did not surpass the seed or any of the repertoire picks in affinity. These findings demonstrate that canonical conformation signature guidance can yield binders that are competitive with, and sometimes superior to, repertoire-informed picks without any auxiliary data except the seed sequence.
[000164] Table 6
                    Predicted Non-binder   Predicted Binder
Non-binder (true)   363                    288
Binder (true)       227                    731
[000165] To examine underlying backbone conformations, crystal structures were determined for the original lead molecule and the top performing binders for both EGFR and IL6 generated with canonical conformation signature guidance. For EGFR, the seed and binder crystal structures were determined to 1.78 Å and 2.4 Å resolution, respectively. The optimized EGFR binder incorporates three mutations in the CDR H2 loop (Q53Y, T56Y, and I57V), designed to promote binding. FIGS. 10(c) and (d) show the EGFR seed and design structures overlaid, and the CDR backbone structures, which conform to each other. CDR backbones are rendered as sticks to show alignment, with side chains hidden for clarity, except for the mutations highlighted in FIG. 10(d). FIG. 10(d) shows each of the three mutations in detail, showing the side chain of the original residue as well as the mutated residue (all other side chains hidden for clarity). Despite the introduction of two biochemically disparate and large, ring-based mutations (Q53Y and T56Y), the CDR backbone conformation of the design remained unaltered relative to the original seed. While the backbone is conserved, the side chain mutations introduce new hydrophobic interactions.
For IL6, crystal structures for the lead IL6 Fab as well as the top IL6 binder were determined to 1.9 Å and 2.61 Å resolution, respectively, for structural comparison. Three mutations (Y32F, A55G, and T53S) were introduced in CDR H1 and H2 of the heavy chain to promote binding. FIGS. 10(e) and (f) show the seed and the design structures overlaid, their CDR backbones in stick representation, and the aforementioned mutations. Side chains are hidden for clarity, except in FIG. 10(f), where they are shown for the mutated residues and their original counterparts. Throughout the design, and across all CDRs, the backbone conformation remains unaltered relative to the seed, demonstrating the ability of canonical conformation signature guidance to produce binder designs that conform to the seed CDR backbone.
[000166] Experimental Examples for Multimodal Representations
[000167] The performance of the multimodal representations applied to paratope retrieval was evaluated for a set of query complementarity determining regions (CDRs), for which the multimodal representations generated by a tokenization model (e.g., the tokenization model 116 in FIG. 1B) were used to retrieve, from a repository of complementarity determining regions (CDRs), those with the closest experimentally determined backbone structure. This evaluation directly assesses how well various example embodiments of the multimodal representations described herein capture paratope-level three-dimensional structure.
[000168] Query CDRs are from a test set of CDRs unseen by the tokenization model during its training, while the repository CDRs are those from the train and validation sets of SAbDab. The classification (CLS) tokens t from the multimodal representations generated by the tokenization model were used to retrieve the 20 complementarity determining region (CDR) loops with the highest cosine similarity from loops of the same type and length.
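The retrieval step described above can be sketched as a cosine-similarity ranking restricted to repository loops of the same type and length as the query. This is a minimal illustration only: the dictionary layout (`loop_type`, `sequence`, `cls`), the function names, and the toy three-dimensional vectors are assumptions made for the example and do not reflect the actual representation produced by the tokenization model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, repository, k=20):
    """Return the k repository loops most similar to the query CLS vector,
    considering only loops of the same type and the same length."""
    candidates = [
        entry for entry in repository
        if entry["loop_type"] == query["loop_type"]
        and len(entry["sequence"]) == len(query["sequence"])
    ]
    candidates.sort(
        key=lambda entry: cosine_similarity(query["cls"], entry["cls"]),
        reverse=True,
    )
    return candidates[:k]


# Toy data; the sequences and vectors are made up for illustration.
query = {"loop_type": "H3", "sequence": "ARDYW", "cls": [1.0, 0.0, 0.0]}
repository = [
    {"id": "a", "loop_type": "H3", "sequence": "ARGYW", "cls": [0.9, 0.1, 0.0]},
    {"id": "b", "loop_type": "H3", "sequence": "ARDFW", "cls": [0.0, 1.0, 0.0]},
    {"id": "c", "loop_type": "L1", "sequence": "QSISS", "cls": [1.0, 0.0, 0.0]},   # other loop type
    {"id": "d", "loop_type": "H3", "sequence": "ARDYYW", "cls": [1.0, 0.0, 0.0]},  # other length
]

hits = retrieve_top_k(query, repository, k=2)
print([hit["id"] for hit in hits])
```

Filtering before ranking keeps the comparison within a loop type and length, so entries "c" and "d" are never candidates even though their vectors match the query exactly.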
Retrieved CDRs are deemed correct if the dihedral angle distance D < 0.47 (Equation (5)) or RMSD < 1 Å to the query CDR loop.
[000169] Paratope retrieval performance using the classification (CLS) token t was compared to embeddings from protein language models trained on UniProt, ESM C (ESM Team, 2024) and ESM-2 (3B), and models trained on OAS, AbLang2 and IgBert. Additionally, the quality of retrieval when structure is also given was evaluated and compared to the multimodal protein language models SaProt and ProstT5, which also take as input the Foldseek 3Di tokens derived from the protein structure. To ensure embeddings focus on the loop, for models which embed the whole protein sequence, the loop embedding is defined as the mean embedding over the amino acids in the loop. Paratope retrieval performance of the classification (CLS) token t was also compared to that of continuous structure tokenizers, including the inverse folding models MIF and ProteinMPNN, and the continuous encoder embeddings of the VQVAE models Foldseek 3Di and Amino Aseed.
[000170] The models were evaluated with precision at rank 20, and the classification (CLS) token t achieved state-of-the-art similar paratope retrieval performance, based on dihedral distance D < 0.47 or RMSD < 1 Å from loops of the same length, when given sequence and dihedral angle input (Table 7). Compared to larger protein language models pretrained with masked language modeling, the tokenization model described herein and Foldseek achieve higher precision. For the retrieval of similar H3 loop structures based on dihedral distance D, various example embodiments of the tokenization model of the present disclosure outperform the best structure tokenizer, Amino Aseed, by 5.9% and the best protein language model or antibody language model, ESM-2, by 69.8%.
The H3 loop is particularly hard to represent for sequence-only language models due to the high sequence diversity owing to V(D)J recombination.
[000171] Table 7 below depicts the average precision at rank 20 for retrieval of similar CDR paratopes. Different models are shown in rows. The first, second, and third best performance for each column are highlighted.
[000172] Table 7
Group   Model                RMSD < 1 Å                                  D < 0.47
                             L1     L2     L3     H1     H2     H3      L1     L2     L3     H1     H2     H3
        Random               0.545  0.557  0.373  0.249  0.351  0.127   0.648  0.730  0.392  0.559  0.508  0.126
PLM     ESM C                0.750  0.700  0.489  0.418  0.519  0.19    0.811  0.916  0.517  0.692  0.702  0.208
PLM     ESM-2 (3B)           0.740  0.704  0.500  0.425  0.522  0.206   0.802  0.904  0.534  0.706  0.688  0.237
AbLM    AbLang2              0.689  0.604  0.482  0.402  0.497  0.173   0.761  0.782  0.537  0.602  0.699  0.222
AbLM    IgBert               0.705  0.622  0.482  0.377  0.479  0.182   0.773  0.813  0.511  0.709  0.677  0.216
MPLM    SaProt               0.737  0.704  0.499  0.420  0.491  0.218   0.790  0.918  0.578  0.688  0.646  0.248
MPLM    ProstT5              0.782  0.716  0.539  0.458  0.586  0.276   0.846  0.941  0.629  0.711  0.756  0.359
IF      MIF                  0.776  0.699  0.516  0.432  0.491  0.231   0.833  0.933  0.604  0.702  0.641  0.298
IF      ProteinMPNN          0.804  0.700  0.546  0.459  0.521  0.286   0.839  0.943  0.632  0.732  0.710  0.372
VQVAE   Foldseek 3Di         0.785  0.696  0.556  0.467  0.591  0.281   0.849  0.909  0.640  0.715  0.730  0.362
VQVAE   Amino Aseed          0.812  0.713  0.542  0.420  0.529  0.292   0.851  0.952  0.625  0.688  0.713  0.379
        Tokenization Model   0.793  0.705  0.558  0.459  0.578  0.278   0.851  0.956  0.674  0.715  0.749  0.402
[000173] The performance of the classification (CLS) token t for clustering antibody structures is evaluated by using the codebook learned by the tokenization model to generate the predicted clusters $\mathcal{C} = \{C_1, \dots, C_K\}$, while $\mathcal{G} = \{G_1, \dots, G_L\}$ denotes the reference clusters. More specifically, the performance of the classification (CLS) token t for clustering antibody structures is evaluated based on cluster purity, which quantifies the agreement between the predicted clusters $\mathcal{C} = \{C_1, \dots, C_K\}$ induced by the codebook t and the reference clusters $\mathcal{G} = \{G_1, \dots, G_L\}$.
For each predicted cluster $C_k$, a dominant reference cluster may be selected based on majority vote: $y^*(k) = \arg\max_{l} |C_k \cap G_l|$. Items in the predicted cluster $C_k$ whose reference label equals $y^*(k)$ may be identified as being correctly assigned to the cluster $C_k$. In some cases, the overall accuracy of the predicted clusters may correspond to the proportion of correctly assigned items, as shown in Equation (8) below.
$$\mathrm{Purity}(\mathcal{C}, \mathcal{G}) = \frac{1}{N} \sum_{k=1}^{K} \max_{l} |C_k \cap G_l|, \qquad N = \sum_{l=1}^{L} |G_l| \qquad (8)$$
[000174] All loops in SAbDab that can be assigned to a reference cluster with a cutoff of dihedral angle distance D = 0.47 to the centroid were evaluated. A limitation of the existing canonical clustering approach is that several loops are not assigned to any cluster. The cluster purity of those containing unassigned loops, which are typically referred to as belonging to “noise” clusters, are not disregarded for this evaluation.
[000175] The reference definition of clusters assigns different clusters for different loop types and lengths. As such, the cluster's loop-type purity and loop-length purity, which are defined as follows in Equation (9), are also evaluated.
$$p_k^{\mathrm{type}} = \frac{1}{n_k} \max_{t} \sum_{x \in C_k} \mathbb{1}\{\text{loop type}(x) = t\}, \qquad p_k^{\mathrm{len}} = \frac{1}{n_k} \max_{\ell} \sum_{x \in C_k} \mathbb{1}\{\text{loop length}(x) = \ell\} \qquad (9)$$
wherein $t \in \{$H1, H2, H3, H4, L1, L2, L3, L4$\}$. Global scores are reported with a weighted average of the cluster-level purity scores, $P_{\mathrm{type}} = \frac{1}{N} \sum_{k} n_k\, p_k^{\mathrm{type}}$ and $P_{\mathrm{len}} = \frac{1}{N} \sum_{k} n_k\, p_k^{\mathrm{len}}$, wherein $n_k$ is the number of loops in the cluster and $N = \sum_{k} n_k$.
[000176] Across SAbDab, 1305 classification (CLS) token codebooks and 180 reference clusters are used. Without exposure to loop-type annotations, the clusters formed based on the classification (CLS) token t described herein are highly homogeneous, attaining a loop-type purity of $P_{\mathrm{type}}$ = 0.983 and a loop-length purity of $P_{\mathrm{len}}$ = 0.965. Visualization of the latent space in 2D with UMAP also shows localization of loops by loop type, length, and canonical cluster. Cluster purity is reported in Table 8.
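The purity scores defined in Equations (8) and (9) can be sketched with a single helper: each predicted cluster votes for its majority reference label, and the score is the fraction of all items that carry their cluster's majority label. Passing loop types (or loop lengths) as the reference labels gives the weighted global loop-type (or loop-length) purity. The function name `cluster_purity` and the toy assignments below are illustrative assumptions, not data from the evaluation.

```python
from collections import Counter, defaultdict


def cluster_purity(predicted, reference):
    """Purity = (1/N) * sum over predicted clusters of the majority-label count.

    predicted: one predicted cluster id per item (e.g., a codebook index).
    reference: one reference label per item (canonical cluster, loop type,
               or loop length, depending on which purity is wanted).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one item is required")
    members = defaultdict(list)
    for cluster_id, label in zip(predicted, reference):
        members[cluster_id].append(label)
    # most_common(1) returns [(label, count)] for the dominant reference label.
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in members.values())
    return correct / len(predicted)


# Toy example: six loops assigned to three hypothetical codebook entries.
predicted = [0, 0, 0, 1, 1, 2]
canonical = ["H1-13-1", "H1-13-1", "H2-9-1", "L2-8-4", "L2-8-4", "H3-10-1"]
loop_types = ["H1", "H1", "H2", "L2", "L2", "H3"]

purity = cluster_purity(predicted, canonical)
type_purity = cluster_purity(predicted, loop_types)
print(f"canonical purity = {purity:.3f}, loop-type purity = {type_purity:.3f}")
```

In the toy example, cluster 0 contains one loop whose label differs from the majority, so five of six items count as correctly assigned under both label sets.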
The results achieved with the classification (CLS) token t are comparable to those obtained using Position-Specific Scoring Matrices to predict canonical forms from sequence. These results highlight the ability to recover the known canonical clusters with high purity.
[000177] Table 8
CDR    Heavy   Light
CDR1   0.894   0.880
CDR2   0.900   0.975
CDR3   0.754   0.831
CDR4   0.983   0.930
[000178] The distribution of different complementarity determining region (CDR) loop structures across codebooks was also evaluated. The proportion of each loop in SAbDab assigned to the top 20 used codebooks is shown in FIG. 11 for each loop type. The H4 and L2 loop types have relatively low diversity, with 93.0% and 91.7% of loops assigned to a codebook in the top 20, respectively. Conversely, H3 has the lowest coverage in the top 20 codebooks, with 14.6% of loops. The most frequent H3-loop codebook entry appears 387 times. Every occurrence shares an identical loop sequence derived from single-chain Fv16 antibody structures, a scaffold that is widely represented in the PDB.
[000179] The performance of the classification (CLS) token t when incorporated in binding affinity prediction is evaluated using the curated antibody-antigen binding affinity dataset from Ab-BiBench. Antibody structures for the variants are predicted and used as inputs to obtain, from the tokenization model described herein, corresponding multimodal representations. Embeddings are then generated with a property computation model (e.g., the property computation model 132 in FIG. 1B) that incorporates the classification (CLS) token t of a complementarity determining region (CDR) loop inserted at the start of the loop sequence. Sequence-level embeddings are obtained by averaging residue-level embeddings. A separate model was trained for each antibody-antigen pair. Sequence-level embeddings are used as input to train a ridge regressor evaluated with 10-fold nested cross-validation.
Models are evaluated with the Spearman correlation coefficient, ρ, between the predicted and true binding affinity. The protein language models ESM C, ESM-2 (3B), AbLang2, and IgBert, and the multimodal protein language models SaProt and ProstT5, are used as baselines for performance comparison. Sequence-level embeddings are obtained by averaging the residue-level embeddings. For the multimodal protein language models, Foldseek 3Di tokens of the predicted structures were used.
[000180] Across the 10 antibody-antigen pairs in Table 9, the property computation model (PCM) incorporating the classification (CLS) token t surpasses the base model IgBert, from which it is derived, in 8 cases. It ranks first or second on 7 of the 10 pairs. The results also show that structure is not always beneficial, with the structure-only protein language model, ProstT5, which uses the Foldseek 3Di alphabet, and the sequence-structure protein language model, SaProt, not performing as well as sequence-only models. When the property computation model further incorporates the multimodal tokens of the individual amino acid residues in addition to the classification (CLS) token, a drop in performance is also observed, thus showing the benefit of incorporating the classification (CLS) token t alone when representing proteins that differ by a few mutations.
As protein language models improve with scale, it is notable that the property computation model described herein can be implemented as a 420M parameter model, which is more than 7× smaller than ESM-2, to achieve better performance on average across the 10 antibody-antigen pairs.
[000181] Table 9
Target      ESM C           ESM-2 (3B)      SaProt          ProstT5         AbLang2         IgBert          PCM
1mlc        0.609 (0.017)   0.551 (0.013)   0.557 (0.020)   0.280 (0.040)   0.634 (0.015)   0.665 (0.015)   0.616 (0.009)
1n8z        0.673 (0.022)   0.635 (0.019)   0.637 (0.028)   0.351 (0.057)   0.646 (0.021)   0.682 (0.023)   0.675 (0.025)
2fxg        0.809 (0.010)   0.752 (0.010)   0.754 (0.012)   0.355 (0.021)   0.752 (0.007)   0.694 (0.013)   0.713 (0.014)
3gbn_h1     0.901 (0.004)   0.953 (0.003)   0.915 (0.005)   0.638 (0.013)   0.945 (0.004)   0.947 (0.004)   0.948 (0.004)
3gbn_h9     0.932 (0.004)   0.971 (0.002)   0.952 (0.003)   0.679 (0.017)   0.963 (0.003)   0.961 (0.003)   0.962 (0.003)
4fqi_h1     0.871 (0.001)   0.955 (0.001)   0.866 (0.001)   0.593 (0.002)   0.883 (0.001)   0.898 (0.001)   0.921 (0.001)
4fqi_h3     0.936 (0.001)   0.973 (0.001)   0.958 (0.001)   0.644 (0.009)   0.969 (0.001)   0.970 (0.001)   0.971 (0.001)
aayl49      0.617 (0.010)   0.584 (0.013)   0.584 (0.012)   0.301 (0.014)   0.563 (0.010)   0.611 (0.010)   0.625 (0.010)
aayl49_ML   0.518 (0.008)   0.524 (0.008)   0.487 (0.009)   0.320 (0.009)   0.499 (0.007)   0.524 (0.007)   0.531 (0.007)
aayl51      0.576 (0.007)   0.516 (0.009)   0.524 (0.008)   0.260 (0.011)   0.527 (0.009)   0.566 (0.010)   0.579 (0.011)
[000182] The performance of multimodal representation guided generation of novel protein sequences was evaluated by analyzing whether the generated complementarity determining region (CDR) loops are consistent with the structure of the masked-out loops. For the CDR1, CDR2, and CDR3 of the heavy and light chain, 50 structures were randomly sampled from SAbDab. The classification (CLS) tokens t and the multimodal tokens x generated by the tokenization model from the dihedral angles and masked sequence are inputted to a language model trained to operate on both.
Loop sequences are sampled from the resulting amino acid likelihoods determined by the language model for each of these antibodies before the sampled sequences are folded into three-dimensional structures.
[000183] FIG. 12(a) shows the self-consistent root mean square distance (scRMSD) of the sampled complementarity determining region (CDR) loops stratified by sequence identity. As shown, the language model using multimodal structural representations excels at generating loops at different levels of sequence diversity while maintaining a similar structure, improving on both state-of-the-art antibody inverse folding models and the base model. As a particular example, the H3 loop of a SARS-CoV-2 neutralizing antibody from the PDB structure 7TCQ was redesigned. At a sampling temperature of 0.5, the language model sampled loops with an average edit distance of 6.6 from the loop of length 9 (FIG. 12(b)). The predicted structures of the sampled loops maintain the beta hairpin structure of the original loop with an average loop RMSD of 0.79 Å (FIG. 5(c)).
[000184] The generative performance of the language model incorporating multimodal representations was evaluated against state-of-the-art inverse folding models for antibodies: AbMPNN (a version of ProteinMPNN finetuned on antibody structures) and AntiFold (a version of ESM-IF1 finetuned on antibody structures). For these models, the loop sequence was generated given the full backbone and the sequence of the rest of the antibody. Comparisons were also made against the base model, IgBert, and a version of the same language model incorporating the classification (CLS) token t but not the multimodal tokens x of the individual amino acid residues.
[000185] For each loop and model, 10 sequences were sampled at each of the following sampling temperatures: 0.01, 0.05, 0.1, 0.2, 0.5, 1.0, and 2.0.
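The role of the sampling temperature can be illustrated with a minimal sketch. It assumes the common convention in which per-position amino acid probabilities are raised to the power 1/T and renormalized, and that masked positions are sampled independently; the description does not specify either detail, and the function names, the three-letter alphabet, and the probabilities below are hypothetical.

```python
import math
import random

# Sampling temperatures listed in the paragraph above.
TEMPERATURES = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]


def apply_temperature(probs, temperature):
    """Rescale a categorical distribution as p_i^(1/T), renormalized.

    Low T sharpens the distribution toward its most likely residue;
    high T flattens it and increases sequence diversity.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - top) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_loop(position_probs, alphabet, temperature, rng):
    """Sample one residue per masked position from temperature-scaled probabilities."""
    return "".join(
        rng.choices(alphabet, weights=apply_temperature(p, temperature), k=1)[0]
        for p in position_probs
    )


# Hypothetical likelihoods for a 3-residue loop over a toy alphabet.
alphabet = "AGS"
position_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
rng = random.Random(0)

for temperature in TEMPERATURES:
    loop = sample_loop(position_probs, alphabet, temperature, rng)
    print(f"T = {temperature}: {loop}")
```

At the lowest temperatures the sampler almost always returns the most likely residue at each position, which recapitulates the original loop, while higher temperatures yield sequences with lower identity to it.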
In total, for each model, 50 × 10 × 6 × 7 = 21,000 sequences were generated for the different structures, sequence samples, loop types, and temperatures, respectively. The generated loop regions were aligned with the original structure to evaluate the self-consistency (sc) RMSD between the two loop structures. Sampling sequences at different temperatures was necessary to generate sequences with different levels of sequence identity to the original loop, since recapitulating the original loop sequence would achieve low scRMSD but would not be useful for the design of new H3 loops.
[000186] FIG. 13 depicts a block diagram illustrating an example of a computing system 1300, in accordance with some example embodiments. Referring to FIGS. 1-13, the computing system 1300 may be used to implement the canonical conformation computation engine 110, the molecule design engine 120, the client device 140, and/or any components therein.
[000187] As shown in FIG. 13, the computing system 1300 can include a processor 1310, a memory 1320, a storage device 1330, and input/output devices 1340. The processor 1310, the memory 1320, the storage device 1330, and the input/output devices 1340 can be interconnected via a system bus 1350. The processor 1310 is capable of processing instructions for execution within the computing system 1300. Such executed instructions can implement one or more components of, for example, the canonical conformation computation engine 110, the molecule design engine 120, the client device 130, and/or the like. In some example embodiments, the processor 1310 can be a single-threaded processor. Alternately, the processor 1310 can be a multi-threaded processor.
The processor 1310 is capable of processing instructions stored in the memory 1320 and/or on the storage device 1330 to display graphical information for a user interface provided via the input/output device 1340.
[000188] The memory 1320 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 1300. The memory 1320 can store data structures representing configuration object databases, for example. The storage device 1330 is capable of providing persistent storage for the computing system 1300. The storage device 1330 can be a floppy disk device, a hard disk device, an optical disk device, or a tape device, or other suitable persistent storage means. The input/output device 1340 provides input/output operations for the computing system 1300. In some example embodiments, the input/output device 1340 includes a keyboard and/or pointing device. In various implementations, the input/output device 1340 includes a display unit for displaying graphical user interfaces.
[000189] According to some example embodiments, the input/output device 1340 can provide input/output operations for a network device. For example, the input/output device 1340 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
[000190] In some example embodiments, the computing system 1300 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various formats. Alternatively, the computing system 1300 can be used to execute any type of software applications.
These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities or can be standalone computing products and/or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input/output device 1340. The user interface can be generated and presented to a user by the computing system 1300 (e.g., on a computer screen monitor, etc.).
[000191] One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network.
The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[000192] These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium.
The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.
[000193] To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT), a liquid crystal display (LCD), a light emitting diode (LED) monitor, or an organic light emitting diode (OLED) monitor for displaying information to the user, and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
[000194] In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features.
For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and / or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and / or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” Use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.
[000195] The subject matter described herein can be embodied in systems, apparatus, methods, and / or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and / or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and / or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and / or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.
Claims
CLAIMS
What is claimed is:
1. A computer-implemented method, comprising:
identifying an input molecule, where the input molecule comprises an amino acid residue sequence;
generating, using a canonical conformation computation model, a canonical conformation signature for the input molecule, where the canonical conformation signature of the input molecule is based at least on the amino acid residue sequence of the input molecule or the canonical conformation signature of the input molecule is;
modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the canonical conformation signature of the input molecule; and
generating an output molecule using the modified amino acid residue sequence of the input molecule.
2. The method of claim 1, wherein the modifying the amino acid residue sequence of the input molecule is guided by the canonical conformation signature of the input molecule such that a conformation of the output molecule is similar to the conformation of the input molecule.
3. The method of claim 1, wherein the canonical conformation computation model is trained to generate the canonical conformation signature of the input molecule by at least determining, for each canonical conformation class in a set of canonical conformation classes, a probability of the input molecule exhibiting a canonical conformation associated with the canonical conformation class.
4. The method of claim 3, wherein the canonical conformation signature of the input molecule comprises a probability distribution across each canonical conformation class in the set of canonical conformation classes.
5. The method of claim 3, further comprising:
determining the set of canonical conformation classes based at least on a conformation of each known protein molecule of a plurality of known protein molecules.
6. The method of claim 5, wherein the set of canonical conformation classes is determined by at least clustering, into one or more clusters, a dihedral representation of the conformation of each known protein molecule, and
wherein each cluster of the one or more clusters corresponds to a canonical conformation class.
7. The method of claim 6, wherein each canonical conformation class includes two or more protein molecules whose conformations exhibit a threshold similarity.
8. The method of claim 3, wherein the set of canonical conformation classes include a first canonical conformation class associated with a heavy chain of an antibody and a second canonical conformation class associated with a light chain of the antibody.
9. The method of claim 8, wherein the amino acid residue sequence comprising the input molecule includes one or more tokens separating a plurality of amino acid residues forming the heavy chain of the antibody and a plurality of amino acid residues forming the light chain of the antibody.
10. The method of claim 1, further comprising:
applying a molecule design computation model to modify the amino acid residue sequence of the input molecule.
11. The method of claim 10, further comprising:
generating an intermediate molecule by applying the molecule design computation model to modify the amino acid residue sequence of the input molecule;
applying the canonical conformation computation model to determine, based at least on an amino acid residue sequence of the intermediate molecule, a canonical conformation signature of the intermediate molecule; and
identifying the intermediate molecule as the output molecule based at least on the canonical conformation signature of the intermediate molecule exhibiting a threshold similarity to the canonical conformation signature of the input molecule.
12. The method of claim 11, further comprising:
determining a similarity metric quantifying a similarity between the canonical conformation signature of the intermediate molecule and the canonical conformation signature of the input molecule.
13. The method of claim 12, wherein the similarity metric comprises a Kullback-Leibler (KL) divergence or a Wasserstein metric.
14. The method of claim 11, wherein the intermediate molecule is identified as the output molecule further based on the canonical conformation of the intermediate molecule exhibiting a greater similarity to the canonical conformation signature of the input molecule than a canonical conformation signature of another intermediate molecule.
15. The method of claim 11, wherein the intermediate molecule is identified as the output molecule further based on the intermediate molecule exhibiting one or more desirable properties.
16. The method of claim 1, wherein the amino acid residue sequence of the input molecule is modified by inserting, deleting, and / or changing an identity of one or more amino acid residues in the amino acid residue sequence.
17. The method of claim 1, wherein the input molecule comprises an antigen receptor.
18. The method of claim 1, wherein the input molecule comprises a complementarity determining region (CDR) of an antibody.
19. The method of claim 1, wherein the input molecule comprises a variable domain of a heavy chain of an antibody and / or a variable domain of a light chain of the antibody.
20. The method of claim 1, wherein the canonical conformation computation model has been trained based on a training dataset that includes, for each known protein molecule of a plurality of known protein molecules, an amino acid residue sequence of the known protein molecule and a ground-truth canonical conformation signature.
21. The method of claim 1, wherein the molecule design computation model comprises a diffusion model that has been trained to generate the output molecule by making incremental modifications to the amino acid residue sequence of the input molecule.
22. The method of claim 1, wherein the canonical conformation computation model comprises an ensemble of an N quantity of ensemble elements, and wherein each ensemble element includes a canonical conformation classifier trained to determine a canonical conformation signature of a different one of a complementarity determining region (CDR) in the input molecule.
23. The method of claim 22, wherein the canonical conformation signature of the input molecule is determined based on an output from every ensemble element of N quantity of ensemble elements.
24. A system, comprising:
at least one data processor; and
at least one memory storing instructions, which when executed by the at least one data processor, result in operations comprising the method of any of claims 1 to 23.
25. A non-transitory computer readable medium storing instructions, which when executed by the at least one data processor, result in operations comprising the method of any of claims 1 to 23.
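As an illustration of claims 3, 4, and 11 to 13, the following minimal sketch treats a canonical conformation signature as a probability distribution over canonical conformation classes and keeps an intermediate molecule only when its signature stays close to that of the input molecule. The function names, the example scores, the direction of the KL divergence, and the 0.05 threshold are all assumptions made for illustration; the claims do not fix any of them, and a Wasserstein metric could be substituted for the divergence.

```python
import math

def softmax(scores):
    """Turn raw per-class scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def select_outputs(input_signature, candidates, max_kl):
    """Keep candidates whose signature is within max_kl of the input signature.

    Returns (name, divergence) pairs sorted from most to least similar.
    """
    kept = []
    for name, signature in candidates.items():
        divergence = kl_divergence(input_signature, signature)
        if divergence <= max_kl:
            kept.append((name, divergence))
    return sorted(kept, key=lambda item: item[1])

# Hypothetical classifier scores over three canonical conformation classes.
input_signature = softmax([2.0, 0.5, -1.0])
candidates = {
    "variant_a": softmax([1.9, 0.6, -1.0]),  # nearly the same class profile
    "variant_b": softmax([-1.0, 0.5, 2.0]),  # mass moved to another class
}

# max_kl is an arbitrary illustrative threshold, not a value from the source.
accepted = select_outputs(input_signature, candidates, max_kl=0.05)
print(accepted)
```

Under these assumed scores only variant_a is retained, since variant_b shifts most of its probability to a different class. Sorting by divergence also gives a simple way to prefer one intermediate molecule over another, in the spirit of claim 14.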
26. A computer-implemented method, comprising:
identifying an input molecule, where the input molecule comprises an amino acid residue sequence;
generating, using a tokenization model, a multimodal representation of the input molecule,
where the multimodal representation of the input molecule includes, for each amino acid residue in the amino acid residue sequence of the input molecule, a multimodal token encoding an identity and one or more backbone dihedral angles of the amino acid residue, and
where the multimodal representation of the input molecule further includes a classification (CLS) token encoding a three-dimensional structure of the input molecule;
modifying the amino acid residue sequence of the input molecule, where the modifying the amino acid residue sequence is guided by the multimodal representation of the input molecule; and
generating an output molecule using the modified amino acid residue sequence of the input molecule.
27. The method of claim 26, wherein the modifying the amino acid residue sequence of the input molecule is guided by the multimodal representation of the input molecule such that a three-dimensional structure of the output molecule is similar to the three-dimensional structure of the input molecule.
28. The method of claim 26, wherein the modifying the amino acid residue sequence of the input molecule is guided by the classification (CLS) token encoding the three-dimensional structure of the input molecule.
29. The method of claim 26, wherein the amino acid residue sequence of the input molecule is modified such that the modified amino acid residue sequence is assigned a same classification (CLS) token.
30. The method of claim 26, wherein the multimodal token of each amino acid residue encodes a phi (φ) angle, a psi (ψ) angle, and an omega (ω) angle comprising a backbone of the amino acid residue.
31. The method of claim 26, wherein the multimodal token representation of the input molecule is generated by at least
for each amino acid residue in the amino acid residue sequence of the input molecule, combining the identity and the one or more backbone dihedral angles of the amino acid residue by element wise addition, and
encoding a combination of the identity and the one or more backbone dihedral angles to generate the multimodal token of the amino acid residue.
32. The method of claim 26, further comprising:
applying a property computation model to determine, based on at least a portion of the multimodal representation of the input molecule, one or more properties of the input molecule.
33. The method of claim 32, wherein the property computation model determines, based at least on the classification (CLS) token, the one or more properties of the input molecule.
34. The method of claim 33, wherein the property computation model determines the one or more properties of the input molecule further based on the multimodal token of each amino acid residue in the amino acid residue sequence of the input molecule.
35. The method of claim 26, further comprising:
training the tokenization model to generate the multimodal representation of the input molecule.
36. The method of claim 35, wherein the tokenization model is trained to reconstruct one or more masked amino acid residue identities and / or masked backbone dihedral angles in a plurality of training samples.
37. The method of claim 36, wherein the plurality of training samples include at least one training sample comprising a protein molecule in which an identity or a backbone dihedral angle of some of a plurality of amino acid residues in the protein molecule are masked.
38. The method of claim 36, wherein the plurality of training samples include at least one training sample comprising a protein molecule in which a backbone dihedral angle of some of a plurality of amino acid residues in the protein molecule are masked with an exception of an identity of any of the plurality of amino acid residues.
39. The method of claim 36, wherein the plurality of training samples include at least one training sample comprising a protein molecule in which an identity of a plurality of amino acid residues in the protein molecule are masked with an exception of a backbone dihedral angle of any of the plurality of amino acid residues.
40. The method of claim 36, wherein the plurality of training samples include at least one training sample comprising a protein molecule in which a backbone dihedral angle of a plurality of amino acid residues in the protein molecule are masked with an exception of an identity of any of the plurality of amino acid residues.
41. The method of claim 35, wherein the tokenization model is trained on a contrastive objective to generate similar multimodal representations for a positive pair of sample molecules having similar three-dimensional structures and dissimilar multimodal representations for a negative pair of sample molecules having dissimilar three-dimensional structures.
42. The method of claim 41, wherein the training the tokenization model includes identifying a pair of sample molecules as the positive pair based at least on (i) the pair of sample molecules containing a same quantity of amino acid residues, and (ii) a dihedral angle distance between a three-dimensional structure of each sample molecule satisfying one or more thresholds.
43. The method of claim 42, wherein the training the tokenization model includes identifying the pair of sample molecules as the negative pair based at least on at least one of (i) the pair of sample molecules containing different quantities of amino acid residues, and (ii) the dihedral angle distance between the three-dimensional structure of each sample molecule satisfying one or more different thresholds.
44. The method of claim 43, wherein the training the tokenization model includes identifying the pair of sample molecules as neither the positive pair nor the negative pair based at least on the dihedral angle distance between the three-dimensional structure of each sample molecule failing to satisfy the one or more thresholds or the one or more different thresholds, and
excluding the pair of sample molecules from the training the tokenization model.
45. The method of claim 35, wherein the tokenization model is trained to determine a codebook including a plurality of discrete classification (CLS) tokens.
46. The method of claim 26, wherein the input molecule comprises an antigen receptor, a variable domain, or a complementarity determining region (CDR).
47. The method of claim 26, wherein the input molecule includes a plurality of complementarity determining regions (CDRs) of an antibody, and wherein the multimodal representation of the input molecule is generated to include, for each complementarity determining region (CDR) of the plurality of complementarity determining regions (CDRs), a corresponding classification (CLS) token encoding a three-dimensional structure of the complementarity determining region (CDR).
48. The method of claim 26, wherein the tokenization model includes one or more transformer blocks.
49. A system, comprising:
at least one data processor; and
at least one memory storing instructions, which when executed by the at least one data processor, result in operations comprising the method of any of claims 26 to 48.
50. A non-transitory computer readable medium storing instructions, which when executed by the at least one data processor, result in operations comprising the method of any of claims 26 to 48.
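As an illustration of the pair selection recited in claims 42 to 44, the following minimal sketch labels a pair of sample molecules as positive, negative, or excluded before contrastive training. The distance function (mean of 1 - cos of the angle difference over phi and psi, in radians), the two threshold values, and all names are assumptions chosen for illustration; the claims do not specify how the dihedral angle distance or the thresholds are defined.

```python
import math

def dihedral_distance(angles_a, angles_b):
    """Mean of 1 - cos(delta) over all backbone angles; 0 means identical.

    Each argument is a list of per-residue angle tuples in radians.
    This particular distance is an assumption, not taken from the source.
    """
    terms = [
        1.0 - math.cos(a - b)
        for res_a, res_b in zip(angles_a, angles_b)
        for a, b in zip(res_a, res_b)
    ]
    return sum(terms) / len(terms)

def label_pair(angles_a, angles_b, pos_max=0.1, neg_min=0.5):
    """Return 'positive', 'negative', or None (pair excluded from training).

    pos_max and neg_min are hypothetical thresholds for illustration.
    """
    if len(angles_a) != len(angles_b):
        return "negative"  # different quantities of residues
    distance = dihedral_distance(angles_a, angles_b)
    if distance <= pos_max:
        return "positive"  # same length and similar backbone geometry
    if distance >= neg_min:
        return "negative"  # same length but clearly different geometry
    return None  # ambiguous band: neither positive nor negative

# Hypothetical (phi, psi) angles for short loops of three residues.
loop_a = [(-1.0, 2.4), (-1.2, 2.3), (-1.1, -0.7)]
loop_b = [(-1.05, 2.35), (-1.2, 2.4), (-1.0, -0.8)]  # small perturbation
loop_c = [(1.0, -0.7), (1.2, -0.8), (1.1, 2.4)]      # very different angles
loop_d = [(-0.3, 3.1), (-0.5, 3.0), (-0.4, 0.0)]     # moderately different
loop_e = [(-1.0, 2.4)]                               # different length

for name, other in [("b", loop_b), ("c", loop_c), ("d", loop_d), ("e", loop_e)]:
    print(name, label_pair(loop_a, other))
```

Leaving a gap between the two thresholds is what produces the excluded band of claim 44: pairs that are neither clearly similar nor clearly dissimilar contribute no signal to the contrastive objective.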