Systems and methods for determining and engineering PAM specificity of CAS proteins
The Protein2PAM model addresses the challenge of reduced specificity and activity in engineered Cas proteins by predicting PAM sequences, improving genome-editing efficiency and precision through customized PAM recognition.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- PROFLUENT BIO INC
- Filing Date
- 2025-12-26
- Publication Date
- 2026-07-02
AI Technical Summary
Existing methods for engineering Cas proteins with altered PAM recognition suffer from reduced specificity and activity, limiting their clinical utility in precise genome editing and personalized medicine applications.
A machine learning model, Protein2PAM, is used to predict PAM sequences for Cas proteins, enabling the generation of engineered Cas proteins with customized PAMs by iteratively generating and evaluating amino acid variants, leveraging structural and sequence information to maintain specificity and activity.
The model accurately predicts PAMs for diverse CRISPR systems, enhancing genome-editing efficiency and precision while achieving allelic specificity, facilitating the development of tailored Cas proteins for therapeutic targets.
Abstract
Description
[0001] PROF 43950.601
[0002] SYSTEMS AND METHODS FOR DETERMINING AND ENGINEERING PAM SPECIFICITY OF CAS PROTEINS
[0003] FIELD
[0004] Provided herein are Cas proteins with engineered PAM specificity and systems and methods for determining and engineering PAM specificity of Cas proteins.
[0005] CROSS-REFERENCE TO RELATED APPLICATIONS
[0006] This application claims the benefit of U.S. Provisional Application No. 63/739,011, filed December 26, 2024, the content of which is herein incorporated by reference in its entirety.
SEQUENCE LISTING STATEMENT
[0007] The content of the electronic sequence listing titled PROF_43950_601_SequenceListing.xml (Size: 61,276 bytes; and Date of Creation: December 19, 2025) is herein incorporated by reference in its entirety.
[0008] BACKGROUND
[0009] In natural CRISPR systems, Protospacer Adjacent Motif (PAM) recognition enables Cas enzymes to distinguish between bacterial 'self' DNA and invading 'non-self' DNA. As a result of bacteria-phage coevolution, Cas proteins have diversified into subtypes capable of recognizing a broad range of PAM sequences. During CRISPR interference, PAM binding is essential for initiating DNA unwinding, R-loop formation, and facilitating an efficient target-search process. In genome editing, the PAM mediates specificity but also restricts target availability, posing a challenge for applications where precise Cas positioning is vital, such as base editing, homology-directed repair, or allele-specific cleavage for gene disruption. In the latter scenario, the stringent requirements for PAM recognition, compared to the more flexible gRNA-target pairing, enable the PAM to be used for single-nucleotide allele discrimination and precise targeting of ‘dominant-negative’ disease-associated mutations.
[0010] A variety of approaches have been developed to engineer Cas proteins with altered or relaxed PAM recognition, each utilizing distinct techniques to modify protein-PAM interactions. Structure-based approaches have been effective, such as mutating key PAM-interacting residues of SpCas9 to broaden or shift its PAM recognition. Experimental evolution methods, such as phage-assisted continuous evolution (PACE), have also been employed to broaden PAM specificity. Additionally, near-PAMless nucleases have been developed by engineering
[0011] conformational flexibility into the PAM-interacting domain. However, many of these PAM-relaxed or engineered enzymes exhibit reduced specificity and lower activity, limiting their clinical utility. Thus, there is still a need for a robust and facile method to generate engineered Cas proteins with customized PAMs for specific therapeutic targets and scalable personalized medicine approaches, while retaining suitable specificity and activity.
[0012] SUMMARY
[0013] Provided herein are methods for predicting a Protospacer Adjacent Motif (PAM) sequence for a Cas protein. In some embodiments, the methods comprise determining the probability distribution for each nucleotide at two or more PAM positions for a Cas protein of interest using a machine learning model trained on a database of Cas protein sequences and corresponding PAM sequences.
[0014] In some embodiments, structural information of the Cas protein of interest is not input into the machine learning model. In some embodiments, the machine learning model comprises a trained transformer encoder and a 2-layer multi-layer perceptron (MLP) head.
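By way of non-limiting illustration, the output stage described above can be sketched as a 2-layer MLP head that maps a pooled encoder embedding to one probability distribution over A, C, G, and T per PAM position. Everything in the sketch is an assumption made for illustration: the toy dimensions, the random weights, the ReLU activation, and the use of a single pooled embedding vector as a stand-in for the transformer encoder output. It is not the trained model.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer: one weight row and one bias term per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def make_layer(n_in, n_out, rng):
    """Randomly initialized layer (illustrative, untrained)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_head(embedding, layer1, layer2, n_positions):
    """Two dense layers, then a softmax over 4 nucleotides at each PAM position."""
    hidden = [max(0.0, h) for h in linear(embedding, *layer1)]  # ReLU is an assumption
    logits = linear(hidden, *layer2)
    return [softmax(logits[4 * i: 4 * i + 4]) for i in range(n_positions)]


rng = random.Random(0)
EMBED_DIM, HIDDEN_DIM, N_POSITIONS = 16, 8, 10  # toy sizes, not the real model's
layer1 = make_layer(EMBED_DIM, HIDDEN_DIM, rng)
layer2 = make_layer(HIDDEN_DIM, 4 * N_POSITIONS, rng)
embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]  # stand-in for encoder output
pam_probs = mlp_head(embedding, layer1, layer2, N_POSITIONS)
```

Each row of `pam_probs` is a distribution over `NUCLEOTIDES` for one PAM position, which is the form of output the prediction methods above operate on.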
[0015] In some embodiments, Cas protein sequences comprise PAM-interacting domain (PID) sequences. In some embodiments, the machine learning model is weighted with sequences similar to the sequence of the Cas protein of interest.
[0016] In some embodiments, the methods further comprise generating one or more variants of the Cas protein of interest and predicting the PAM sequence for each of the one or more variants. In some embodiments, the one or more variants comprise an amino acid sequence having one or more amino acid substitutions, deletions, or additions as compared to the Cas protein of interest amino acid sequence. In some embodiments, the one or more variant sequences are generated iteratively over multiple rounds of generating and predicting, wherein predicted PAMs from each round are leveraged to target a desired PAM sequence.
[0017] In some embodiments, the methods comprise distinguishing high-confidence and low-confidence PAM predictions. In some embodiments, the distinguishing comprises quantifying confidence based on protein language model (pLM) embeddings and sequence identity as compared to sequences used to train the machine learning model.
[0018] In some embodiments, the methods further comprise identifying PAM-specifying mutations (PSMs) in the Cas protein of interest. In some embodiments, the methods further comprise predicting PAM-specifying mutations in the Cas protein of interest.
[0019] Also provided are methods for training a machine learning model for predicting a Protospacer Adjacent Motif (PAM) sequence for a Cas protein. In some embodiments, the methods comprise training a machine learning model with input training data comprising full sequences and / or PAM-interacting domain (PID) sequences of a plurality of Cas proteins and their known PAM sequences. In some embodiments, the Cas proteins are natural Cas proteins. In some embodiments, the Cas proteins are engineered Cas proteins. In some embodiments, the plurality of Cas proteins comprises Cas proteins from a single type of CRISPR system. In some embodiments, the plurality of Cas proteins comprises Cas proteins from two or more types of CRISPR systems.
[0020] In some embodiments, the PAM for any of the Cas proteins or PID sequences was known, determined experimentally, or modeled. In some embodiments, the input training data is from high-throughput screening of Cas protein variants and their related PAMs.
[0021] In some embodiments, the input training data is upweighted to those sequences similar to a Cas protein of interest. In some embodiments, the sequences similar to a Cas protein of interest are from homologs (e.g., orthologs, paralogs) of a Cas protein of interest.
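As a purely illustrative sketch of upweighting, each training sequence may be given a sampling weight that grows with its similarity to the Cas protein of interest. The weighting formula, the `boost` parameter, and the use of simple per-position identity on pre-aligned, equal-length sequences are all assumptions introduced here; the disclosure does not specify how the upweighting is computed.

```python
def sequence_identity(seq_a, seq_b):
    """Fraction of matching positions; assumes pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def training_weights(training_seqs, protein_of_interest, boost=4.0):
    """Normalized sampling weights; `boost` is a hypothetical tuning parameter."""
    raw = [1.0 + boost * sequence_identity(s, protein_of_interest) for s in training_seqs]
    total = sum(raw)
    return [w / total for w in raw]


# Toy sequences for illustration only.
weights = training_weights(["MKHQ", "MKAA", "AAAA"], "MKHQ")
```

In this toy case the identical sequence receives the largest weight and the unrelated sequence the smallest, so close homologs would be sampled more often during training.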
[0022] In some embodiments, the input training data does not comprise structural information for the plurality of Cas proteins. In some embodiments, the input training data comprises structural information for the plurality of Cas proteins.
[0023] Further provided are non-transitory computer readable media containing instructions that, when executed by at least one processor, perform the methods disclosed herein and systems comprising the at least one non-transitory computer readable medium and at least one processor.
[0024] Additionally provided are engineered Cas proteins having an amino acid sequence with at least 70% identity to SEQ ID NO: 1. In some embodiments, the engineered Cas protein comprises one or more amino acid substitution, deletion, and / or addition as compared to SEQ ID NO: 1, wherein the one or more amino acid substitution, deletion, and / or addition modify Protospacer Adjacent Motif (PAM) recognition and / or specificity. In some embodiments, the engineered Cas protein comprises one or more amino acid substitutions, deletions, or additions in the PAM-interacting domain.
[0025] In some embodiments, the engineered Cas protein comprises one or more amino acid substitutions at positions: 938, 940, 954, 956, 957, 960, 966, 979, 980, 981, 982, 987, 989, 990, 995, 996, 1000, 1003, 1007, 1008, 1010, 1011, 1014, 1015, 1016, 1017, 1018, 1020, 1021, 1022,
[0026] 1023, 1024, 1026, 1027, 1029, 1030, 1031, 1032, 1034, 1037, 1038, 1039, 1040, 1041, 1044, 1046, 1048, 1050, 1053, 1055, and 1056, relative to SEQ ID NO: 1. In some embodiments, the engineered Cas protein comprises one or more amino acid substitutions selected from: H938D or H938G; G940A; E954C; G956A; D957G; Y960H; S966A; V979I or V979C; V980K or V980I; Q981A, Q981T, Q981R, or Q981G; G982F or G982Y; D987N; Q989T; L990V; F995Y; N996Q, N996E, or N996K; S1000V; P1003S; V1007I; E1008K; I1010K or I1010T; T1011A; A1014N, A1014D, or A1014K; R1015S; M1016F, M1016K, M1016R, or M1016I; F1017L; G1018A; F1020Y; A1021S, A1021V, or A1021I; S1022G or S1022N; C1023L or C1023F; H1024D or H1024E; G1026E, G1026S, or G1026A; T1027N or T1027D; N1029A, N1029G, N1029H, or N1029R; I1030F; N1031S or N1031D; I1032L; I1034T; L1037K; D1038E; H1039K or H1039N; K1040T or K1040S; I1041K or I1041V; N1044D; I1046V; E1048R or E1048Q;
[0027] I1050V; K1053Q; A1055L; and L1056V, relative to SEQ ID NO: 1.
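For illustration, the substitution notation used above (wild-type residue, 1-based position, replacement residue, as in H938D) can be parsed and applied to a sequence as sketched below. The helper names and the short toy sequence are hypothetical; the toy sequence is not SEQ ID NO: 1.

```python
import re

# One-letter wild-type residue, 1-based position, one-letter replacement residue.
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(code):
    """Split a code such as 'H938D' into ('H', 938, 'D')."""
    match = SUBSTITUTION.match(code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    wild_type, position, replacement = match.groups()
    return wild_type, int(position), replacement


def apply_substitutions(sequence, codes):
    """Return a variant sequence, checking each wild-type residue first."""
    residues = list(sequence)
    for code in codes:
        wild_type, position, replacement = parse_substitution(code)
        index = position - 1  # convert 1-based position to 0-based index
        if residues[index] != wild_type:
            raise ValueError(f"{code}: expected {wild_type}, found {residues[index]}")
        residues[index] = replacement
    return "".join(residues)


toy_reference = "MKHQG"  # toy sequence for illustration only
variant = apply_substitutions(toy_reference, ["H3D", "Q4A"])
```

Checking the wild-type residue before substituting guards against numbering a variant relative to the wrong reference sequence.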
[0028] In some embodiments, the engineered Cas protein comprises one or more amino acid substitutions at positions: 981, 1024, and 1029, relative to SEQ ID NO: 1. In some embodiments, the engineered Cas protein comprises one or more amino acid substitutions selected from:
[0029] Q981A; H1024D or H1024E; and N1029A or N1029G, relative to SEQ ID NO: 1.
[0030] In some embodiments, the engineered Cas protein further comprises one or more amino acid substitutions at positions: 957, 982, 1026, and 1048. In some embodiments, the engineered Cas protein further comprises one or more amino acid substitutions selected from: D957G;
[0031] G982F or G982Y; G1026S or G1026A; and E1048R or E1048Q.
[0032] In some embodiments, the engineered Cas protein comprises an amino acid sequence with at least 70% identity to any of SEQ ID NOs: 2-31. In some embodiments, the engineered Cas protein comprises an amino acid sequence with at least 90% identity to any of SEQ ID NOs: 2-31. In some embodiments, the engineered Cas protein comprises any of SEQ ID NOs: 2-31.
[0033] Also provided are fusion proteins comprising an engineered Cas protein as described herein and one or more effector domains, nucleic acids encoding the engineered Cas protein or fusion proteins as described herein, and systems comprising the engineered Cas protein or fusion proteins as described herein and at least one guide RNA, or one or more nucleic acids encoding thereof.
[0034] In addition, methods of modifying a target nucleic acid are provided herein. In some embodiments, the methods comprise contacting the target nucleic acid with an engineered Cas
[0035] protein and / or fusion protein as described herein, and at least one guide RNA, or one or more nucleic acids encoding thereof.
[0036] Other aspects and embodiments of the disclosure will be apparent in light of the following detailed description.
[0037] BRIEF DESCRIPTION OF FIGURES
FIGS. 1A-1E. Landscape of PAM diversity across CRISPR-Cas systems. FIG. 1A is an exemplary pipeline for in silico characterization of PAMs for diverse CRISPR-Cas systems. Cas proteins responsible for PAM recognition during target interference are shown. Cas9 and Cas12 are single protein effectors while Cas8 is part of the multi-subunit Cascade complex. FIG. 1B shows the fraction of CRISPR-Cas operons associated with a PAM prediction. FIG. 1C shows accumulation curves of PAM diversity with increasing data volume. Discovery of new PAMs has largely plateaued for Type I and II systems. FIG. 1D shows PAM similarity compared between Cas proteins with different levels of relatedness. PAM similarity rapidly diverges for Type II systems but is highly conserved for Types I and V. FIG. 1E is a phylogenetic tree of Cas9 proteins clustered at 70% identity using MMseqs2. Outer rings indicate the information content at each of the first 9 PAM positions. The phylogenetic tree was built using FastTree and visualized using iToL.
[0038] FIGS. 2A-2E. A machine learning framework to predict CRISPR-Cas PAMs. FIG. 2A is the Protein2PAM model architecture consisting of a pre-trained 650-million-parameter transformer encoder, followed by a 2-layer multi-layer perceptron (MLP) head responsible for predicting PAM nucleotide probabilities. FIG. 2B is the architecture of the model for quantifying Protein2PAM’s confidence in its own predictions, incorporating both protein language model (pLM) embeddings and distance to training sequences. FIG. 2C shows PAM prediction accuracy for protein-PAM pairs held back from the CRISPR-Cas Atlas training dataset. FIG. 2D shows prediction accuracy for Cas proteins with experimentally characterized PAMs. FIG. 2E shows PAM prediction accuracy for 79 diverse Cas9 orthologs experimentally characterized in Gasiunas et al. (Nat Commun. 2020;11:5512). Representative examples are indicated below the barplot. In all panels, PAM prediction accuracy was measured using cosine similarity.
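As a minimal sketch of the accuracy metric named above, cosine similarity between a predicted PAM and a reference PAM can be computed by flattening each position-by-nucleotide matrix into a single vector. Flattening the whole matrix, and comparing probabilities rather than information-content values, are assumptions made here; the disclosure states only that cosine similarity was used.

```python
import math


def cosine_similarity(pam_a, pam_b):
    """Cosine similarity between two PAM matrices (rows: positions, columns: A, C, G, T)."""
    flat_a = [value for position in pam_a for value in position]
    flat_b = [value for position in pam_b for value in position]
    if len(flat_a) != len(flat_b):
        raise ValueError("PAM matrices must have the same shape")
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm = math.sqrt(sum(a * a for a in flat_a)) * math.sqrt(sum(b * b for b in flat_b))
    return dot / norm


# Toy two-position PAMs for illustration only.
predicted = [[0.05, 0.05, 0.85, 0.05], [0.10, 0.10, 0.70, 0.10]]
reference = [[0.00, 0.00, 1.00, 0.00], [0.00, 0.00, 1.00, 0.00]]
accuracy = cosine_similarity(predicted, reference)
```

A value near 1 indicates that the two PAMs place their probability mass on the same nucleotides at the same positions.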
[0039] FIGS. 3A-3G. Protein2PAM rapidly and sensitively predicts PAMs for natural CRISPR-Cas systems. FIG. 3A shows an exemplary workflow in which CCtyper was used to identify Cas operons from newly sequenced genomes and metagenomes. Individual CRISPR arrays were used to predict PAMs with PAMpredict while individual Cas proteins were used to predict PAMs with
[0040] Protein2PAM. High-confidence PAMpredict predictions were determined from at least 10 mapped CRISPR spacers while high-confidence Protein2PAM predictions were determined from model confidence scores >0.80. FIG. 3B shows the count of CRISPR-Cas operons and isolated Cas proteins (Cas8, Cas9, Cas12) identified by CCtyper. FIG. 3C shows the fraction of CRISPR-Cas operons with a high-confidence PAM prediction using PAMpredict and Protein2PAM. FIG. 3D shows the total number of high-confidence PAM predictions across methods for CRISPR-Cas operons and isolated Cas proteins, respectively. In FIGS. 3E-3G, PAM logos were compared between Protein2PAM and PAMpredict using the cosine similarity metric. FIG. 3E shows PAMpredict predictions are highly concordant with Protein2PAM when PAMpredict predictions are determined from at least 10-15 mapped spacers. FIG. 3F shows Protein2PAM predictions with a confidence score >0.80 are highly concordant with PAMpredict. FIG. 3G shows a comparison of running times between Protein2PAM and PAMpredict.
[0041] FIGS. 4A-4G. In silico mutagenesis pinpoints protein-PAM interactions. FIG. 4A is a schematic showing Protein2PAM models were used to predict the effect of millions of single amino acid variants introduced to diverse Cas proteins. PAM-specifying amino acid mutations (PSMs) were predicted for mutations that changed the PAM by at least 0.5 bits at one or more PAM positions. FIG. 4B is bar plots summarizing PSMs. While most mutations have no effect, most Cas9 proteins have a predicted PSM. Nearly all PSMs are located in the PI domain and are identified more sensitively using the PID-only Protein2PAM model. Substitutions are more effective than insertions and deletions at changing the PAM. Double amino acid variants expand the number of PSMs for Nme1Cas9. FIG. 4C, top, is a volcano plot showing amino acids enriched at PSMs compared to their background distributions. Glutamine and arginine are notably overrepresented at PSMs. FIG. 4C, bottom, is a volcano plot displaying the bias in PSM amino acid occurrences in AT-rich versus GC-rich contexts. Glutamine PSMs preferentially interact with AT nucleotides while arginine PSMs preferentially interact with GC nucleotides. FIG. 4D is scatterplots depicting the distribution of mutational effects across eight Cas9 and Cas12 proteins. The y-axis indicates the maximum change in the PAM across all 20 mutations at each position. Shaded regions indicate PAM-interacting domains. FIG. 4E is a scatterplot indicating the cumulative effect of single amino acid substitutions for different Cas proteins. Several proteins are predicted to be highly engineerable by Protein2PAM, including AceCas9 and Nme1Cas9. FIG. 4F shows the distribution of PSMs across amino acid positions for
[0042] Nme1Cas9. FIG. 4G shows the protein structure of Nme1Cas9 superimposed with model predictions. Each amino acid position is colored by the maximum change in the PAM across all 20 mutations at the given position. Residues harboring PSMs are colored red, are located in the PI domain, and make hydrogen bonds with PAM DNA. Protein structure was visualized using PyMOL.
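The PSM criterion described for FIG. 4A (a change of at least 0.5 bits at one or more PAM positions) can be sketched as follows. The 0.5-bit threshold comes from the description above; measuring the change at a position as the L1 distance between sequence-logo letter heights (probability multiplied by information content) is an assumption, as are all function names.

```python
import math


def information_content(probs):
    """Information content in bits for one DNA position: 2 minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 - entropy


def logo_heights(probs):
    """Per-nucleotide letter heights in bits, as drawn in a sequence logo."""
    ic = information_content(probs)
    return [p * ic for p in probs]


def position_change_bits(wt_probs, mut_probs):
    """Assumed measure of change: L1 distance between logo heights at one position."""
    return sum(abs(a - b) for a, b in zip(logo_heights(wt_probs), logo_heights(mut_probs)))


def is_psm(wt_pam, mut_pam, threshold_bits=0.5):
    """True if any PAM position changes by at least the threshold."""
    return any(position_change_bits(w, m) >= threshold_bits for w, m in zip(wt_pam, mut_pam))


# Toy PAMs (columns: A, C, G, T): the mutant flips position 2 from A to G.
wild_type_pam = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
mutant_pam = [[0.25, 0.25, 0.25, 0.25], [0.01, 0.01, 0.97, 0.01]]
flagged = is_psm(wild_type_pam, mutant_pam)
```

Repeating this check for every single amino acid variant of a protein corresponds to the in silico saturation mutagenesis summarized in FIGS. 4B-4F.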
[0043] FIGS. 5A-5D. Phylogenetic distribution of PAMs from the CRISPR-Cas Atlas. FIG. 5A shows phylogenetic trees built for Cas8, Cas9, and Cas12 proteins. Proteins were first clustered using MMseqs2 at 70% identity for Cas8 and Cas9 and at 95% identity for Cas12. Phylogenetic trees were built using FastTree and visualized using iToL. Colored strips indicate the information content at PAM positions. FIG. 5B shows the distribution of high-information content positions across PAMs from Type I, II, and V systems. In Type I systems, the PAM is predominantly restricted to positions -1 to -3 relative to the protospacer, while in Type II systems, the distribution of high information content PAM positions is more variable. FIG. 5C shows the distribution of the number of spacers aligned to virus and plasmid genomes for each CRISPR-Cas cluster. FIG. 5D shows the signal-to-noise ratio comparing nucleotide conservation upstream and downstream of the protospacer. In Type II systems, a downstream motif is expected, while in Type I and V systems, the motif is upstream. PAM predictions are based on a high number of aligned CRISPR spacers, resulting in strong signal-to-noise ratios.
[0044] FIG. 6. Loss curves for training Protein2PAM models. Each model predicts nucleotide distributions at 10 PAM positions based on Cas protein sequences. The architecture integrates a pre-trained 650M-parameter ESM-2 transformer encoder and a 2-layer MLP head. The Type I and V models very quickly converge to their minimum loss, while the Type II model takes much longer to optimize. These training dynamics mirror the cross-validation results in FIG. 2C and show that it is much more challenging to accurately predict PAMs for Type II systems than it is for Type I or Type V systems.
[0045] FIGS. 7A-7B. Cross-validation accuracy for Protein2PAM models. FIG. 7A shows PAM prediction accuracy. Each panel indicates the cosine similarity between true and predicted PAMs as a function of distance from the training data. Neural models consistently outperform a baseline in which a sequence is assigned the PAM of the nearest neighbor in the training dataset. PID-only models outperform full-sequence models for Cas9. FIG. 7B shows confidence prediction accuracy. True positive rate, true negative rate, and balanced accuracy when
[0046] determining if a PAM prediction result is high confidence or not. High confidence predictions are defined as those with accuracy above 0.80, while accuracy is defined as the cosine similarity between predicted and true PAM. Especially for Type II systems, the confidence model accurately discriminates between accurate and inaccurate Protein2PAM predictions.
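The confidence evaluation described for FIG. 7B can be sketched as below. The 0.80 cutoffs follow the description (a prediction is truly high confidence when its cosine-similarity accuracy exceeds 0.80, and is called high confidence when its confidence score exceeds 0.80); the function name and the toy values are hypothetical and are not results from the disclosure.

```python
def confidence_metrics(confidence_scores, accuracies, threshold=0.80):
    """Return (true positive rate, true negative rate, balanced accuracy)."""
    tp = fp = tn = fn = 0
    for confidence, accuracy in zip(confidence_scores, accuracies):
        called_high = confidence > threshold  # confidence model's call
        truly_high = accuracy > threshold     # ground truth from measured accuracy
        if called_high and truly_high:
            tp += 1
        elif called_high and not truly_high:
            fp += 1
        elif not called_high and truly_high:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    tnr = tn / (tn + fp) if (tn + fp) else float("nan")
    return tpr, tnr, (tpr + tnr) / 2.0


# Toy values for illustration only.
scores = [0.95, 0.90, 0.40, 0.85, 0.30]
measured = [0.99, 0.70, 0.50, 0.92, 0.88]
tpr, tnr, balanced = confidence_metrics(scores, measured)
```

Balanced accuracy averages the two rates, so it is not inflated when accurate predictions greatly outnumber inaccurate ones.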
[0047] FIG. 8. Concordance with experimentally determined PAMs for diverse Type I CRISPR systems. Wimmer et al. (Mol Cell. 2022;82:1210-1224.e6) characterized PAMs of diverse Type I systems using a rapid cell-free protocol, PAM-DETECT. Top panels show nucleotide-enrichment plots from Wimmer et al. for selected Type I systems subjected to PAM-DETECT. Bottom panels show Protein2PAM predictions for the corresponding Cas8 orthologs. For two Type I-E systems, the cell-free assay failed to identify a PAM due to low binding affinity, whereas Protein2PAM was able to confidently predict both PAMs as AAG.
[0048] FIGS. 9A-9F. Concordance with experimentally determined PAMs for diverse Type V CRISPR systems. Protein2PAM was applied to 45 Cas12 proteins from 12 different published studies. Protein2PAM predictions were compared to the experimentally determined PAMs using the cosine similarity metric. Figure panels indicate Protein2PAM predictions for 27 of 45 Cas12 proteins. FIG. 9A shows the evaluation of PAM predictions for Cas12f proteins. FIG. 9B shows the evaluation of PAM predictions for Cas12a proteins. FIG. 9C shows the evaluation of a PAM prediction for Cas12j proteins. FIG. 9D shows the evaluation of PAM predictions for Cas12b proteins. FIG. 9E shows the evaluation of PAM predictions for Cas12k proteins. FIG. 9F shows the evaluation of PAM predictions for Cas12i and Cas12h proteins.
[0049] FIG. 10. Protein language model scores for Nme1 (Neisseria meningitidis) mutants. Progen2 fine-tuned on the CRISPR-Cas Atlas was applied to different types of single mutants (left) or mutants with up to five substitutions (right). The language model predicts a decrease in fitness with mutational load and with indels.
[0050] FIG. 11. PAM-specifying amino acids make specific hydrogen bonds with PAM nucleotides. Four Cas9 crystal structures are shown. Positions are colored by the maximum predicted change to the PAM, in bits, after in silico saturation mutagenesis and evaluation with Protein2PAM. Bar plots show the top-ranked positions, which result in the greatest predicted change in the PAM after in silico saturation mutagenesis. Positions highlighted in red make hydrogen bonds with specific PAM nucleotides in the corresponding crystal structures.
[0051] FIGS. 12A-12C. Structure-based PAM prediction using ProseLM. Experimentally validated Nme1 Cas9 orthologs from Wei et al. (Elife, 11, August 2022.) were structurally modeled using AlphaFold2 as implemented in ColabFold and aligned to the catalytic-state Nme1 PID crystal structure (PDB 6JDV). Orthologs with poor structural alignment in the PI domain (RMSD > 4 Å or pLDDT < 80) were excluded, yielding 16 Cas9s. For each, structural models of the PID bound to all 16 possible N4NNTT PAM variants were generated by mutating PAM positions 5 to 6 in PyMOL and aligning the predicted PID to the Nme1 PID-DNA complex. ProseLM (Jeffrey A. Ruffolo, et al. Adapting protein language models for structure-conditioned design. bioRxiv, 2024.), a structure-conditioned protein language model, was used to score each PID-PAM complex. PAM preferences were inferred either by converting log-likelihoods to information-content logos or by selecting the highest-scoring PAM. FIG. 12A is a histogram of the maximum information content at each PAM position across all orthologs. ProseLM assigned similar likelihoods to all 16 PAMs, yielding uniformly low information content. FIG. 12B is representative PAM logos for four orthologs derived from ProseLM log-likelihoods. FIG. 12C is a table comparing the highest-scoring predicted PAM with the experimentally determined consensus for each ortholog. ProseLM recovered the correct PAM in only 4 of 16 cases.
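One way to convert 16 dinucleotide log-likelihoods into per-position information content is sketched below. The normalization (a softmax over the 16 scores followed by marginalization onto each of the two varied positions) is an assumption; the description does not state how the conversion was performed. The uniform input is a toy case chosen to show why near-identical likelihoods yield low information content.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def dinucleotide_probabilities(log_likelihoods):
    """Softmax over the 16 dinucleotide log-likelihoods (assumed normalization)."""
    peak = max(log_likelihoods.values())
    weights = {pair: math.exp(score - peak) for pair, score in log_likelihoods.items()}
    total = sum(weights.values())
    return {pair: weight / total for pair, weight in weights.items()}


def marginals(joint):
    """Collapse the joint distribution onto each of the two PAM positions."""
    first = {n: 0.0 for n in NUCLEOTIDES}
    second = {n: 0.0 for n in NUCLEOTIDES}
    for pair, prob in joint.items():
        first[pair[0]] += prob
        second[pair[1]] += prob
    return first, second


def information_content(distribution):
    """Information content in bits: 2 minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in distribution.values() if p > 0)
    return 2.0 - entropy


# Toy input: identical scores for all 16 variants.
uniform_scores = {a + b: -10.0 for a, b in product(NUCLEOTIDES, repeat=2)}
pos5, pos6 = marginals(dinucleotide_probabilities(uniform_scores))
ic5, ic6 = information_content(pos5), information_content(pos6)
```

When all 16 variants score alike, both marginals are flat and each position carries zero bits, consistent with the qualitative observation reported for FIG. 12A.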
[0052] FIGS. 13A-13D. Protein2PAM-guided engineering of PAM recognition across diverse Cas9 scaffolds. To identify Cas9 proteins amenable to PAM reprogramming, results from in silico single-site mutagenesis (SSM) of 336 Cas9 PAM-interacting domains (PIDs) were analyzed. Mutations that produced an altered PAM with any nucleotide position exhibiting an L1 distance > 0.25 bits from the wild-type PAM were identified. All nucleotide changes passing this threshold were recorded and used to enumerate all possible 10-nt designable PAMs for each Cas9. From these, 10 phylogenetically diverse Cas9 scaffolds were selected with the largest number of designable PAMs and multiple sequence alignments (MSAs) containing > 20 non-redundant sequences (90% identity dereplication), ensuring deep and diverse sequence space for MCMC sampling. For each scaffold, 16 target PAMs were chosen, prioritizing single- and dinucleotide PAMs. 109 MCMC trajectories were run for each target PAM. A trajectory was defined as successful if any sequence achieved a Protein2PAM-predicted similarity > 0.866 to the target PAM. FIG. 13A is an exemplary workflow illustrating identification of Cas9 scaffolds and target PAMs, construction of MSAs, and Protein2PAM-guided MCMC design of PAM variants. FIG. 13B is a strip plot showing success rates for each target PAM across all 10 Cas9
[0053] scaffolds, where each point represents the fraction of successful trajectories for a given target. FIG. 13C is a bar plot showing the fraction of target PAMs per scaffold with at least one successful trajectory. All 10 scaffolds achieved successful designs for > 25% of their target PAMs, with some approaching 100%. FIG. 13D is Protein2PAM-predicted PAM logos for Cas9 enzyme “3300008083_542_9118”. For each target PAM, the highest-scoring prediction is displayed. All targets were successfully designed in silico except NNTNNA (bottom left).
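A single predictor-guided MCMC trajectory of the kind described for FIGS. 13A-13D can be sketched with a Metropolis-style acceptance rule. Only the success threshold (similarity > 0.866) is taken from the description. The predictor `toy_predict_pam` is a hypothetical stand-in and not the trained model; the uniform single-residue proposals (rather than MSA-informed proposals), the temperature, and the step count are likewise assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_predict_pam(sequence):
    """Hypothetical predictor: each residue sets one PAM position (columns: A, C, G, T)."""
    pam = []
    for residue in sequence:
        preferred = AMINO_ACIDS.index(residue) % 4
        pam.append([0.85 if i == preferred else 0.05 for i in range(4)])
    return pam


def cosine_similarity(pam_a, pam_b):
    """Cosine similarity between two flattened PAM matrices."""
    flat_a = [v for pos in pam_a for v in pos]
    flat_b = [v for pos in pam_b for v in pos]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm = math.sqrt(sum(a * a for a in flat_a)) * math.sqrt(sum(b * b for b in flat_b))
    return dot / norm


def run_trajectory(start, target_pam, steps, temperature, rng):
    """Propose single substitutions; accept improvements, and worse moves with some probability."""
    current = start
    current_score = cosine_similarity(toy_predict_pam(current), target_pam)
    best, best_score = current, current_score
    for _ in range(steps):
        site = rng.randrange(len(current))
        proposal = current[:site] + rng.choice(AMINO_ACIDS) + current[site + 1:]
        score = cosine_similarity(toy_predict_pam(proposal), target_pam)
        accept_worse = rng.random() < math.exp((score - current_score) / temperature)
        if score >= current_score or accept_worse:
            current, current_score = proposal, score
        if current_score > best_score:
            best, best_score = current, current_score
    return best, best_score


target_pam = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]  # toy target "GG"
best_seq, best_score = run_trajectory("AA", target_pam, steps=200, temperature=0.05,
                                      rng=random.Random(0))
successful = best_score > 0.866  # success criterion from the description
```

Running many such trajectories per target PAM and counting the fraction that succeed would give success rates of the kind plotted in FIG. 13B.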
[0054] FIG. 14. Comparison of editing efficiencies for PAM-engineered Cas9 variants and SpCas9 across endogenous human genomic targets with PAMs compatible for both enzymes. Indel rates were quantified by targeted sequencing, with each point representing a distinct target site and error bars showing the standard deviation across up to three replicates.
[0055] DETAILED DESCRIPTION
[0056] Machine learning, combined with large evolutionary-scale datasets, provides an opportunity to uncover the relationship between Cas proteins and their PAMs, enabling the efficient engineering of Cas proteins with altered PAM specificity without relying on biophysical modeling or labor-intensive iterative screening, thus accelerating the development of tailored Cas proteins. As disclosed herein, a machine learning framework that accurately predicts PAMs for diverse CRISPR systems, Protein2PAM, was developed.
[0057] Protein2PAM is a deep learning model capable of predicting PAM specificity directly from a Cas protein sequence. Trained on a comprehensive dataset of Cas proteins paired with their natural PAMs, Protein2PAM effectively learned the relationship between Cas effectors and PAM recognition sequences and accurately predicted PAMs for diverse Type I, Type II, and Type V CRISPR systems, aligning with previously reported in vitro assay results and demonstrating four-fold greater sensitivity than existing in silico methods. To engineer PAM-customized genome editors, Protein2PAM was applied to millions of amino acid variants across diverse Cas proteins. For Cas9, the model accurately identified residues known to interact with PAM nucleotides and revealed thousands of new protein-PAM interactions across enzymes lacking crystal structures. Using Protein2PAM, Nme1Cas9 variants were engineered to recognize alternate PAMs, confirming the model's ability to customize PAM specificity without structural modeling or iterative screening.
[0058] The disclosed methods offer a powerful tool for characterizing PAMs and engineering Cas protein-DNA interactions across CRISPR systems. The methods and systems disclosed herein
[0059] can be used to engineer a collection of enzyme variants that target a wide range of PAMs while preserving specificity, a challenge with existing PAM-relaxation methods. The engineered enzymes would facilitate enhanced genome-editing efficiency and precision across the human genome, while achieving allelic specificity at individual disease loci. The methods and systems disclosed herein can be expanded by incorporating experimental data from high-throughput screening of enzyme variants and integrating structural information to improve PAM engineering by modeling residues at DNA-binding interfaces. This would enable the engineering of enzymes where the protein or PAM is conserved in nature, broadening the range of Cas proteins that can be effectively engineered. The methods and systems disclosed herein may also be adapted to other DNA-binding proteins, including recombinases, transposases, and zinc fingers.
[0060] Section headings as used in this section and the entire disclosure herein are merely for organizational purposes and are not intended to be limiting.
[0061] Definitions
[0062] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. For example, any nomenclature used in connection with, and techniques of cell and tissue culture, molecular biology, microbiology, genetics and protein and nucleic acid chemistry and hybridization described herein are those that are well known and commonly used in the art. In case of conflict, the present document, including definitions, will control. Preferred methods and materials are described below, although methods and materials similar or equivalent to those described herein can be used in practice or testing of the present disclosure. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. The materials, methods, and examples disclosed herein are illustrative only and not intended to be limiting.
[0063] As used herein, terms and phrases such as “having,” “may have,” “include,” or “may include” a feature (such as a number, function, operation, or component) indicate the presence of that feature, and do not preclude the presence of other features. Further, as used herein, the phrase “A or B,” “at least one of A and / or B,” or “one or more of A and / or B” may include all possible combinations of A and B. For example, “A or B,” “at least one of A and B,” and “at least one of A or B” may indicate all of the following: (1) comprises at least one A, (2) comprises at least one B, or (3) comprises at least one A and at least one B. Furthermore, as
[0064] used herein, the terms “first” and “second” may modify various components without regard to importance, and do not limit the components. These terms are only used to distinguish one component from another. For example, the first user device and the second user device may indicate user devices that are different from each other regardless of the order or importance of the devices. A first component may be termed a second component, and vice versa, without departing from the scope of the present disclosure.
[0065] For the recitation of numeric ranges herein, each intervening number there between with the same degree of precision is explicitly contemplated. For example, for the range of 6-9, the numbers 7 and 8 are contemplated in addition to 6 and 9, and for the range 6.0-7.0, the numbers 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, and 7.0 are explicitly contemplated.
[0066] It will be understood that when an element (such as a first element) is referred to as being (operatively or communicatively) “coupled” or “connected” to another element (such as a second element), it can be coupled or connected to the other element directly or via a third element. Conversely, it will be understood that when an element (such as a first element) is referred to as being “directly coupled” or “directly connected” to another element (such as a second element), there is no other element (such as a third element) intervening between the element and the other element.
[0067] As used herein, the phrase “configured (or set) to” may be used interchangeably with the phrases “having ... capability,” “designed to,” “adapted to,” “made to,” or “capable of,” as the case may be. The phrase “configured (or set) to” does not necessarily mean “specially designed in hardware.” Rather, the phrase “configured to” may indicate that a device is capable of performing an operation with another device or component. For example, the phrase “a processor configured (or arranged) to perform A, B and C” may refer to a general-purpose processor (such as a CPU or an application processor) or a special-purpose processor (such as an embedded processor) that may perform operations by executing one or more software programs stored in a memory device.
[0068] The various functions described below may be implemented or supported by one or more computer programs, each formed from computer-readable program code and embodied in a computer-readable medium.
[0069] The terms “application” and “program” refer to one or more computer programs, software components, sets of instructions, procedures, functions, objects, classes, instances, related data, or a portion thereof adapted for implementation in suitable computer readable program code.
[0070] As used herein, the term “computer” refers to a machine, apparatus, or device that is capable of accepting and performing logic operations from software code. The term “application,” “software,” “software code” or “computer software” refers to any set of instructions operable to cause a computer to perform an operation. Software code may be operated on by a “rules engine” or “processor.” Thus, in some embodiments, the methods and systems of the present invention may be performed by a computer or computing device having a processor based on instructions received by computer applications and software.
[0071] The term “computer readable medium” as used herein refers to any medium that participates in providing instructions to the processor for execution. A computer readable medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media includes, for example, optical disks, magnetic disks, and magneto-optical disks, such as the hard disk or the removable media drive. Volatile media includes dynamic memory, such as the main memory. Transmission media includes coaxial cables, copper wire, and fiber optics, including the wires that make up the bus. Transmission media may also take the form of acoustic or light waves, such as those generated during radio wave and infrared data communications. Non-transitory computer readable media includes all computer readable media, with the sole exception being a transitory, propagating signal per se.
[0072] As used herein, the term “data network” or “network” shall mean an infrastructure capable of connecting two or more computers, such as client devices, either using wires or wirelessly, allowing them to transmit and receive data. Non-limiting examples of data networks may include the Internet or wireless networks, which may include Wi-Fi and cellular networks. For example, a network may include a local area network (LAN), a wide area network (WAN) (e.g., the Internet), a mobile relay network, a metropolitan area network (MAN), an ad hoc network, a telephone network (e.g., a Public Switched Telephone Network (PSTN)), a cellular network, a Zigbee network, or a voice-over-IP (VoIP) network.
[0073] As used herein, the term “database” shall generally mean a digital collection of data or information. For the purposes of the present disclosure, a database may be stored on a remote server and accessed by a client device (e.g., through the Internet) or alternatively in some embodiments the database may be stored on the client device or remote computer itself.
[0074] As used herein, “nucleic acid” or “nucleic acid sequence” refers to a polymer or oligomer of pyrimidine and / or purine bases, preferably cytosine, thymine, and uracil, and adenine and guanine, respectively (See Albert L. Lehninger, Principles of Biochemistry, 793-800 (Worth Pub. 1982)).
[0075] The present technology contemplates any deoxyribonucleotide, ribonucleotide, or nucleoprotein component, and any chemical variants thereof, such as methylated, hydroxymethylated, or glycosylated forms of these bases, and the like. The polymers or oligomers may be heterogeneous or homogeneous in composition and may be isolated from naturally occurring sources or may be artificially or synthetically produced. In addition, the nucleic acids may be DNA or RNA, or a mixture thereof, and may exist permanently or transitionally in single-stranded or double-stranded form, including homoduplex, heteroduplex, and hybrid states. In some embodiments, a nucleic acid or nucleic acid sequence comprises other kinds of nucleic acid structures such as, for instance, a DNA / RNA helix, peptide nucleic acid (PNA), morpholino nucleic acid (see, e.g., Braasch and Corey, Biochemistry, 41(14): 4503-4510 (2002) and U.S. Pat. No. 5,034,506), locked nucleic acid (LNA; see Wahlestedt et al., Proc. Natl. Acad. Sci. U.S.A., 97: 5633-5638 (2000)), cyclohexenyl nucleic acids (see Wang, J. Am. Chem. Soc., 122: 8595-8602 (2000)), and / or a ribozyme. Hence, the term “nucleic acid” or “nucleic acid sequence” may also encompass a chain comprising non-natural nucleotides, modified nucleotides, and / or non-nucleotide building blocks that can exhibit the same function as natural nucleotides (e.g., “nucleotide analogs”); further, the term “nucleic acid sequence” as used herein refers to an oligonucleotide, nucleotide or polynucleotide, and fragments or portions thereof, and to DNA or RNA of genomic or synthetic origin, which may be single or double-stranded, and represent the sense or antisense strand. The terms “nucleic acid,” “polynucleotide,” “nucleotide sequence,” and “oligonucleotide” are used interchangeably. They refer to a polymeric form of nucleotides of any length, either deoxyribonucleotides or ribonucleotides, or analogs thereof.
[0076] A “vector” or “expression vector” is a replicon, such as plasmid, phage, virus, or cosmid, to which another DNA segment, e.g., an “insert,” may be attached or incorporated so as to bring about the replication of the attached segment in a cell.
[0077] As used herein, “peptide,” “polypeptide,” or “protein” refer to a sequence of two or more amino acids linked by peptide bonds. The polypeptide can be natural, synthetic, or a modification or combination of natural and synthetic. The peptide or polypeptide may be modified by the addition of sugars, lipids or other moieties not included in the amino acid chain.
[0078] The terms “polypeptide,” “oligopeptide,” and “peptide” are used interchangeably herein. The peptide(s) may be produced by recombinant genetic technology or chemical synthesis. The peptide(s) may be isolated and purified by any number of standard methods including, but not limited to, differential solubility (e.g., precipitation), centrifugation, chromatography (e.g., affinity, ion exchange, and size exclusion), or by any other standard techniques known in the art.
[0079] The term “amino acid” or “any amino acid” as used here refers to any and all amino acids, including naturally occurring amino acids (e.g., α-amino acids), unnatural amino acids, modified amino acids, and non-natural amino acids. It includes both D- and L-amino acids. Natural amino acids include those found in nature, such as, e.g., the 23 amino acids that combine into peptide chains to form the building-blocks of a vast array of proteins. These are primarily L stereoisomers, although a few D-amino acids occur in bacterial envelopes and some antibiotics. For the most part, the names of naturally occurring and non-naturally occurring aminoacyl residues used herein follow the naming conventions suggested by the IUPAC Commission on the Nomenclature of Organic Chemistry and the IUPAC-IUB Commission on Biochemical Nomenclature as set out in “Nomenclature of α-Amino Acids (Recommendations, 1974)” Biochemistry, 14(2), (1975). To the extent that the names and abbreviations of amino acids and aminoacyl residues employed in this specification and appended claims differ from those suggestions, they will be made clear to the reader. Throughout the present specification, unless naturally occurring amino acids are referred to by their full name (e.g., alanine, arginine, etc.), they are designated by their conventional three-letter or single-letter abbreviations (e.g., Ala or A for alanine, Arg or R for arginine, etc.). The term “L-amino acid,” as used herein, refers to the “L” isomeric form of a peptide, and conversely the term “D-amino acid” refers to the “D” isomeric form of a peptide (e.g., Dphe, (D)Phe, D-Phe, or DF for the D isomeric form of Phenylalanine). Amino acid residues in the D isomeric form can be substituted for any L-amino acid residue, as long as the desired function is retained by the peptide.
[0080] Nucleic acid or amino acid sequence “identity,” as described herein, can be determined by comparing a nucleic acid or amino acid sequence of interest to a reference nucleic acid or amino acid sequence. A number of mathematical algorithms for obtaining the optimal alignment and calculating identity between two or more sequences are known and incorporated into a number of available software programs.
[0081] Examples of such programs include CLUSTAL-W, T-Coffee, and ALIGN (for alignment of nucleic acid and amino acid sequences), BLAST programs (e.g., BLAST 2.1, BL2SEQ, and later versions thereof) and FASTA programs (e.g., FASTA3x, FASTM, and SSEARCH) (for sequence alignment and sequence similarity searches). Sequence alignment algorithms also are disclosed in, for example, Altschul et al., J. Molecular Biol., 215(3): 403-410 (1990), Biegert et al., Proc. Natl. Acad. Sci. USA, 106(10): 3770-3775 (2009), Durbin et al., eds., Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK (2009), Soding, Bioinformatics, 21(7): 951-960 (2005), Altschul et al., Nucleic Acids Res., 25(17): 3389-3402 (1997), and Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press, Cambridge UK (1997).
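As a minimal illustration of the identity concept described above, the following sketch computes percent identity from two rows of an already-computed pairwise alignment. The function name `percent_identity`, the gap character, and the choice of denominator (alignment columns in which at least one sequence has a residue) are assumptions made for illustration only; each of the programs listed above applies its own alignment procedure and identity convention.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two rows of a pairwise alignment.

    Assumption: identity = identical columns / columns in which at least
    one of the two sequences has a residue. Other denominators (e.g.,
    length of the shorter sequence) are also in common use.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue  # a gap-gap column carries no information
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        return 0.0
    return 100.0 * matches / columns


# Toy aligned sequences (hypothetical, not taken from the sequence listing).
identity_no_gap = percent_identity("ACDE", "ACDF")
identity_with_gap = percent_identity("AC-E", "ACDE")
```

Under this convention, a threshold such as "at least 70% identity" would be checked by comparing the returned value against 70.0.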
[0082] The terms “non-naturally occurring,” “engineered,” and “synthetic” are used interchangeably and indicate the involvement of the hand of man. The terms, when referring to nucleic acid molecules or polypeptides mean that the nucleic acid molecule or the polypeptide is at least substantially free from at least one other component with which it is naturally associated in nature and as found in nature, and / or the nucleic acid molecule or the polypeptide is associated with at least one other component with which it is not naturally associated in nature and / or that there is one or more changes in nucleic acid or amino acid sequence as compared with such sequence as it is found in nature and / or that the nucleic acid or polypeptide sequence was generated de novo, e.g., not based on or derived from any naturally occurring sequence.
[0083] A “parent sequence” as used herein means a sequence comprising a region or residue that is unmodified in relationship to the position being modified in a variant or engineered sequence. By “position” as used herein is meant a location in the sequence of a protein. Positions may be numbered sequentially, or according to an established format.
[0084] Definitions for other specific words and phrases may be provided throughout this patent document. Those of ordinary skill in the art should understand that in many, if not most instances, such definitions apply to prior, as well as future uses of such defined words and phrases.
[0085] None of the description in this application should be read as implying that any particular element, step, or function is an essential element which must be included in the claim scope. The scope of patented subject matter is defined only by the claims.
[0086] Any other term used in the claims, including, but not limited to, “mechanism,” “module,” “device,” “unit,” “assembly,” “element,” “member,” “apparatus,” “machine,” “system,” “processor,” or “controller,” is understood by the applicants to refer to structures known to those of ordinary skill in the relevant art.
[0087] Preferred methods and materials are described below, although methods and materials similar or equivalent to those described herein can be used in practice or testing of the present disclosure. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. The materials, methods, and examples disclosed herein are illustrative only and not intended to be limiting.
[0088] The terms and phrases used herein are used only to describe some embodiments of the present disclosure and do not limit the scope of other embodiments of the present disclosure. It is to be understood that the singular includes plural referents unless the context clearly dictates otherwise. All terms and phrases used herein (including technical and scientific terms and phrases) have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments of the present disclosure belong. It will be further understood that terms and phrases, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein. In some instances, the terms and phrases defined herein may be construed to exclude embodiments of the disclosure.
[0089] PAM Prediction and Engineering
[0090] Disclosed herein are systems and methods for Protospacer Adjacent Motif (PAM) prediction and engineering. In particular, the present disclosure provides systems and methods to predict the PAM recognized by a Cas protein based on a machine learning model trained on amino acid residues utilized for Cas protein PAM recognition. In some embodiments, the methods comprise identifying PAM-specifying mutations (PSMs) in a known Cas protein sequence as those causing a shift at one or more nucleotide positions. In some embodiments, the methods predict PAM-specifying mutations in the input Cas protein sequence.
[0091] The method for PAM prediction, Protein2PAM, takes a Cas protein sequence as input and predicts the probability distribution for each nucleotide at one or more PAM positions (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 PAM positions). The PAM is modeled as a sequence of one or more probability vectors over the nucleotides A, C, G, and T, located either upstream or downstream of the guide sequence.
[0092] The Protein2PAM model architecture comprises a pre-trained 650-million-parameter ESM-2 transformer encoder, followed by a 2-layer multi-layer perceptron (MLP) head responsible for predicting PAM nucleotide probabilities, as shown in FIG. 2. The transformer captures key dependencies between amino acid residues relevant for PAM recognition, and the [CLS] token from the encoder’s final layer is passed to the MLP, which outputs a matrix representing the predicted nucleotide probabilities at each of the one or more PAM positions.
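The shape of the prediction head described above can be sketched in plain Python as follows. This is an illustrative sketch only and not the Protein2PAM implementation: the encoder is replaced by a random stand-in vector for the [CLS] embedding, the weights are untrained, and the layer sizes, the ReLU non-linearity, and the per-position softmax normalization are assumptions not stated in the description.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, for illustration only."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def pam_head(cls_embedding, layer1, layer2, n_positions):
    """2-layer MLP head: embedding -> hidden -> (n_positions x 4) probabilities."""
    hidden = [max(0.0, h) for h in linear(cls_embedding, *layer1)]  # ReLU (assumed)
    logits = linear(hidden, *layer2)
    # One probability vector over A, C, G, T per PAM position.
    return [softmax(logits[4 * i: 4 * i + 4]) for i in range(n_positions)]


rng = random.Random(0)
embed_dim, hidden_dim, n_positions = 16, 8, 3  # toy sizes (assumptions)
layer1 = init_layer(embed_dim, hidden_dim, rng)
layer2 = init_layer(hidden_dim, 4 * n_positions, rng)
# Stand-in for the encoder's final-layer [CLS] embedding of a Cas protein.
cls_embedding = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
pam_matrix = pam_head(cls_embedding, layer1, layer2, n_positions)
```

Each row of `pam_matrix` is a probability vector over A, C, G, and T for one PAM position, matching the representation of the PAM as a sequence of probability vectors.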
[0093] The machine learning model, particularly the transformer encoder, may be trained on any number of Cas protein sequences and their cognate PAM sequences. The Cas protein sequences may be derived from any type of Cas protein. For example, the Cas protein may be a Type I, Type II, or Type V Cas protein. In some embodiments, the Cas protein sequences used for training are derived from Cas proteins from a Type I, Type II, and / or Type V CRISPR system. In some embodiments, the model is trained on a comprehensive dataset of Cas proteins derived from multiple types of CRISPR systems paired with their natural PAMs.
[0094] In some embodiments, the training data comprises full sequences of Cas proteins. In some embodiments, the training data comprises PAM-interacting domain (PID) sequences. In some embodiments, the PAM for any of the Cas proteins or PID sequences was known, determined experimentally, or modeled. In some embodiments, the training data is weighted to those sequences similar to the input Cas protein sequence. For example, homologs (e.g., orthologs, paralogs) of an input Cas protein sequence may be upweighted in the training data.
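One way the similarity weighting described above could be realized is sketched below. The description does not specify a weighting scheme, so the function name `similarity_weights`, the power-law form, and the `floor` and `power` parameters are all hypothetical choices made for illustration.

```python
def similarity_weights(identities, floor=0.1, power=2.0):
    """Turn percent identities (0-100) to the input Cas protein into
    normalized per-example training weights.

    Assumption: weight grows as (identity / 100) ** power, with a floor so
    that distant sequences still contribute to training.
    """
    raw = [max(floor, (pid / 100.0) ** power) for pid in identities]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical identities of three training sequences to the input sequence:
# a close ortholog, a moderate homolog, and a distant sequence.
weights = similarity_weights([90.0, 50.0, 10.0])
```

The resulting weights could then be used, for example, as per-example loss weights or as sampling probabilities during training or fine-tuning.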
[0095] Also provided herein are methods for training a model with full sequences and / or PID sequences of a database of natural or engineered Cas proteins derived from multiple types of CRISPR systems paired with their PAMs. The methods may use, and the model may be trained on, experimental data from high-throughput screening of Cas protein variants and their related PAMs.
[0096] In some embodiments, the training data does not comprise structural information for the Cas proteins. Thus, the methods predict the PAM recognition of any Cas protein based on amino acid sequence. In some embodiments, the methods utilize structural information. For example, structural information may be used to improve PAM engineering by modeling residues at DNA-binding interfaces.
[0097] Similar to the training Cas proteins, the Cas protein of interest, or input Cas protein, for which the PAM is being predicted may also be a Cas protein derived from any type of CRISPR system.
[0098] In some embodiments, the Cas protein of interest is a known Cas protein. Alternatively, the Cas protein of interest may be an engineered Cas protein. For example, the Cas protein of interest may be a rationally designed engineered Cas protein, an evolved engineered Cas protein, or a Cas protein generated using artificial intelligence. In some embodiments, the Cas protein of interest may be a variant Cas protein comprising one or more amino acid additions, deletions, or substitutions as compared to a known or parent Cas protein.
[0099] In some embodiments, the methods further comprise generating one or more variants of the Cas protein of interest and predicting the PAM sequence for each of the one or more variants. The one or more variants may comprise an amino acid sequence having one or more amino acid substitutions, deletions, or additions as compared to the Cas protein of interest amino acid sequence. The methods may iteratively introduce one or more amino acid substitutions in an input Cas protein sequence to direct engineering to a Cas protein recognizing a desired PAM.
[0100] The one or more amino acid substitutions, deletions, or additions may be in those positions identified as PAM-specifying regions. Alternatively or additionally, the one or more amino acid substitutions, deletions, or additions may be in the PAM-interacting domain. For example, the method may iteratively introduce combinatorial variants of a Cas protein sequence followed by PAM prediction to drive the engineered Cas protein to be configured to recognize a desired PAM sequence. Thus, the methods may be used to engineer a Cas protein with a particular PAM sequence, e.g., for use with a desired target sequence.
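The iterative generate-and-evaluate procedure described above can be sketched as a simple greedy search. This is one possible search strategy chosen for illustration, not necessarily the one used by the applicants. The `predict` argument stands for any model that maps a protein sequence to per-position nucleotide probabilities; `toy_predict` below is a hypothetical stand-in and is not Protein2PAM. The scoring function and all names are likewise assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"


def pam_score(pam_matrix, target_pam):
    """Product of predicted probabilities of the desired nucleotide at each position."""
    score = 1.0
    for probs, nucleotide in zip(pam_matrix, target_pam):
        score *= probs[NUCLEOTIDES.index(nucleotide)]
    return score


def greedy_engineer(seq, positions, target_pam, predict, max_rounds=5):
    """In each round, accept the single substitution that most improves the score.

    positions: 0-based indices allowed to mutate (e.g., PAM-interacting residues).
    predict:   callable mapping a sequence to a list of [pA, pC, pG, pT] rows.
    """
    best_seq, best_score = seq, pam_score(predict(seq), target_pam)
    for _ in range(max_rounds):
        candidate, candidate_score = None, best_score
        for pos in positions:
            for aa in AMINO_ACIDS:
                if aa == best_seq[pos]:
                    continue
                variant = best_seq[:pos] + aa + best_seq[pos + 1:]
                score = pam_score(predict(variant), target_pam)
                if score > candidate_score:
                    candidate, candidate_score = variant, score
        if candidate is None:
            break  # no single substitution improves the prediction further
        best_seq, best_score = candidate, candidate_score
    return best_seq, best_score


def toy_predict(seq):
    """Hypothetical stand-in predictor: the residue at index 2 sets PAM
    position 0 and the residue at index 4 sets PAM position 1."""
    def row(favoured):
        return [0.85 if n == favoured else 0.05 for n in NUCLEOTIDES]
    first = "G" if seq[2] == "R" else "A"
    second = "G" if seq[4] == "K" else "T"
    return [row(first), row(second)]


engineered, score = greedy_engineer("MAQLNE", [2, 4], "GG", toy_predict)
```

With the toy predictor, the search accepts one substitution per round (first at index 2, then at index 4) and stops once no further substitution raises the predicted probability of the target PAM. A real workflow would substitute a trained model for `toy_predict` and would typically also evaluate the retained activity of each variant.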
[0101] In some embodiments, the methods further comprise distinguishing high-confidence and low-confidence PAM predictions. The distinguishing comprises quantifying confidence based on protein language model (pLM) embeddings and sequence identity as compared to sequences used to train the machine learning model. For example, as schematized in FIG. 2, the training Cas protein sequences nearest to the Cas protein of interest are identified based on sequence identity, encoded into a vector, and projected into a dimensional space. The projected vector is added to the embedding from the transformer encoder in that dimensional space, and the combined vector is passed through a multi-layer perceptron. A sigmoid activation is applied to the MLP output, which is interpreted as the predicted PAM similarity.
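The data flow of the confidence estimate described above can be sketched as follows. The description does not state how the nearest training sequences are encoded, so the sketch assumes the vector simply holds their fractional identities to the query; the layer sizes, the ReLU non-linearity, the untrained random weights, and all names are also assumptions for illustration.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random (untrained) weights, for illustration only."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_confidence(embedding, nearest_identities, proj, layer1, layer2):
    """Combine the encoder embedding with nearest-neighbour identity features."""
    # Assumed encoding: fractional identity to each of the k nearest training sequences.
    identity_vector = [pid / 100.0 for pid in nearest_identities]
    projected = linear(identity_vector, *proj)  # project to the embedding dimension
    combined = [e + p for e, p in zip(embedding, projected)]  # element-wise addition
    hidden = [max(0.0, h) for h in linear(combined, *layer1)]  # ReLU (assumed)
    (logit,) = linear(hidden, *layer2)
    return sigmoid(logit)  # read as the predicted PAM similarity, in (0, 1)


rng = random.Random(0)
embed_dim, k_nearest, hidden_dim = 16, 5, 8  # toy sizes (assumptions)
proj = init_layer(k_nearest, embed_dim, rng)
layer1 = init_layer(embed_dim, hidden_dim, rng)
layer2 = init_layer(hidden_dim, 1, rng)
embedding = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]  # stand-in for the encoder output
confidence = predict_confidence(
    embedding, [92.0, 88.5, 71.0, 64.2, 50.3], proj, layer1, layer2
)
```

A threshold on the returned value could then be used to separate high-confidence from low-confidence PAM predictions; the choice of threshold is not specified here.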
[0102] The models and the means of training the models may be adapted to other nucleic acid binding proteins which have a sequence specificity, e.g., recombinases, transposases, and zinc fingers.
[0103] Thus, similar models may be used to predict the recognition sequences for other nucleic acid binding proteins. In such instances, the models may be trained with a plurality of proteins of similar type as the input sequence and their respective nucleic acid binding sequences, and weighted as necessary based on sequence similarity, such that their sequence specificity can be predicted.
[0104] In some embodiments, the technology described herein is associated with a programmable machine designed to perform a sequence of arithmetic or logical operations as provided by the methods described herein. For example, some embodiments of the technology are associated with (e.g., implemented in) computer software and / or computer hardware. In one aspect, the technology relates to a computer comprising a form of memory, an element for performing arithmetic and logical operations, and a processing element (e.g., a microprocessor) for executing a series of instructions (e.g., a method as provided herein) to read, manipulate, and store data.
[0105] In some embodiments, the various embodiments of the present disclosure are associated with a plurality of programmable devices that operate in concert to perform a method as described herein. For example, in some embodiments, a plurality of computers (e.g., connected by a network) may work in parallel to collect and process data, e.g., in an implementation of cluster computing or grid computing or some other distributed computer architecture that relies on complete computers (with onboard CPUs, storage, power supplies, network interfaces, etc.) connected to a network (private, public, or the internet) by a conventional network interface, such as Ethernet, fiber optic, or by a wireless network technology.
[0106] For example, some embodiments provide a computer that includes a computer-readable medium. The embodiment includes a random access memory (RAM) coupled to a processor. The processor executes computer-executable program instructions stored in memory. Such processors may include a microprocessor, an ASIC, a state machine, or other processor, and can be any of a number of computer processors, such as processors from Intel Corporation of Santa Clara, California and Motorola Corporation of Schaumburg, Illinois. Such processors include, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor, cause the processor to perform the steps described herein.
[0107] Computers are connected in some embodiments to a network.
[0108] Computers may also include a number of external or internal devices such as a mouse, a CD-ROM, DVD, a keyboard, a display, or other input or output devices. Examples of computers are personal computers, digital assistants, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, internet appliances, and other processor-based devices. In general, the computers related to aspects of the technology provided herein may be any type of processor-based platform that operates on any operating system, such as Microsoft Windows, Linux, UNIX, Mac OS X, etc., capable of supporting one or more programs comprising the technology provided herein. Some embodiments comprise a personal computer executing other application programs (e.g., applications). The applications can be contained in memory and can include, for example, a word processing application, a spreadsheet application, an email application, an instant messenger application, a presentation application, an Internet browser application, a calendar / organizer application, and any other application capable of being executed by a client device. All such components, computers, and systems described herein as associated with the technology may be logical or virtual.
[0109] Engineered Cas Proteins
[0110] Disclosed herein are engineered Cas proteins. In some embodiments, the engineered Cas proteins comprise one or more amino acid substitutions, deletions, or additions configured to modify Protospacer Adjacent Motif (PAM) recognition and / or specificity. In some embodiments, the engineered Cas protein comprises one or more amino acid substitutions, deletions, or additions in the PAM-interacting domain. See FIG. 4D for schematics showing the PAM-interacting domains of various Cas proteins.
[0111] The Cas protein may be any Cas protein which utilizes a PAM for recognition of a target DNA. In some embodiments, the Cas protein is a Cas protein from a Type I, Type II, or Type V CRISPR system. In some embodiments, the Cas protein is Cas8. In some embodiments, the Cas protein is Cas9. In some embodiments, the Cas protein is Cas12.
[0112] In some embodiments, the engineered Cas proteins comprise an amino acid sequence having one or more substitutions, insertions, or deletions relative to SEQ ID NO: 1 which modifies Protospacer Adjacent Motif (PAM) recognition and / or specificity of the engineered Cas protein as compared to that of SEQ ID NO: 1. In some embodiments, the engineered Cas proteins comprise an amino acid sequence having at least 70% identity (e.g., having at least 75%, at least 80%, at least 85%, at least 90%, at least 92%, at least 95%, at least 96%, at least 97%, at least 98%, or at least 99%) to SEQ ID NO: 1.
[0113] In some embodiments, the engineered Cas proteins comprise one or more amino acid substitutions relative to SEQ ID NO: 1. In some embodiments, the engineered Cas proteins comprise one or more amino acid substitutions relative to SEQ ID NO: 1 as shown in Table 1.
[0114] In some embodiments, the one or more amino acid substitutions are at positions selected from: 938, 940, 954, 956, 957, 960, 966, 979, 980, 981, 982, 987, 989, 990, 995, 996, 1000, 1003, 1007, 1008, 1010, 1011, 1014, 1015, 1016, 1017, 1018, 1020, 1021, 1022, 1023, 1024, 1026, 1027, 1029, 1030, 1031, 1032, 1034, 1037, 1038, 1039, 1040, 1041, 1044, 1046, 1048, 1050, 1053, 1055, and 1056, relative to SEQ ID NO: 1.
[0115] In some embodiments, the engineered Cas proteins comprise one or more amino acid substitutions selected from: H938D or H938G; G940A; E954C; G956A; D957G; Y960H; S966A; V979I or V979C; V980K or V980I; Q981A, Q981T, Q981R, or Q981G; G982F or G982Y; D987N; Q989T; L990V; F995Y; N996Q, N996E, or N996K; S1000V; P1003S; V1007I; E1008K; I1010K or I1010T; T1011A; A1014N, A1014D, or A1014K; R1015S; M1016F, M1016K, M1016R, or M1016I; F1017L; G1018A; F1020Y; A1021S, A1021V, or A1021I; S1022G or S1022N; C1023L or C1023F; H1024D or H1024E; G1026E, G1026S, or G1026A; T1027N or T1027D; N1029A, N1029G, N1029H, or N1029R; I1030F; N1031S or N1031D; I1032L; I1034T; L1037K; D1038E; H1039K or H1039N; K1040T or K1040S; I1041K or I1041V; N1044D; I1046V; E1048R or E1048Q; I1050V; K1053Q; A1055L; and L1056V, relative to SEQ ID NO: 1.
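The substitution notation used above (e.g., Q981A: the parent residue, the 1-based position, then the replacement residue) can be parsed and applied to a parent sequence as in the following sketch. The function names are hypothetical, and the short toy parent sequence is for illustration only; it is not SEQ ID NO: 1.

```python
import re

# One-letter parent residue, 1-based position, one-letter replacement residue.
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(code):
    """Split a code such as 'Q981A' into ('Q', 981, 'A')."""
    match = _SUBSTITUTION.match(code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    return match.group(1), int(match.group(2)), match.group(3)


def apply_substitutions(parent, codes):
    """Return the variant sequence, checking that each parent residue matches."""
    residues = list(parent)
    for code in codes:
        parent_aa, position, new_aa = parse_substitution(code)
        if not 1 <= position <= len(residues):
            raise ValueError(f"{code}: position outside the parent sequence")
        if residues[position - 1] != parent_aa:
            raise ValueError(
                f"{code}: parent has {residues[position - 1]} at position {position}"
            )
        residues[position - 1] = new_aa
    return "".join(residues)


# Toy parent sequence (hypothetical).
variant = apply_substitutions("MKQHN", ["Q3A", "H4D"])
```

Checking the parent residue before substituting guards against applying a code to a sequence with a different numbering than the reference it was defined against.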
[0117] In some embodiments, the engineered Cas proteins comprise one or more amino acid substitutions at positions: 981, 1024, and 1029, relative to SEQ ID NO: 1. In select embodiments the engineered Cas proteins comprise a substitution at position 981. In select embodiments the engineered Cas proteins comprise a substitution at position 981 and a substitution at position 1024 and / or position 1029. In select embodiments the engineered Cas proteins comprise a substitution at position 1024. In select embodiments the engineered Cas proteins comprise a substitution at position 1024 and a substitution at position 981 and / or position 1029. In select embodiments the engineered Cas proteins comprise a substitution at position 1029. In select embodiments the engineered Cas proteins comprise a substitution at position 1029 and a substitution at position 981 and / or position 1024. In some embodiments, the engineered Cas proteins comprise one or more amino acid substitutions selected from: Q981A, H1024D or H1024E, and N1029A or N1029G, relative to SEQ ID NO: 1.
[0118] In some embodiments, the engineered Cas proteins further comprise one or more amino acid substitutions at positions: 957, 982, 1026, and 1048. For example, the engineered Cas proteins may comprise one or more amino acid substitutions at positions: 981, 1024, and 1029, as described above. In some embodiments, the engineered Cas proteins further comprise one or more amino acid substitutions selected from: D957G, G982F or G982Y, G1026S or G1026A, and E1048R or E1048Q.
[0119] In some embodiments, the engineered Cas proteins comprise an amino acid sequence with at least 70% identity (e.g., at least 75%, at least 80%, at least 85%, at least 90%, at least 91%, at least 92%, at least 93%, at least 94%, at least 95%, at least 96%, at least 97%, at least 98%, at least 99%) to any of SEQ ID NOs: 2-31. In some embodiments, the engineered Cas proteins comprise any of SEQ ID NOs: 2-31.
[0120] Also disclosed herein are engineered Cas proteins comprising one or more substitutions at positions as shown in Table 1 when compared to a “PARENT” Cas protein sequence.
[0121] Any of the engineered Cas proteins described herein may comprise one or more (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, or more, etc.) amino acid substitutions as compared to the recited sequences. An amino acid “replacement” or “substitution” refers to the replacement of one amino acid at a given position or residue by another amino acid at the same position or residue within a polypeptide sequence. Amino acids are broadly grouped as “aromatic” or “aliphatic.” An aromatic amino acid includes an aromatic ring. Examples of aromatic amino acids include histidine (H or His), phenylalanine (F or Phe), tyrosine (Y or Tyr), and tryptophan (W or Trp). Non-aromatic amino acids are broadly grouped as aliphatic. Examples of aliphatic amino acids include glycine (G or Gly), alanine (A or Ala), valine (V or Val), leucine (L or Leu), isoleucine (I or Ile), methionine (M or Met), serine (S or Ser), threonine (T or Thr), cysteine (C or Cys), proline (P or Pro), glutamic acid (E or Glu), aspartic acid (D or Asp), asparagine (N or Asn), glutamine (Q or Gln), lysine (K or Lys), and arginine (R or Arg).
[0122] The amino acid replacement or substitution can be conservative, semi-conservative, or non-conservative. The phrase “conservative amino acid substitution” or “conservative mutation” refers to the replacement of one amino acid by another amino acid with a common property. A functional way to define common properties between individual amino acids is to analyze the normalized frequencies of amino acid changes between corresponding proteins of homologous organisms (Schulz and Schirmer, Principles of Protein Structure, Springer-Verlag, New York (1979)).
[0123] According to such analyses, groups of amino acids may be defined where amino acids within a group exchange preferentially with each other and therefore resemble each other most in their impact on the overall protein structure (Schulz and Schirmer). Examples of conservative amino acid substitutions include substitutions of amino acids within the sub-groups described above, for example, lysine for arginine and vice versa such that a positive charge may be maintained, glutamic acid for aspartic acid and vice versa such that a negative charge may be maintained, serine for threonine such that a free -OH can be maintained, and glutamine for asparagine such that a free -NH2 can be maintained. “Semi-conservative mutations” include amino acid substitutions of amino acids within the same groups listed above, but not within the same sub-group. For example, the substitution of aspartic acid for asparagine, or asparagine for lysine, involves amino acids within the same group, but different sub-groups. “Non-conservative mutations” involve amino acid substitutions between different groups, for example, lysine for tryptophan, or phenylalanine for serine, etc.
[0124] The present disclosure also provides fusion proteins comprising the engineered Cas proteins described herein and one or more effector or functional domains. The fusion proteins are not limited by orientation or directionality of the engineered Cas protein and the one or more effector domains. For example, any single effector domain may be fused to the N-terminus or C-terminus of the engineered Cas protein, in any orientation, e.g., N-terminus to N-terminus, C-terminus to C-terminus, N-terminus to C-terminus, or C-terminus to N-terminus, and directly or indirectly (e.g., fused to another effector domain fused to the N-terminus).
[0125] Effector or functional domains are proteins or fragments thereof that can modify, regulate, or act as a tag for a target nucleic acid. For example, an effector domain can be used to target enzymatic activities to a nucleic acid sequence which the engineered Cas protein targets (e.g., by way of a guide RNA, described elsewhere herein). In some embodiments, an effector domain is a fragment of protein that has been separated from its natural DNA binding domain and engineered to be part of a fusion protein with an engineered Cas protein described herein. In some embodiments, an effector domain is a protein which normally binds to other proteins or factors for recruitment to a specific or non-specific nucleic acid.
[0126] An effector or functional domain may comprise a number of functionalities, including but not limited to, recombinase function, epigenetic modifying function (e.g., histone acetylase function, histone deacetylase function), integrase function, resolvase function, invertase function,
[0127] protease function, nuclease function, DNA methyltransferase function, DNA demethylase function, transcriptional repressor function, transcriptional activator function, DNA binding protein function, transcription factor recruiting protein function, nuclear-localization signal function, DNA editing function (e.g., deaminase), degradation signaling, or any combination thereof. In some embodiments, the one or more effector or functional domains include a transcription activator, a transcription repressor, a deaminase, a polymerase (e.g., reverse transcriptase), an epigenetic modifier, a detection agent (e.g., fluorescent protein or protein tag), or a combination thereof.
[0128] In some embodiments, the fusion protein is used to modulate gene regulatory activity, such as transcriptional or translational activity. For example, the one or more effector domain comprises a transcription activator and / or repressor activity that can affect transcription upstream and downstream of coding regions, and can be used to activate or repress gene expression. In some embodiments, the one or more effector domain includes domains from transcription factors (activators, repressors, coactivators, co-repressors), silencers, and / or chromatin associated proteins and their modifiers (e.g., methylases, demethylases, acetylases and deacetylases).
[0129] In some embodiments, the one or more effector domain comprises transcriptional repressor function. Transcription repressors prevent, partially or completely, the transcription of genes near to the target site. Exemplary transcriptional repressors include, but are not limited to, KRAB-domain containing proteins, SID, and Sp1.
[0130] In some embodiments, the one or more effector domain comprises transcriptional activator function. Transcriptional activators can be generally defined as proteins, or domains thereof, that bind to specific sites on promoter DNA and bring about increased transcription of specific genes through interactions with other proteins. Exemplary transcriptional activators include, but are not limited to, VP64, p65, p53, c-Myb, GATA-1, EKLF, MyoD, E2F, dTCF, Tat, HSF1, RTA and SET7 / 9.
[0131] In some embodiments, a fusion protein as disclosed herein can comprise an effector domain comprising a transcriptional effector recruiting domain, or active fragment thereof. The transcriptional effector recruiting domain can recruit transcriptional activators or repressors, e.g., to the specific nucleic acid sequence to which the engineered Cas protein is bound, thereby localizing the activators and repressors to modulate gene expression in a targeted manner.
[0132] In some embodiments, the one or more effector domain comprises DNA methyltransferase or DNA methylase function. DNA methyltransferases (DNMT’s) are a family of DNA modifying proteins composed of different isomers (e.g., DNMT1, DNMT3A, and DNMT3B). Other exemplary DNA methyltransferases include SssI methylase, AluI methylase, HaeIII methylase, HhaI methylase, and HpaII methylase. Their main mechanism of action is addition of a methyl group to the fifth carbon of a cytosine residue (5mC) located adjacent to a guanine residue. DNA demethylation can be mediated by at least three enzyme families: (i) the ten-eleven translocation (TET) family, mediating the conversion of 5mC into 5hmC; (ii) the AID / APOBEC family, acting as mediators of 5mC or 5hmC deamination; and (iii) the BER (base excision repair) glycosylase family involved in DNA repair.
[0133] In some embodiments, the one or more effector domain modifies epigenetic signals and thereby modifies gene regulation, for example by promoting histone acetylase and histone deacetylase activity. The term “epigenetic modifier,” as used herein, refers to a protein or catalytic domain thereof having enzymatic activity that results in the epigenetic modification of DNA, for example, chromosomal DNA. Epigenetic modifications include, but are not limited to, histone modifications including methylation and demethylation (e.g., mono-, di- and trimethylation), histone acetylation and deacetylation, as well as histone ubiquitylation, phosphorylation, and sumoylation.
[0134] Histone acetylation and deacetylation are the processes by which the lysine residues within the N-terminal tail protruding from the histone core of the nucleosome are acetylated and deacetylated as part of gene regulation. These reactions are typically catalyzed by enzymes with histone acetyltransferase (HAT) or histone deacetylase (HDAC) activity. Histone acetyltransferases include GNAT family proteins (e.g., Gcn5, Gcn5L, p300 / CREB-binding protein associated factor (PCAF), Elp3, HPA2 and HAT1) and MYST family proteins (e.g., Sas3, essential SAS-related acetyltransferase (Esa1), Sas2, Tip60, MOF, MOZ, MORF, and HBO1). Histone deacetylases fall into four classes. Class I includes HDACs 1, 2, 3, and 8. Class II is divided into two subgroups, Class IIA and Class IIB. Class IIA includes HDACs 4, 5, 7, and 9 while Class IIB includes HDACs 6 and 10. Class III contains the Sirtuins and Class IV contains only HDAC11. Classes of HDAC proteins are divided and grouped together based on the comparison to the sequence homologies of Rpd3, Hos1 and Hos2 for Class I HDACs, HDA1 and Hos3 for the Class II HDACs and the sirtuins for Class III HDACs.
[0135] The site-specific methylation and demethylation of histone residues are catalyzed by methyltransferases and demethylases, respectively. Histone methylases transfer methyl groups to amino acids (e.g., lysine and arginine) of histone proteins, ultimately affecting transcription of genes. Methylases include SET1, MLL, SMYD3, G9a, GLP, EZH2, and SETDB1. Histone demethylases catalyze the removal of methyl marks from histones, an activity associated with transcriptional regulation and DNA damage repair. Demethylases include, for example, KDM1A, KDM1B, KDM2A, KDM2B, UTX, UTY, Jumonji C (JmjC) domain-containing demethylases, and GSK-J4.
[0136] In some embodiments, the one or more effector domain comprises recombinase activity. A recombinase is a site-specific enzyme that mediates the recombination of DNA between recombinase recognition sequences, which results in the excision, integration, inversion, or exchange (e.g., translocation) of DNA fragments between the recombinase recognition sequences. Recombinases can be classified into two distinct families: serine recombinases (e.g., resolvases and invertases) and tyrosine recombinases (e.g., integrases). Examples of serine recombinases include, without limitation, Hin, Gin, Tn3 (also known as TnpR), β-six, CinH, ParA, γδ, Bxb1, φC31, TP901, TG1, φBT1, R4, φRV1, φFC1, MR11, A118, U153, and gp29. Examples of tyrosine recombinases include, without limitation, Cre, FLP, R, Lambda, HK101, HK022, and pSAM2.
[0137] In some embodiments, the one or more effector domain comprises DNA editing function (e.g., deaminase, DNA repair activity, DNA damage activity, dismutase activity, alkylation activity, depurination activity, oxidation activity, pyrimidine dimer forming activity, polymerase activity (e.g., reverse transcriptase), ligase activity, helicase activity, photolyase activity or glycosylase activity). In some embodiments, the one or more effector domain comprises a reverse transcriptase. In some embodiments, the one or more effector domain comprises a deaminase, e.g., a cytosine deaminase or an adenine deaminase.
[0138] Kinases, phosphatases, and other proteins that modify or regulate other polypeptides involved in gene regulation are also useful as effector domains. Such modifiers are often involved in switching on or off transcription mediated by, for example, hormones. Other useful domains for regulating gene expression can also be obtained from the gene products of oncogenes (e.g., myc, jun, fos, myb, max, mad, rel, ets, bcl, mos family members) and their associated factors and modifiers.
[0139] In some embodiments, the one or more effector domain comprises an integrase.
[0140] Integrases allow for the insertion of nucleic acids, for example, into a host genome (mammalian, human, mouse, rat, monkey, frog, fish, plant (including crop plants and experimental plants like Arabidopsis), laboratory or biomedical cell lines or primary cell cultures, C. elegans, fly (Drosophila), etc.). Examples include retroviral integrases, such as that of HIV (human immunodeficiency virus), and lambda integrase.
[0141] In some embodiments, the one or more effector domain comprises transposase functionality. Transposases are enzymes that bind to the end of a transposon and catalyze its movement by a cut and paste mechanism or a replicative transposition mechanism. Exemplary transposases include, but are not limited to, Tc1 transposase, Mos1 transposase, Tn5 transposase, and Mu transposase.
[0142] In some embodiments, the one or more effector domain comprises invertase activity. Invertase activity can be used to alter genome structure by swapping the orientation of a DNA fragment.
[0143] In some embodiments, the one or more effector domain comprises resolvase activity. Resolvases are site-specific recombinases that function to excise (as a circle) a segment of DNA contained between two recombination sites (called res) and include, for example, RuvC resolvase, Holliday junction resolvase Hjc, Tn3 and γδ resolvase.
[0144] In some embodiments, the one or more effector domain comprises a peptide or polypeptide sequence responsive to a ligand, such as a hormone receptor ligand binding domain, including, for example, the ligand binding domains of the estrogen receptor, the glucocorticosteroid receptor, and the like. Such effector domains can be used to act as “gene switches,” and be regulated by inducers, such as small molecule or protein ligands, specific for the ligand binding domain.
[0145] In some embodiments, the one or more effector domain comprises sequences or domains of polypeptides that mediate direct or indirect protein-protein interactions, including, for example, a leucine zipper domain, a STAT protein N terminal domain, and / or an FK506 binding protein.
[0146] In some embodiments, the activity mediated by the one or more effector domain is a non-biological activity, such as a fluorescence activity (e.g., fluorescent proteins), luminescence activity (e.g., a luminescent protein or enzyme which results in luminescence when interacting
[0147] with a substrate (e.g., luciferase)), or binding activity, such as those mediated by maltose binding protein (“MBP”), glutathione S transferase (GST), hexahistidine, c-myc, and the FLAG epitope, for facilitating detection, purification, monitoring expression, and / or monitoring cellular and subcellular localization of the engineered Cas protein to which the effector domain is appended. In such embodiments, the fusion proteins can also be used as a diagnostic reagent, for example, to detect mutations in gene sequences, to purify restriction fragments from a solution, or to visualize nucleic acids.
[0148] In some embodiments, the effector domain facilitates temporal modulation of the fusion protein, and accordingly the engineered Cas protein. Thus, in some embodiments, the effector domain is a degron. Degrons may be ubiquitin-independent degrons, not necessary for the polyubiquitination of their protein. Alternatively, ubiquitin-dependent degrons are implicated in the polyubiquitination process for targeting a protein to the proteasome. For example, the effector domain may comprise a truncated geminin protein. Geminin is a direct substrate of E3 ubiquitin ligase complex APC / Cdh1 and is actively ubiquitinated in the M / G1 phase. Thus, degradation of the fusion protein will be promoted during the M / G1 phase thereby restricting activity of the fusion protein to largely the G2 / S phase.
[0149] The effector domains described herein are illustrative and merely provide the skilled artisan with examples of effectors that can be used in combination with the engineered Cas proteins, systems and methods described herein.
[0150] In some embodiments, the one or more effector or functional domain and the engineered Cas protein are covalently linked in a single amino acid chain through a linker. The linker may have any of a variety of amino acid sequences. Proteins can be joined by a linker polypeptide, generally of a flexible nature, although other chemical linkages are not excluded. Suitable linkers include polypeptides of between 1 amino acid and 100 amino acids in length, between 4 amino acids and 40 amino acids in length, or between 4 amino acids and 25 amino acids in length. The linking peptides may have virtually any amino acid sequence, bearing in mind that the preferred linkers will have a sequence that results in a generally flexible peptide. Small amino acids, such as glycine and alanine, are useful in creating a flexible peptide linker. A variety of different linkers are considered suitable for use, including but not limited to, glycine-serine polymers, glycine-alanine polymers, and alanine-serine polymers. Such fusion proteins can be expressed recombinantly from a single nucleic acid encoding the amino acid chain.
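As a minimal sketch of joining two polypeptides through a flexible linker, the following builds a glycine-serine polymer and checks its length against one of the ranges recited above (4 to 25 amino acids). The GGGGS repeat unit is an assumption chosen for illustration, as are the function names and the placeholder sequences; the disclosure does not prescribe a particular linker sequence.

```python
def gly_ser_linker(repeats: int, unit: str = "GGGGS") -> str:
    """Return a glycine-serine linker made of `repeats` copies of `unit`."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    return unit * repeats


def fuse(n_terminal: str, c_terminal: str, linker: str,
         min_len: int = 4, max_len: int = 25) -> str:
    """Join two polypeptides through a linker whose length is range-checked."""
    if not (min_len <= len(linker) <= max_len):
        raise ValueError(
            f"linker length {len(linker)} outside {min_len}-{max_len}"
        )
    # Single amino acid chain: N-terminal partner, linker, C-terminal partner.
    return n_terminal + linker + c_terminal


# Placeholder sequences, not real proteins.
linker = gly_ser_linker(3)
fusion = fuse("MKT", "VSK", linker)
```

Swapping the order of the first two arguments gives the opposite orientation, and the min_len and max_len defaults can be widened to the broader ranges recited above.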
[0151] Alternatively, the one or more effector domain and the engineered Cas protein may be individually fused to one half of a binding pair (e.g., from a recruitment system) and, when introduced into the same system or location, the one or more effector domain and the engineered Cas protein form a protein conjugate through the recruitment system. The recruitment system can comprise any binding pair. For example, the recruitment system may comprise an aptamer and an aptamer binding protein. The recruitment system may be a so-called split system. Split systems include two or more polypeptide chains that reassemble into an operable fusion protein or protein conjugate upon association of the two binding partners. Split systems include, but are not limited to, intein, MS2, or SunTag based systems.
[0152] In some embodiments, the aptamer sequence is a nucleic acid (e.g., RNA aptamer) sequence. In some embodiments, the guide RNA also comprises a sequence of one or more RNA aptamers, or distinct RNA secondary structures or sequences that can recruit and bind another molecular species, an adaptor molecule, such as a nucleic acid or protein. Any RNA aptamer / aptamer binding protein pair known may be selected and used in connection with the present disclosure (see, e.g., Jayasena, S.D., Clinical Chemistry, 45(9): p. 1628-1650, (1999); Gelinas, et al., Current Opinion in Structural Biology, 36: p. 122-132, (2016); and Hasegawa, H., Molecules, 21(4): p. 421 (2016), incorporated herein by reference).
[0153] In some embodiments, the aptamer sequence is a peptide aptamer sequence. In some embodiments, the engineered Cas protein comprises the peptide aptamer sequence and the effector domain comprises the peptide aptamer binding protein. In some embodiments, the effector domain comprises the peptide aptamer sequence and the engineered Cas protein comprises the peptide aptamer binding protein. The peptide aptamer sequence or peptide aptamer binding protein may be fused in any orientation (e.g., N-terminus to C-terminus, C-terminus to N-terminus, N-terminus to N-terminus). The peptide aptamer sequence or peptide aptamer binding protein may be fused by a linker region. Suitable linker regions are known in the art. The linker may be flexible or configured to allow the functionality and association with the DNA or other proteins with decreased steric hindrance. The linker sequences may provide an unstructured or linear region of the polypeptide, for example, with the inclusion of one or more glycine and / or serine residues. The linker sequences can be at least about 2, 3, 4, 5, 6, 7, 8, 9, 10 or more amino acids in length.
[0154] The peptide aptamers can be naturally occurring or synthetic peptides that are specifically recognized by an affinity agent. Such aptamers include, but are not limited to, a c-Myc affinity tag, an HA affinity tag, a His affinity tag, an S affinity tag, a methionine-His affinity tag, an RGD-His affinity tag, a 7×His tag, a FLAG octapeptide, a strep tag or strep tag II, a V5 tag, or a VSV-G epitope. Corresponding aptamer binding proteins are well-known in the art and include, for example, primary antibodies, biotin, affimers, single domain antibodies, and antibody mimetics.
[0155] Any of the effector domains and engineered Cas proteins disclosed herein may further comprise one or more proteins, polypeptides (e.g., protein domain sequences), or peptides. For example, the engineered Cas proteins and / or fusion proteins disclosed herein may be fused to another protein or protein domain that provides for tagging or visualization. The one or more proteins, polypeptides (e.g., protein domain sequences), or peptides may be appended at an N-terminus, a C-terminus, internally, or a combination thereof. The one or more proteins, polypeptides (e.g., protein domain sequences), or peptides may be fused in any orientation in relationship to the disclosed protein. The one or more proteins, polypeptides (e.g., protein domain sequences), or peptides may be fused via a linker, as described above.
[0156] In some embodiments, the engineered Cas proteins further comprise a localization or signal sequence (e.g., nuclear localization sequence), a sequence tag (e.g., a tag for detection, purification, and / or monitoring expression), a protein transduction domain sequence, or a combination thereof.
[0157] In some embodiments, the engineered Cas proteins comprise one or more nuclear localization sequences (NLSs). The nuclear localization sequence may be appended, for example, to the N-terminus, a C-terminus, internally, or a combination thereof. In such cases when the engineered Cas proteins comprise two or more NLSs, the NLSs may be in tandem, separated by a linker, at either end of the protein, or one or more may be embedded in the protein.
[0158] The nuclear localization sequence may comprise any amino acid sequence known in the art to functionally tag or direct a protein for import into a cell’s nucleus (e.g., for nuclear transport). Usually, a nuclear localization sequence comprises one or more positively charged amino acids, such as lysine and arginine. The NLS may be appended by a linker.
[0159] In some embodiments, the NLS is a monopartite sequence. A monopartite NLS comprises a single cluster of positively charged or basic amino acids. In some embodiments, the monopartite NLS comprises a sequence of K-K / R-X-K / R, wherein X can be any amino acid. Exemplary monopartite NLS sequences include those from the SV40 large T-antigen, c-Myc, and TUS-proteins. In some embodiments, the NLS is a bipartite sequence. Bipartite NLSs comprise two clusters of basic amino acids, separated by a spacer of about 9-12 amino acids. Exemplary bipartite NLSs include the nuclear localization sequences of nucleoplasmin, EGL-12, or bipartite SV40.
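The K-K / R-X-K / R consensus recited above can be expressed as a short pattern search. The following is a minimal sketch, assuming one-letter amino acid codes; the function name is hypothetical, and the test peptide PKKKRKV is the commonly cited SV40 large T-antigen NLS. A match to the consensus does not by itself establish nuclear import.

```python
import re

# K, then K or R, then any residue, then K or R.
# A lookahead is used so that overlapping matches are all reported.
MONOPARTITE_NLS = re.compile(r"(?=(K[KR].[KR]))")


def find_monopartite_nls(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched 4-mer) for each K-K/R-X-K/R hit."""
    return [
        (m.start(), m.group(1))
        for m in MONOPARTITE_NLS.finditer(protein.upper())
    ]


hits = find_monopartite_nls("PKKKRKV")
# hits == [(1, "KKKR"), (2, "KKRK")]
```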
[0160] The engineered Cas proteins may also comprise a tag (e.g., 3xFLAG tag, an HA tag, a Myc tag, a poly-histidine tag, a SNAP-tag, a CLIP-tag, and the like). The tags may be at the N-terminus, a C-terminus, or a combination thereof of the engineered Cas proteins. In some embodiments, the tag may be adjacent, either upstream or downstream, to a nuclear localization sequence.
[0161] In some embodiments, the engineered Cas proteins may be fused with one or more (e.g., two, three, four, or more) protein transduction moieties. A protein transduction moiety is a polypeptide, polynucleotide, carbohydrate, or organic or inorganic compound that facilitates traversing a lipid bilayer, micelle, cell membrane, organelle membrane, or vesicle membrane. A protein transduction moiety attached to another molecule facilitates the molecule traversing a membrane, for example going from extracellular space to intracellular space, or cytosol to within an organelle.
[0162] Accordingly, in some embodiments, the engineered Cas proteins comprise one or more polypeptide transduction moieties. The polypeptide transduction moiety may be at a terminus of the engineered Cas protein (e.g., N-terminus or C-terminus), or alternatively be inserted internally. Examples of polypeptide transduction moieties include but are not limited to a minimal undecapeptide polypeptide transduction domain (corresponding to residues 47-57 of HIV-1 TAT comprising); a polyarginine sequence comprising a number of arginines sufficient to direct entry into a cell (e.g., 3, 4, 5, 6, 7, 8, 9, 10, or 10-50 arginines); a VP22 domain (Zender et al. (2002) Cancer Gene Ther. 9(6):489-96); a Drosophila Antennapedia protein transduction domain (Noguchi et al. (2003) Diabetes 52(7):1732-1737); a truncated human calcitonin peptide (Trehin et al. (2004) Pharm. Research 21:1248-1256); polylysine (Wender et al. (2000) Proc. Natl. Acad. Sci. USA 97:13003-13008); transportan, and the like.
[0163] Disclosed herein are systems and compositions that comprise an engineered Cas protein or fusion protein as described herein, or one or more nucleic acids encoding the engineered Cas protein or fusion protein. In some embodiments, the engineered Cas protein and the one or more effector domains are provided in the system as separate polypeptides or nucleic acid(s) encoding the same, in which each is linked to a half of a binding pair. Descriptions of the engineered Cas proteins, effector domains, and fusion proteins provided above are equally applicable to the systems and compositions.
[0164] In some embodiments, the compositions or systems further comprise at least one guide RNA (gRNA) or one or more nucleic acids comprising a sequence encoding the at least one gRNA. In instances when the composition or system comprises more than one gRNA, each may be encoded on the same or different nucleic acid as the other gRNA, together or separate from the engineered Cas protein, fusion protein, or effector domain.
[0165] The gRNA may contain separate crRNA and tracrRNA sequences (a dual guide RNA), have the crRNA and tracrRNA fused by a flexible linker, or be a single guide RNA (sgRNA). The terms “gRNA” and “guide RNA” may be used interchangeably throughout to represent any form of gRNA.
[0166] The “guide sequence” refers to the sequence that hybridizes to (complementary to, partially or completely) a target nucleic acid sequence (e.g., the genome in a host cell) and therefore determines the sequence specificity of the gRNA. The portion of the gRNA that hybridizes to the target nucleic acid (a target site) is generally between 10-40 nucleotides in length, but can be longer based on the specific target. gRNAs or sgRNA(s) used in the present disclosure can be between about 5 and 100 nucleotides long, or longer. The gRNA may be a non-naturally occurring or engineered gRNA.
[0167] In some embodiments, the guide sequence and scaffold sequence are separate. In such embodiments, the guide sequence is appended to an additional sequence that is complementary to a portion of the scaffold sequence and functions to hybridize with a portion of the scaffold sequence.
[0168] In some embodiments, the guide sequence is fused to a scaffold sequence (e.g., a tracrRNA). Such a chimeric gRNA is referred to as a single guide RNA (sgRNA). Exemplary scaffold sequences will be evident to one of skill in the art and can be found, for example, in
[0169] Jinek, et al. Science (2012) 337(6096):816-821, and Ran, et al. Nature Protocols (2013) 8:2281-2308, incorporated herein by reference in their entireties.
[0170] “Complementarity” refers to the ability of a nucleic acid to form hydrogen bond(s) with another nucleic acid sequence by either traditional Watson-Crick or other non-traditional types. A percent complementarity indicates the percentage of residues in a nucleic acid molecule, which can form hydrogen bonds (e.g., Watson-Crick base pairing) with a second nucleic acid sequence. Full complementarity is not necessarily required, provided there is sufficient complementarity to cause hybridization. There may be mismatches distal from the PAM.
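Percent complementarity as defined above can be computed directly. The following is a minimal sketch that assumes the two strands are of equal length, are both written 5' to 3', and are aligned without gaps. It counts only Watson-Crick pairs (including A-U for RNA), so non-traditional pairings such as G-U wobble are treated as mismatches. The function name is hypothetical.

```python
# Watson-Crick pairs, covering RNA-DNA and DNA-DNA duplexes.
WATSON_CRICK = {
    ("A", "U"), ("U", "A"), ("A", "T"), ("T", "A"),
    ("G", "C"), ("C", "G"),
}


def percent_complementarity(guide_5to3: str, target_5to3: str) -> float:
    """Percentage of guide residues that Watson-Crick pair with the target.

    Both strands are written 5'->3'; the target is reversed so that the
    two strands are compared antiparallel. Equal lengths are assumed.
    """
    guide = guide_5to3.upper()
    target = target_5to3.upper()[::-1]
    if not guide or len(guide) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    paired = sum((g, t) in WATSON_CRICK for g, t in zip(guide, target))
    return 100.0 * paired / len(guide)


full = percent_complementarity("GACU", "AGTC")     # every position pairs
partial = percent_complementarity("GACU", "AGTT")  # one mismatch in four
```

Here full evaluates to 100.0 and partial to 75.0. Whether a given percentage is sufficient for hybridization depends on the system and is not decided by this calculation.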
[0171] To facilitate gRNA design, many computational tools have been developed. Methods and tools for guide RNA design are discussed by Zhu (Frontiers in Biology, 10 (4) pp. 289-296 (2015)), which is incorporated by reference herein. Additionally, there are many publicly available software tools that can be used to facilitate the design of sgRNA(s); including but not limited to, Genscript Interactive CRISPR gRNA Design Tool, WU-CRISPR, and Broad Institute GPP sgRNA Designer. There are also publicly available pre-designed gRNA sequences to target many genes and locations within the genomes of many species (human, mouse, rat, zebrafish, C. elegans), including but not limited to, IDT DNA Predesigned Alt-R CRISPR-Cas9 guide RNAs, Addgene Validated gRNA Target Sequences, and GenScript Genome-wide gRNA databases.
[0172] The target sequence may or may not be flanked by a protospacer adjacent motif (PAM) sequence. In some embodiments, the target nucleic acid is flanked by a protospacer adjacent motif (PAM). A PAM site is a nucleotide sequence in proximity to a target sequence. PAM may be predicted using the methods disclosed herein. In certain embodiments, the disclosed Cas proteins cleave a target sequence if an appropriate PAM is present.
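A search for target sequences flanked by a given PAM can be sketched as follows. This is a minimal illustration, not the PAM prediction method of the disclosure. It assumes the PAM lies immediately 3' of the target on the scanned strand and defaults to NGG, the PAM commonly associated with SpCas9. Other Cas proteins may use different motifs or a 5' placement. The IUPAC table is the standard nucleotide ambiguity code; the function name is hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_pam_flanked_targets(dna: str, pam: str = "NGG", target_len: int = 20):
    """Yield (start, target, pam_found) for targets followed 3' by the PAM.

    Only the given strand is scanned; 3' PAM placement is an assumption
    that fits some Cas proteins but not others.
    """
    pam_regex = "".join(IUPAC[base] for base in pam.upper())
    # Lookahead so that overlapping candidate sites are all reported.
    pattern = re.compile(rf"(?=([ACGT]{{{target_len}}})({pam_regex}))")
    for m in pattern.finditer(dna.upper()):
        yield m.start(), m.group(1), m.group(2)


sites = list(find_pam_flanked_targets("ACGTACGGTT", pam="NGG", target_len=5))
# sites == [(0, "ACGTA", "CGG")]
```

Scanning the reverse complement in the same way would cover the opposite strand.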
[0173] In some embodiments, the compositions and systems may further comprise one or more additional genome engineering tools. For example, the compositions may further comprise nucleases, such as zinc finger nucleases (ZFNs) and / or transcription activator like effector nucleases (TALENs); transcriptional activators, transcriptional repressors, histone-modifying proteins, integrases, recombinases, and the like.
[0174] Nucleic acids or vectors may be used to propagate the engineered Cas proteins disclosed herein in an appropriate cell and / or to allow expression from the segment (e.g., an expression vector). The person of ordinary skill in the art would be aware of the various vectors available for propagation and expression of a nucleic acid sequence. The vector(s) and nucleic acid(s) can
[0175] be introduced into a cell that is capable of expressing the polypeptide encoded thereby, including any suitable prokaryotic or eukaryotic cell.
[0176] In certain embodiments, the nucleic acids are engineered for codon-optimization. It will be appreciated that altering codons to those most frequently used in the cells or subject of interest allows for maximum expression. Such modified nucleic acid sequences are commonly described in the art as “codon-optimized.” In some embodiments, the nucleic acid sequence is considered codon-optimized if at least about 60% (e.g., about 65%, about 70%, about 75%, about 80%, about 85%, about 90%, about 95%, or about 98%) of the codons encoded therein are preferred codons for the subject of interest.
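The 60% criterion can be checked with a short calculation. The following is a minimal sketch in which the preferred-codon table is hypothetical and deliberately tiny; a real table would be derived from codon-usage data for the cell or subject of interest. The function names are also illustrative.

```python
# Hypothetical preferred-codon table for illustration only; a real table
# would come from codon-usage data for the cell or subject of interest.
PREFERRED_CODONS = {"M": "ATG", "K": "AAG", "L": "CTG", "E": "GAG", "*": "TGA"}


def preferred_codon_fraction(cds: str, preferred: set[str]) -> float:
    """Fraction of codons in `cds` that belong to the preferred set."""
    cds = cds.upper()
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a non-zero multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return sum(codon in preferred for codon in codons) / len(codons)


def is_codon_optimized(cds: str, preferred: set[str],
                       threshold: float = 0.60) -> bool:
    """Apply the 'at least about 60%' criterion as a hard cutoff."""
    return preferred_codon_fraction(cds, preferred) >= threshold


preferred = set(PREFERRED_CODONS.values())
# Codons: ATG AAG CTG GAA TGA, four of five are in the preferred set.
fraction = preferred_codon_fraction("ATGAAGCTGGAATGA", preferred)
```

With this toy table the example sequence scores 0.8 and passes the default threshold. Raising threshold to 0.90 would apply one of the stricter values recited above.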
[0177] In one embodiment, a DNA segment encoding the engineered Cas proteins disclosed herein is contained in a plasmid vector that allows expression of the protein and subsequent isolation and purification of the protein produced by the recombinant vector. Accordingly, the engineered Cas proteins disclosed herein can be purified following expression, obtained by chemical synthesis, or obtained by recombinant methods.
[0178] The present disclosure further provides engineered, non-naturally occurring vectors and vector systems, which can encode the engineered Cas proteins, fusion proteins, or one or more or all of the components of the systems or compositions, as disclosed herein. The vector(s) can be introduced into a cell that is capable of expressing the polypeptide encoded thereby, including any suitable prokaryotic or eukaryotic cell.
[0179] To construct cells that express the engineered Cas proteins disclosed herein, expression vectors for stable or transient expression of the engineered Cas proteins may be constructed via methods as described herein or known in the art and introduced into cells. For example, nucleic acids encoding the engineered Cas proteins may be cloned into a suitable expression vector, such as a plasmid or a viral vector in operable linkage to a suitable promoter.
[0180] Vectors according to the present disclosure can be transformed, transfected, or otherwise introduced into a wide variety of host cells. Transfection refers to the taking up of a vector by a host cell whether or not any coding sequences are in fact expressed. Numerous methods of transfection are known to the ordinarily skilled artisan, for example, lipofectamine, calcium phosphate co-precipitation, electroporation, DEAE-dextran treatment, microinjection, viral infection, and other methods known in the art.
[0181] Further disclosed herein are compositions comprising the engineered Cas proteins described herein. The compositions may comprise excipients or pharmaceutically acceptable carriers. The choice of excipients or pharmaceutically acceptable carriers will depend on factors including, but not limited to, the particular mode of administration, the effect of the excipient on solubility and stability, and the nature of the dosage form. The compositions of the present invention will be readily apparent to those skilled in the art. Techniques and formulations may be found, for example, in Remington’s Pharmaceutical Sciences, 19th Edition (Mack Publishing Company, 1995).
[0182] The term “pharmaceutically acceptable carrier,” as used herein, means a non-toxic, inert solid, semi-solid or liquid filler, diluent, encapsulating material, surfactant, cyclodextrins or formulation auxiliary of any type. The phrase “pharmaceutically acceptable,” as used in connection with the present disclosure, refers to molecular entities and other ingredients of such compositions that are physiologically tolerable and do not typically produce untoward reactions when administered to a subject (e.g., a mammal, a human). Preferably, as used herein, the term “pharmaceutically acceptable” means approved by a regulatory agency of the federal or a state government or listed in the U.S. Pharmacopeia or other generally recognized pharmacopeia for use in mammals, and more particularly in humans. “Acceptable” means that the carrier is compatible with the composition (e.g., the nucleic acids, vectors, cells, proteins, or polypeptides) and does not negatively affect the subject to which the composition(s) are administered. Any of the compositions to be used in the present methods can comprise pharmaceutically acceptable carriers, excipients, or stabilizers in the form of lyophilized formulations or aqueous solutions.
[0183] Some examples of materials which can serve as pharmaceutically acceptable carriers are sugars such as, but not limited to, lactose, glucose and sucrose; starches such as, but not limited to, corn starch and potato starch; cellulose and its derivatives such as, but not limited to, sodium carboxymethyl cellulose, ethyl cellulose and cellulose acetate; powdered tragacanth; malt; gelatin; talc; excipients such as, but not limited to, cocoa butter and suppository waxes; oils such as, but not limited to, peanut oil, cottonseed oil, safflower oil, sesame oil, olive oil, corn oil and soybean oil; surfactants such as, but not limited to, cremophor EL, cremophor RH 60, Solutol HS 15 and polysorbate 80; cyclodextrins such as, but not limited to, alpha-CD, beta-CD, gamma-CD, HP-beta-CD, SBE-beta-CD; glycols such as propylene glycol; esters such as, but not limited to, ethyl oleate and ethyl laurate; agar; buffering agents such as, but not limited to, magnesium hydroxide and aluminum hydroxide; alginic acid; pyrogen-free water; isotonic saline; Ringer’s solution; ethyl alcohol; and phosphate buffer solutions. Other nontoxic compatible lubricants such as, but not limited to, sodium lauryl sulfate and magnesium stearate, as well as releasing agents, coating agents, preservatives and antioxidants, can also be present in the composition, according to the judgment of the formulator.
[0185] In some embodiments, the excipient or carrier is pharmaceutically acceptable.
[0186] Pharmaceutically acceptable carriers, including buffers, are well known in the art, and may comprise phosphate, citrate, and other organic acids; antioxidants including ascorbic acid and methionine; preservatives; low molecular weight polypeptides; proteins, such as serum albumin, gelatin, or immunoglobulins; amino acids; hydrophobic polymers; monosaccharides; disaccharides; and other carbohydrates; metal complexes; and / or non-ionic surfactants. See, e.g., Remington: The Science and Practice of Pharmacy 20th Ed. (2000) Lippincott Williams and Wilkins, Ed. K. E. Hoover.
[0187] The carrier may include a delivery vehicle. Delivery vehicles such as nanoparticle- and lipid-based delivery systems can be used. Exemplary delivery vehicles include, but are not limited to, microparticle compositions comprising a variety of polymers, liposomes or lipid nanoparticles, viral vectors, ribonucleoprotein (RNP) complexes, and the like.
[0188] The route by which the disclosed engineered Cas proteins are administered and the form of the composition will dictate the type of carrier to be used. The composition may be in a variety of forms, suitable, for example, for systemic administration (e.g., oral, rectal, nasal, sublingual, buccal, implants, or parenteral injections) or topical administration (e.g., dermal, pulmonary, nasal, aural, ocular, liposome delivery systems, or iontophoresis).
[0189] The disclosure also provides methods of modifying a target nucleic acid sequence. The phrase “modifying a nucleic acid sequence,” as used herein, refers to modifying at least one physical feature of a nucleic acid sequence of interest or a functional feature of a nucleic acid sequence. In some embodiments, the nucleic acid alterations include, for example, single or double strand breaks, deletion, or insertion of one or more nucleotides, and other modifications that affect the structural integrity or nucleotide sequence of the nucleic acid sequence.
[0190] In some embodiments, the methods introduce a single strand or double strand break in the target nucleic acid sequence. In this respect, the disclosed systems may direct cleavage of one or two strands of a target nucleic acid sequence, such as within a target genomic DNA sequence and / or within the complement of the target sequence.
[0192] In some embodiments, altering a nucleic acid sequence comprises a deletion. The deletion may be upstream or downstream of the Cas protein binding site (a so-called unidirectional deletion), or it may encompass sequences on either side of the binding site (a bidirectional deletion). The deleted nucleic acid sequence may be of any size. The methods can be used to delete nucleic acids from a target sequence in a host cell by cleaving the target sequence and allowing the host cell to repair the cleaved sequence. Deletion of a nucleic acid sequence in this manner can be used in a variety of applications, such as, for example, to remove disease-causing trinucleotide repeat sequences in neurons, to create gene knock-outs or knockdowns, and to generate mutations for disease models in research.
[0193] In some embodiments, the systems and methods described herein may be used to insert a gene or fragment thereof into a cell. For example, the systems or methods may include an exogenous nucleic acid molecule which encodes a gene protein (e.g., a nucleic acid encoding a gene or gene product) which is inserted at the site of nucleic acid cleavage.
[0194] In some embodiments, the methods do not introduce a single strand or double strand break in the target nucleic acid sequence. The methods may result in modifying the nucleic acid sequence as a result of one or more of the effector or functional domains as described above. For example, the methods may modulate the transcription of a target nucleic acid, may add or remove moieties from the target nucleic acid (e.g., methyl groups), may edit bases in the target nucleic acid (e.g., deaminate, depurinate, depyrimidinate), may unwind, replicate, or combine target nucleic acids, and / or may add or remove moieties from histones (e.g., methylate, demethylate, acetylate, deacetylate, ubiquitinate, phosphorylate, sumoylate) bound to the nucleic acid.
[0195] The methods comprise contacting a target nucleic acid sequence with an engineered Cas protein, fusion protein, composition, or system as described herein. In some embodiments, contacting a target nucleic acid sequence comprises introducing the engineered Cas protein, fusion protein, composition, or system into the cell. The engineered Cas protein, fusion protein, composition, or system may be introduced into eukaryotic or prokaryotic cells by methods known in the art, as described elsewhere herein.
[0196] The cell may be a prokaryotic cell, a plant cell, an insect cell, a vertebrate cell, an invertebrate cell, an animal cell, a mammalian cell, or a human cell. In some embodiments, the cell is a stem cell. In some embodiments, the cell is ex vivo (e.g., fresh isolate - early passage). In some cases, the cell is in vivo. In some cases, the cell is in culture or in vitro (e.g., immortalized cell line). Cells may be from established cell lines or they may be primary cells, where “primary cells,” “primary cell lines,” and “primary cultures” are used interchangeably herein to refer to cells and cell cultures that have been derived from a subject and allowed to grow in vitro for a limited number of passages of the culture. For example, primary cultures are cultures that may have been passaged 0 times, 1 time, 2 times, 4 times, 5 times, 10 times, or 15 times, but not enough times to go through the crisis stage. Typically, the primary cell lines are maintained for fewer than 10 passages in culture.
[0197] In some embodiments, introducing the engineered Cas protein, fusion protein, composition, or system into a cell comprises administering the engineered Cas protein, fusion protein, composition, or system to a subject. In some embodiments, the subject is human. The administering may comprise in vivo administration of the engineered Cas protein, fusion protein, composition, system, or a nucleic acid encoding the engineered Cas protein, fusion protein, or system. In alternative embodiments, an in vitro or ex vivo treated cell is transplanted into a subject.
[0198] In some embodiments, the target nucleic acid is a nucleic acid endogenous to a target cell. In some embodiments, the target nucleic acid is a genomic DNA sequence. The term “genomic,” as used herein, refers to a nucleic acid sequence (e.g., a gene or locus) that is located on a chromosome in a cell.
[0199] In some embodiments, the target nucleic acid encodes a gene or gene product. The term “gene product,” as used herein, refers to any biochemical product resulting from expression of a gene. Gene products may be RNA or protein. RNA gene products include non-coding RNA, such as tRNA, rRNA, microRNA (miRNA), and small interfering RNA (siRNA), and coding RNA, such as messenger RNA (mRNA). In some embodiments, the target nucleic acid sequence encodes a protein or polypeptide.
[0200] The present methods may be used in various bacterial hosts, including human pathogens that are medically important, and bacterial pests that are key targets within the agricultural industry, as well as antibiotic resistant versions thereof. In other embodiments, the method may be designed to target any gene or any set of genes, such as virulence or metabolic genes, for clinical and industrial applications. The present methods may be used to inactivate microbial genes. In some embodiments, the gene is an antibiotic resistance gene. The present methods may be used to treat a multi-drug resistant bacterial infection in a subject. The present methods may also be used for genomic engineering within complex bacterial consortia.
[0202] The methods described here also provide for treating a disease or disorder in a subject. The method may comprise administering to the subject, in vivo, an effective amount of the engineered Cas protein, fusion protein, composition, or system, or by transplantation of ex vivo treated cells. A “subject” or “patient” may be human or non-human and may include, for example, animal strains or species used as “model systems” for research purposes, such as a mouse model as described herein. Within the context of the present disclosure, the term “effective amount” refers to that quantity such that modification of the target nucleic acid is achieved.
[0203] In some embodiments, the systems and methods target one or more “disease-associated” genes. The term “disease-associated gene,” refers to any gene or polynucleotide whose gene products are expressed at an abnormal level or in an abnormal form in cells obtained from a disease-affected individual as compared with tissues or cells obtained from an individual not affected by the disease. A disease-associated gene may refer to a gene, the mutation or genetic variation of which is directly responsible or is in linkage disequilibrium with a gene(s) that is responsible for the etiology of a disease. In another embodiment, the target genomic DNA sequence can comprise a gene, the mutation of which contributes to a particular disease in combination with mutations in other genes.
[0204] When utilized as a method of treatment, the effective amount may depend on the particular condition being treated, the severity of the condition, the individual patient parameters including age, physical condition, size, gender and weight, the duration of the treatment, the nature of concurrent therapy (if any), the specific route of administration and like factors within the knowledge and expertise of the health practitioner. In some embodiments, the effective amount alleviates, relieves, ameliorates, improves, reduces the symptoms, or delays the progression of any disease or disorder in the subject. In some embodiments, the subject is a human.
[0205] In the context of the present disclosure insofar as it relates to any of the disease conditions recited herein, the terms “treat,” “treatment,” and the like mean to relieve or alleviate at least one symptom associated with such condition, or to slow or reverse the progression of such condition. Within the meaning of the present disclosure, the term “treat” also denotes to arrest, delay the onset (e.g., the period prior to clinical manifestation of a disease) and / or reduce the risk of developing or worsening a disease. For example, in connection with cancer the term “treat” may mean elimination or reduction of a patient's tumor burden, or prevention, delay, or inhibition of metastasis, etc.
[0207] The methods disclosed herein are also applicable to plants. For example, the methods can be used to generate novel engineered plants to improve agronomic traits, for example, herbicidal resistance, resistance to environmental stress, resistance to pests, etc.
[0208] The disclosed engineered Cas proteins, fusion proteins, compositions, and systems can be introduced into a plant, or a plant cell, seed, fruit, plant part, or propagation material of the plant. The term “plant propagation material” refers to generative parts of a plant, which can be used for the multiplication of the plant, and vegetative plant material such as cuttings and tubers (e.g., potatoes). In some embodiments, the propagation material is a root, a corm, a tuber, a bulb, a slip, a cutting of the plant, and a rhizome. Parts of a plant are any sections of a plant (e.g., roots, cotyledons, tendrils, leaves, flowers, seeds, stems, callus tissue, nuts, and fruit) that develop from a plant propagation material or grow at a later time. The methods described herein can be used on any plant part. Examples of plant parts include but are not limited to the root, corm, tuber, bulb, slip and rhizome.
[0209] Methods of introducing exogenous nucleic acids into plant cells are well known in the art. Such plant cells are considered transformed. DNA constructs can be introduced into plant cells by various methods, including, but not limited to PEG- or electroporation-mediated protoplast transformation, tissue culture or plant tissue transformation by biolistic bombardment, or the Agrobacterium-mediated transient and stable transformation.
[0210] The transformation can be a transient or a stable transformation. As used herein, the term “stable transformation” is intended to mean that the nucleotide construct introduced into a plant integrates into the genome of the plant and is capable of being inherited by the progeny thereof. “Transient transformation” is intended to mean that a polynucleotide is introduced into the plant and does not integrate into the genome of the plant or a polypeptide is introduced into a plant. In select embodiments, the nucleic acid encoding the RNA hairpin may be stably integrated into the plant genome, for example via Agrobacterium-mediated transformation.
[0211] As such, the disclosure also provides plants and plant propagation materials (e.g., plant cell, seed, fruit, or plant parts) produced using the methods disclosed herein. Genetically modified, transformed, or transgenic plants include plants into which an exogenous polynucleotide, e.g., a polynucleotide encoding the engineered Cas protein or fusion protein disclosed herein, has been introduced.
[0212] The methods disclosed herein are suitable for use with any plant, for example, grain crops, fruit crops, forage crops, root vegetable crops, leafy vegetable crops, flowering plants, conifers, trees, oil crops, plants used in phytoremediation, industrial crops, medicinal crops, laboratory model plants, and the like. As such, non-limiting examples of plants that may be used with the present methods include: grains, forage crops, fruits, vegetables, oil seed crops, palms, forestry, vines, maize (corn, Zea mays), banana, peanut, field peas, sunflower, tomato, canola, tobacco, wheat, barley, oats, potato, soybeans, cotton, carnations, sorghum, lupin, rice, rutabaga, celery, switchgrass, apple, petunias, Arabidopsis thaliana, Medicago truncatula, Medicago sativa, Brachypodium distachyon, Nicotiana benthamiana, or Setaria viridis.
[0213] EXAMPLES
[0214] Example 1
[0215] Landscape of PAM specificity across CRISPR-Cas systems
To overcome a lack of comprehensive datasets for CRISPR-Cas PAMs, extensive data mining was conducted to build the CRISPR-Cas Atlas (FIG. 1A) (Ruffolo et al. 2025 Nature, 645(8080), 518-525). This involved searching 26.2 Tbp of assembled microbial genomes and metagenomes, spanning diverse phyla and biomes, to uncover 1,246,163 CRISPR-Cas operons. These included 653,991 operons from Types I, II, and V in which an effector protein and CRISPR array were confidently identified. Types I, II, and V are DNA-targeting systems that utilize a PAM during target interference, in contrast to Types III and VI, which target RNA and avoid self-immunity using mechanisms that do not involve a PAM. Type IV systems, which specifically target plasmids, were not chosen as a focus due to the relatively low number of CRISPR operons identified.
[0216] To characterize PAM specificity, over 13 million CRISPR spacers were aligned to a database of over 16 million virus and plasmid genomes. This resulted in PAM predictions for 45,816 unique proteins (Type I: n=28,410; Type II: n=15,731; Type V: n=1,675), covering 71.6% of CRISPR-Cas operons (FIG. 1B). Each PAM prediction was supported by at least 10 mapped spacers, the minimum required for a low-variance estimate of the PAM, and subjected to noise filtering to remove low quality predictions. This dataset represents a 2.8x increase over the largest dataset of bioinformatically determined PAMs (Ciciani M, et al. Nat Commun. 2022;13: 6474) and a ~200x increase over the largest dataset of experimentally determined PAMs. Overall, 1,666 unique consensus PAMs were identified, with Type II systems having the highest PAM diversity and representing 81.6% of PAMs (Type I, n=284; Type II, n=1,360; Type V, n=70). The discovery of new PAMs appears to have plateaued for most CRISPR subtypes, indicating that the dataset captured the majority of PAM diversity in nature (FIG. 1C). The exception was Type V systems, which had the lowest PAM prediction rate (FIG. 1B) and for which PAM predictions were determined for only four of fifteen literature-reported subtypes (FIG. 1C).
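As an illustration of how a PAM estimate might be derived from mapped spacers, the following is a minimal Python sketch and not the pipeline used to build the dataset. The function names `pam_profile` and `consensus`, the example flanking sequences, and the 0.5 consensus threshold are assumptions introduced here; only the minimum of 10 mapped spacers comes from the description above.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"

def pam_profile(flanks, min_spacers=10):
    """Return per-position nucleotide frequencies, or None if support is too low.

    `flanks` holds equal-length sequences flanking each mapped protospacer.
    """
    if len(flanks) < min_spacers:
        return None  # too few mapped spacers for a low-variance estimate
    width = len(flanks[0])
    profile = []
    for pos in range(width):
        counts = Counter(seq[pos] for seq in flanks)
        # Ignore ambiguous bases; fall back to 1 to avoid dividing by zero.
        total = sum(counts[n] for n in NUCLEOTIDES) or 1
        profile.append({n: counts[n] / total for n in NUCLEOTIDES})
    return profile

def consensus(profile, threshold=0.5):
    """Call a nucleotide where its frequency exceeds `threshold`, else N."""
    calls = []
    for column in profile:
        best = max(NUCLEOTIDES, key=lambda n: column[n])
        calls.append(best if column[best] > threshold else "N")
    return "".join(calls)

# Hypothetical flanks: 12 protospacers followed by an NGG-like motif.
example_flanks = ["AGG", "CGG", "TGG", "GGG", "AGG", "CGG",
                  "TGG", "GGG", "AGG", "CGG", "TGG", "GGG"]
example_profile = pam_profile(example_flanks)
example_consensus = consensus(example_profile)
```

In this toy input the first position is evenly split across the four nucleotides and is therefore called as N, while the remaining two positions are called as G. Passing fewer than 10 flanks returns `None`, mirroring the support requirement.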
[0218] In addition to being exceptionally diverse, PAMs for Type II systems rapidly changed over short evolutionary distances (FIGS. 1D-1E), in contrast to Type I and Type V which were highly conserved. Phylogenetic trees of Cas8 and Cas12 confirmed high levels of PAM conservation (FIG. 5). While not fully understood, this difference in PAM flexibility likely reflects distinct evolutionary pressures on each CRISPR-Cas system, enabling Cas9 to rapidly adapt and counter evolving threats such as phages and mobile genetic elements.
[0219] Example 2
[0220] Machine learning framework to predict PAM specificity from Cas proteins
The relationship between Cas proteins and their PAMs was modeled (FIGS. 2A-2B). For each CRISPR-Cas type, the protein family responsible for PAM recognition during target interference was selected: Cas8 for Type I (or Cas10d for Type I-D), Cas9 for Type II, and Cas12 for Type V. The PAM was modeled as a sequence of 10 probability vectors over the nucleotides A, C, G, and T, located either upstream (Types I and V) or downstream (Type II) of the guide sequence. This approach assumed that the nucleotides in the PAM are conditionally independent, given the protein sequence, as experimental evidence shows that specific residues in the protein interact with individual nucleotides within the PAM.
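Under the conditional-independence assumption stated above, the PAM distribution factorizes over positions. The notation below is introduced here for illustration and does not appear in the source: $x$ denotes the protein sequence and $y_i \in \{A, C, G, T\}$ the nucleotide at PAM position $i$.

```latex
P(y_1, \dots, y_{10} \mid x) \;=\; \prod_{i=1}^{10} P(y_i \mid x),
\qquad
\sum_{n \in \{A, C, G, T\}} P(y_i = n \mid x) = 1 \quad \text{for each } i .
```

Each factor $P(y_i \mid x)$ corresponds to one of the 10 probability vectors, so a full PAM prediction can be stored as a 10 x 4 matrix whose rows each sum to one.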
[0221] Each Protein2PAM model took a Cas protein sequence as input and predicted the probability distribution for each nucleotide at 10 PAM positions. The Protein2PAM model architecture consisted of a pre-trained 650-million-parameter ESM-2 transformer encoder, followed by a 2-layer multi-layer perceptron (MLP) head responsible for predicting PAM nucleotide probabilities (FIG. 2A). The transformer captured key dependencies between amino acid residues relevant for PAM recognition, and the [CLS] token from the encoder’s final layer was passed to the MLP, which output a 10 x 4 matrix representing the predicted nucleotide probabilities at each PAM position. An encoder-decoder framework trained with a next-token cross-entropy loss was also evaluated, but it was slower and less accurate than the above described architecture. Protein2PAM models for Type I and V rapidly converged to their minimum loss, while the more variable PAM recognition in Type II systems led to longer training times and a higher final loss (FIG. 6). Alongside the main prediction models, auxiliary models were developed to distinguish high- from low-confidence PAM predictions. These models leverage protein language model embeddings and sequence identity to the training dataset to estimate prediction reliability (FIGS. 2B and 7).
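The shape of the prediction head described above can be sketched in plain Python. This is an illustrative sketch only: the ReLU activation, the per-position softmax, the toy dimensions, and the names `mlp_head` and `random_params` are assumptions, the weights are untrained placeholders, and the random vector stands in for the encoder's [CLS] embedding, which the sketch does not compute.

```python
import math
import random

PAM_LENGTH = 10   # positions predicted per protein (from the text)
ALPHABET = "ACGT"

def linear(vector, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over one PAM position."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_head(embedding, params):
    """Map a pooled protein embedding to a 10 x 4 probability matrix."""
    # First layer with ReLU (activation choice is an assumption).
    hidden = [max(0.0, h) for h in linear(embedding, params["w1"], params["b1"])]
    # Second layer produces 40 logits, reshaped to 10 rows of 4.
    logits = linear(hidden, params["w2"], params["b2"])
    width = len(ALPHABET)
    rows = [logits[i * width:(i + 1) * width] for i in range(PAM_LENGTH)]
    return [softmax(row) for row in rows]

def random_params(embed_dim, hidden_dim, seed=0):
    """Untrained placeholder weights; a real head would be learned."""
    rng = random.Random(seed)

    def matrix(n_out, n_in):
        return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    n_logits = PAM_LENGTH * len(ALPHABET)
    return {"w1": matrix(hidden_dim, embed_dim), "b1": [0.0] * hidden_dim,
            "w2": matrix(n_logits, hidden_dim), "b2": [0.0] * n_logits}

# Toy dimensions keep the sketch fast; a real embedding would be much wider.
rng = random.Random(1)
toy_embedding = [rng.gauss(0.0, 1.0) for _ in range(16)]
pam_matrix = mlp_head(toy_embedding, random_params(embed_dim=16, hidden_dim=8))
```

Normalizing each row separately reflects the modeling choice that every PAM position carries its own distribution over A, C, G, and T.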
[0223] Protein2PAM neural models demonstrated high accuracy in predicting PAMs for diverse CRISPR-Cas systems, with cosine similarity scores of 0.949 for Type I, 0.868 for Type II, and 0.955 for Type V systems (FIG. 2C). Accuracy scores were measured using the cosine similarity between PAMs predicted by the model and PAMs held out from the training dataset.
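A cosine-similarity accuracy score of this kind can be illustrated as follows. The sketch assumes that each PAM matrix is flattened row by row into a single vector before comparison; the source does not state whether similarity was computed on flattened matrices or per position, and the example matrices are hypothetical.

```python
import math

def cosine_similarity(pam_a, pam_b):
    """Cosine similarity between two PAM matrices, flattened row by row."""
    a = [p for row in pam_a for p in row]
    b = [p for row in pam_b for p in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Columns are ordered A, C, G, T. Hypothetical 3-position example (NGG-like).
uniform = [0.25, 0.25, 0.25, 0.25]
held_out = [uniform, [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
predicted = [uniform, [0.0, 0.0, 1.0, 0.0], [0.5, 0.0, 0.5, 0.0]]

perfect = cosine_similarity(held_out, held_out)
partial = cosine_similarity(held_out, predicted)
```

An identical prediction scores 1, while a prediction that splits the last position between A and G scores lower, so the metric penalizes both wrong nucleotides and diluted specificity.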
[0224] Protein2PAM models were considerably more accurate than a baseline method that predicted PAMs based on the nearest protein sequence in the training set (FIG. 7). For proteins with less than 90% sequence identity to the training data, the Type II model showed a drop in accuracy, while the Type I and V models remained relatively stable (mean accuracies of 0.727, 0.912, and 0.904, respectively; FIG. 2C), reflecting the dynamics observed during model training.
[0225] For comparison, a structure-based approach for PAM prediction was explored.
[0226] Specifically, ProseLM, a structure-conditioned protein language model, was tested for its ability to identify the correct PAM for 16 experimentally characterized Cas9s. Overall, ProseLM assigned similar likelihoods to structure models of Cas9 bound to the correct versus incorrect PAM, recovering the true PAM for only 4 of 16 orthologs (FIG. 12). These results suggest that a structure model, without explicit training on PAM recognition, is insufficient to capture the subtle nucleotide-recognition features essential for PAM prediction.
[0227] The optimal input sequence for modeling was investigated (FIG. 7). In Type II systems, where PAM recognition is primarily mediated by the PAM-interacting domain (PID), a custom Hidden Markov Model (HMM) database was used to identify PID regions and a separate model was trained on these sequences. In Type I systems, evidence suggests that Cas5 may also contribute to PAM recognition. However, including Cas5 alongside Cas8 / 10d in the model reduced accuracy, especially for sequences more distant from the training data. Based on these results, the Cas8 / 10d-only model for Type I, the PID-only model for Type II, and the full-sequence model for Type V were selected, as these configurations demonstrated the best generalization to new data.
[0229] Example 3
[0230] In silico mutagenesis pinpoints protein-PAM interactions Concordance with Experimentally Determined PAMs
[0231] To more robustly evaluate Protein2PAM, the models were applied to Cas proteins with experimentally determined PAMs (FIGS. 2D-2F). The models were first applied to 14 diverse Type I CRISPR and CAST systems experimentally characterized by Wimmer et al. (Mol Cell. 2022;82: 1210-1224.e6). Using Cas8 proteins as input, Protein2PAM successfully recapitulated PAMs for every active Type I system in the study, including subtypes I-B, I-C, I-E, and I-F and including proteins with as low as 25% amino acid identity to a training sequence.
[0233] PAMs were predicted for three groups of sequences on 112 Type II systems: 79 diverse Cas9 orthologs, 23 closely related Type II-C Cas9 orthologs, and 10 Cas9 proteins used in genome editing. For these datasets, Protein2PAM achieved median PAM accuracies of 0.767, 0.759, and 0.808, respectively, with an overall median prediction accuracy of 0.797 (FIG. 2D). Protein2PAM confidence scores closely tracked prediction accuracy (Spearman’s r = 0.649, p = 1.03 x 10^-10; FIG. 2E), and filtering out 58 low-confidence predictions improved the overall median accuracy to 0.883. Protein2PAM was less accurate for proteins distant from the training set (<70% sequence identity; n=36), achieving a median prediction accuracy of 0.638, with only 30% of Cas9 proteins in this group receiving highly accurate PAM predictions; nevertheless, this performance may still be sufficient to guide experimental screening.
[0234] Overall, these results highlight the model’s robust performance for Type I and II systems and demonstrate its ability to match experimental outcomes despite being trained exclusively on evolutionary data.
[0235] In contrast, for Type V systems, model performance was mixed (FIGS. 2D and 9).
[0236] Protein2PAM was tested on 45 proteins from 11 Cas12 subtypes characterized in separate studies. Only 12 proteins from three subtypes were within 40% identity of any training sequence, highlighting the rarity of many Cas12s and their CRISPR targets in nature. Protein2PAM performed moderately well for Cas12b and Cas12f (mean accuracy = 0.745, n = 14) but was less accurate for Cas12a (mean accuracy = 0.573, n = 18). Notably, for Cas12a the model tended to over-predict TTTN PAMs, likely due to an overrepresentation of these in the training dataset (FIG. 5).
[0238] Protein2PAM was tested on 20 engineered Cas9 and Cas12 proteins with altered PAM specificities from 10 studies. In most cases, the model predicted the same PAMs as their wild-type counterparts, with the exception of an Nme2Cas9 variant where Protein2PAM correctly predicted a shift from N4CC to N4CN. Because Protein2PAM was trained on natural CRISPR-Cas systems, the model may be insensitive to engineered mutations not observed in natural populations.
[0239] Example 4
[0240] Protein2PAM outperforms spacer-based PAM prediction
[0241] The performance of Protein2PAM was compared with PAMpredict, the most accurate bioinformatics tool for PAM prediction. Both tools were applied to predict PAMs for 11,381 Cas operons identified using CRISPRCasTyper, derived from genomic and metagenomic datasets not used for model training (FIGS. 3A-3B). Protein2PAM predicted PAMs using protein sequences (Cas8, Cas9, and Cas12), while PAMpredict relied on CRISPR spacers aligned to a database of viral and plasmid genomes.
[0242] Protein2PAM confidently predicted PAMs for 91.9% of 7,812 CRISPR-associated Cas operons, while PAMpredict confidently predicted PAMs for only 30.9% (FIG. 3C). The largest difference was observed for Type V systems, where Protein2PAM was over 16 times more likely to yield a high-confidence prediction (72.5% vs. 4.4%), primarily due to insufficient spacer matches in the viral database using PAMpredict. Protein2PAM additionally provided 3,065 high-confidence predictions for Cas operons without associated CRISPR arrays and overall produced 4.2 times more high-confidence predictions than PAMpredict (FIG. 3D). While pooling spacers from closely related CRISPR systems can improve PAMpredict’s sensitivity, this approach requires large-scale data mining, making it impractical for most researchers. Furthermore, it was found that Protein2PAM predictions closely aligned with those of PAMpredict when both tools reported high confidence in their respective predictions (FIG. 3E).
[0243] Lastly, the running times and computational requirements of both approaches were compared. PAMpredict was run on a Google Cloud instance with 16 vCPUs, taking 2.7 days to process 7,812 CRISPR-Cas operons (FIG. 3G). In contrast, Protein2PAM was run on a Google Cloud instance with one NVIDIA T4 GPU and completed the analysis of 11,381 Cas operons in just 5.9 minutes. Generating confidence estimates extended Protein2PAM’s runtime to 59.8 minutes. Together, these results demonstrate that Protein2PAM aligns with the current gold standard for PAM prediction, offers greater sensitivity, is independent of CRISPR spacer identification, and is considerably faster.
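Because the two tools processed different numbers of operons, the reported runtimes are easier to compare on a per-operon basis. The short calculation below uses only the figures reported above; the function name `minutes_per_operon` is introduced here for illustration, and the comparison is not like-for-like because the tools ran on different hardware (16 vCPUs versus one T4 GPU).

```python
MINUTES_PER_DAY = 24 * 60

def minutes_per_operon(total_minutes, n_operons):
    """Average wall-clock minutes spent per operon."""
    return total_minutes / n_operons

# Figures reported in the text.
pampredict = minutes_per_operon(2.7 * MINUTES_PER_DAY, 7_812)
protein2pam = minutes_per_operon(5.9, 11_381)
protein2pam_with_confidence = minutes_per_operon(59.8, 11_381)

# Ratio of per-operon runtimes (PAMpredict relative to Protein2PAM).
speedup = pampredict / protein2pam
speedup_with_confidence = pampredict / protein2pam_with_confidence
```

On these figures the per-operon runtime differs by roughly three orders of magnitude without confidence estimates and by roughly two orders of magnitude with them.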
[0245] Example 5
[0246] In silico mutagenesis pinpoints protein-PAM interactions
[0247] To evaluate the potential of PAM models for protein engineering, in silico mutagenesis was performed to predict point mutations that could alter PAM specificity (FIG. 4A). Previous studies have identified PAM-interacting residues from Cas9 crystal structures bound to target DNA or through experimental evolution of Cas9 variants capable of recognizing alternate PAMs. The PAM models herein offer an alternative to these structural and experimental methods, enabling rapid exploration of large mutational landscapes to identify protein-to-PAM interactions across diverse Cas proteins.
[0248] To comprehensively map the landscape of PAM interactions in Cas9, the models were retrained, incorporating in vitro characterized PAMs into the training data. The full-sequence PAM models were then used to predict the effect of over 8 million single amino acid substitutions on 336 phylogenetically diverse Cas9s including 15 previously used as genome editors. PAM-specifying mutations (PSMs) were defined as mutations causing a shift of >0.5 bits at one or more nucleotide positions. The vast majority of mutations had no effect on the PAM, with only 3,580 (0.04%) classified as a PSM (FIG. 4B). Visualization of the mutations revealed that large-effect mutations clustered within the PI domain and stood out as clear outliers relative to neighboring sites (FIG. 4D). Overall, 99.98% of PSMs were located within the annotated PI domain (FIG. 4B) even though this region comprised only 23.7% of the cumulative protein length. This suggests that the full-sequence models do not utilize information outside the PI domain, reaffirming the PI domain's role in determining PAM specificity.
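The single-substitution scan and the 0.5-bit criterion can be sketched as follows. This is one plausible reading rather than the method used in the source: the sketch interprets a "shift" as the change in per-position information content (2 bits minus Shannon entropy), and `toy_predictor` is a hypothetical stand-in for a trained model. Other definitions of the shift, such as a divergence between the wild-type and mutant distributions, would be equally consistent with the text.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PSM_THRESHOLD_BITS = 0.5  # threshold stated in the text

def information_content(column):
    """Information content in bits of one PAM position (2 bits minus entropy)."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0)
    return 2.0 - entropy

def max_shift_bits(pam_wt, pam_mut):
    """Largest per-position change in information content."""
    return max(abs(information_content(m) - information_content(w))
               for w, m in zip(pam_wt, pam_mut))

def scan_substitutions(sequence, predict_pam):
    """Score every single substitution; return (n_tested, list of PSMs)."""
    pam_wt = predict_pam(sequence)
    tested, psms = 0, []
    for index, wild_type in enumerate(sequence):
        for residue in AMINO_ACIDS:
            if residue == wild_type:
                continue
            mutant = sequence[:index] + residue + sequence[index + 1:]
            shift = max_shift_bits(pam_wt, predict_pam(mutant))
            tested += 1
            if shift > PSM_THRESHOLD_BITS:
                psms.append((index, wild_type, residue, shift))
    return tested, psms

def toy_predictor(sequence):
    """Hypothetical stand-in for a trained model: an arginine at index 2
    specifies G at the first of two PAM positions (columns are A, C, G, T)."""
    uniform = [0.25, 0.25, 0.25, 0.25]
    first = [0.01, 0.01, 0.97, 0.01] if sequence[2] == "R" else uniform
    return [first, uniform]

n_tested, psm_list = scan_substitutions("MKRQ", toy_predictor)
```

A sequence of length L yields 19 x L substitutions, which is why scanning hundreds of Cas9 proteins of roughly a thousand residues each produces millions of variants. In the toy example only substitutions at the arginine are flagged, analogous to PSMs concentrating at a small number of PAM-contacting residues.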
[0249] In contrast to Cas9, point mutations in Cas8 and Cas12 had minimal impact on the predicted PAM (FIG. 4B). After testing over 500,000 mutations across 40 phylogenetically diverse Cas8 and Cas12 proteins, only two PSMs were predicted. However, both mutations were located at the same amino acid position and were just above the threshold for classifying a mutation as a PSM. By contrast, PSMs were found for 70% of the 336 Cas9 proteins screened in silico. These findings highlight that PAM customization with Protein2PAM is limited for Type I and V systems due to high PAM conservation (FIG. 5) but may be effective for Type II systems.
[0251] To obtain higher resolution of the Type II PAM-interaction landscape, the PAM model trained specifically on Cas9’s PAM-interacting domain (PID-only model) was applied.
[0252] Compared to the full-sequence model, the PID-only model predicted 12.9% more PSMs and 30.5% more alternate PAMs (FIG. 4B). Next, this model was used to test all possible single amino acid insertions and deletions within the PI domain. However, indels were less effective than substitutions for PAM diversification (FIG. 4B) and were predicted to result in reduced fitness (FIG. 10). Interestingly, the in silico screen with Protein2PAM identified several Cas9 orthologs that appeared highly amenable to engineering with single substitutions. Point mutations resulted in at least eight alternate PAMs for well-studied Cas9s like Nme1Cas9 (Amrani et al. 2018 Genome Biol., 19(1): 1-25) and AceCas9 (Tsui et al. 2017 ACS Synth. Biol., 6(6): 1103-1113), as well as for five novel Type II-A and Type II-C Cas9s identified from human microbiome samples (FIG. 4E).
[0253] Next, it was examined whether any patterns emerged among the predicted PSMs. Several amino acids were highly overrepresented among PSMs (FIG. 4C), such as glutamine and arginine, which were 3.4x and 1.9x more likely to be found at a PSM position compared to the rest of the PI domain (χ² q-values < 5e-6). Furthermore, strong preferences were observed between glutamine, arginine, and aspartic acid and specific nucleotides (FIG. 4C), consistent with previously observed amino acid-nucleotide interactions, including the propensity for glutamine in Cas9 to recognize adenine in the major groove of DNA.
[0254] Protein2PAM predictions were compared to eight Cas9 proteins with crystal structures. Strikingly, many of the top-ranked mutations identified by Protein2PAM occurred at amino acids that form sequence-specific contacts with PAM DNA (FIG. 11). In total, 58.5% of the 159 identified PSMs were located at residues that form hydrogen bonds with PAM nucleotides (χ² p-value < 2.2e-16), and this percentage increased to 80.0% when considering only the top 50 PSMs with the largest effect. Notably, PSMs were not found at all PAM-interacting positions. For example, in SpCas9, Arg1333 forms a critical interaction with the guanine at the second PAM position (NGG), but due to its high conservation across orthologs, mutations here did not alter PAM recognition and were not highlighted by the model.
[0255] Together, these findings suggest that the herein described evolutionary-informed models accurately capture the key biophysical interactions governing protein-to-PAM recognition across diverse Cas9 proteins.
[0256] Several Nme1Cas9 variants containing one or more PAM-specifying mutations predicted by the model showed activity at their expected targets.
[0257] Protein2PAM models were used to design variants for specific user-defined PAMs.
[0258] While Nme1Cas9 has been used for genome editing due to its small size and high specificity, its long PAM (N4GNTT) limits the number of editable sites in the human genome. Protein2PAM was used to engineer Nme1Cas9 variants with broader PAM compatibility, specifically targeting three single-nucleotide PAMs (N4G, N4C, N7A) and three di-nucleotide PAMs (NMT, Nr>TA, N4CNNT). These targets were chosen after examining PAM nucleotide conservation patterns across a large number of Nme orthologs. These design objectives were not attainable through single point mutations, prompting employment of the Gibbs with Gradients Markov Chain Monte Carlo (MCMC) algorithm as a means of designing combinatorial variants. MCMC enabled iterative introduction of in silico mutations within the PI domain, steering each protein variant toward its target PAM. To increase sensitivity, a variant of the Protein2PAM model in which NmeCas9 orthologs were upweighted was trained and utilized. To preserve enzymatic function, candidate mutations were sampled from a multiple sequence alignment of closely related Nme orthologs, and those that significantly reduced the protein language model log-likelihood, as measured by ProGen2 fine-tuned on the CRISPR-Cas Atlas, were down-weighted. MCMC trajectories were terminated after 2000 steps, and variants at all points along the trajectories were considered.
[0259] The methods generated 340 variants and the top 22 variants targeting the six PAMs were selected. Top variants were selected based on their predicted similarity to the target PAM, protein language model log-likelihood, and number of mutations. Each variant contained an average of 11.6 mutations, with a range of 5 to 18. As above, a number of these Nme1Cas9 variants containing multiple mutations predicted by the model showed activity at their expected targets.
[0260] Protein2PAM successfully designed highly active variant enzymes with significant shifts in PAM specificity, aligning closely with model predictions. Protein2PAM can efficiently engineer novel PAMs for Cas enzymes without the need for large-scale experimental data, iterative screening, or reliance on protein structures and molecular modeling.
[0261] Beyond Nme1, the same Protein2PAM-guided MCMC protocol was applied to ten phylogenetically diverse Cas9 orthologs targeting alternate PAMs, achieving in silico success for 95 of these targets (63%), demonstrating that the approach generalizes across Cas9 families (FIG. 13).
[0262] Engineered Cas variants with broadened PAMs can suffer from reduced activity in human cells. To address this concern, the editing efficiency for NmeN4G.1 and NmeN4C.1 was assayed alongside SpCas9 at a set of 12 target sites compatible with PAMs for SpCas9 and each enzyme variant. Nuclease and sgRNA expressing plasmids were co-transfected in HEK293T cells, and DNA repair outcomes were measured after three days using next-generation sequencing (NGS) of amplicons. Across genomic targets, Protein2PAM-engineered variants showed mean indel rates similar to SpCas9 (NmeN4G.1: 52.2% vs 55.2%, p = 0.41; NmeN4C.1: 47.3% vs 52.2%, p = 0.76; Wilcoxon signed-rank test), indicating that broadening PAM recognition did not compromise nuclease activity in human cells.
[0263] In contrast to Protein2PAM-engineered variants, the directed evolution variant eNme2-C (Tony P Huang, et al., Nat. Biotechnol., 41(1):96-107, January 2023) showed substantially lower editing activity despite its expanded PAM compatibility (FIG. 14).
[0264] Materials and Methods
[0265] Curation of CRISPR-Cas sequences
[0266] Cas proteins and their associated CRISPR arrays were identified from the CRISPR-Cas Atlas, as previously described (Ruffolo et al. 2025 Nature, 645(8080), 518-525). The resource contains 1,246,163 CRISPR-Cas operons that were derived from 26.2 Tbp of genome and metagenomic assemblies. For modeling PAMs, the focus was on a subset of 653,991 operons from CRISPR Types I, II, and V where an effector protein could be confidently linked to a CRISPR array.
[0267] For Cas9 proteins, PI domains were also identified using a custom-built database of 123 profile HMMs. PI domain sequences were sourced for 9,161 diverse proteins (Gasiunas et al. 2020 Nature Communications 11 (1): 5512), de-replicated at 90% identity using CD-HIT v4.8.1, aligned using DIAMOND v2.1.6 (options: --query-cover 80 --subject-cover 80 --very-sensitive), and clustered using MCL v22.282 (options: -I 1.5). Multiple sequence alignments (MSAs) were created with FAMSA v2.2.2 and used as input to hmmbuild v3.4. HMMs were aligned to Cas9s from the CRISPR-Cas Atlas using hmmsearch with a 1e-5 E-value threshold.
[0268] For proteins lacking a valid PI domain alignment, the region downstream of RuvC III was instead extracted based on alignment to Pfam (PF18541).
[0269] PAM characterization
[0270] PAMs for CRISPR-Cas systems were characterized by aligning CRISPR spacers to viral and plasmid genomes and performing statistical analysis of regions flanking protospacers. To enhance the number of spacers associated with each Cas ortholog, CRISPR arrays were pooled from closely related Cas proteins. Cas proteins included Cas8 for Type I systems, Cas9 for Type II systems, and Cas12 for Type V systems. Proteins were clustered using MMseqs2 13.45111 with default parameters at 100%, 99%, or 98% amino acid identity (see below for details).
[0271] Each pool of spacers contained CRISPR arrays in varying orientations. To address this, CRISPR repeats associated with each Cas protein cluster were aligned using CD-HIT (options: cd-hit-est -c 0.95 -s 1.0) and CRISPR spacers were consistently oriented based on the orientation of aligned repeats. CD-HIT was also used to de-replicate CRISPR spacers within each cluster to minimize the impact of overrepresented sequences (cd-hit-est -c 0.90 -T 1 -s 0.90).
[0272] Oriented and de-replicated pools of CRISPR spacers were input to PAMpredict v1.0.2 (Ciciani et al. Nat Commun. 2022;13: 6474). This tool aligned spacers to a database of 1.6 million virus and plasmid genomes from IMG/VR v4 and IMG/PR, extracted 10-nt protospacer flanking regions, computed nucleotide frequencies, and identified sequence motifs. PAMs were detected upstream of protospacers for Type I and V systems and downstream for Type II systems. The strand of the PAM was determined based on the 10-nt region containing a conserved motif. A PAM was classified as high confidence based on two criteria. First, the PAM had to be identified from at least 10 unique protospacers, following the recommendation of Ciciani et al. (Nat Commun. 2022;13: 6474). Second, the PAM had to have a signal-to-noise ratio greater than 2.0 (FIG. 5). For Type II systems, the signal-to-noise ratio was calculated as the ratio of the maximum information content across the 10 nucleotide positions downstream of the protospacer (the PAM side) to the maximum information content across the 10 positions upstream. Conversely, for Type I and Type V systems, the ratio was calculated in the opposite direction.
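The signal-to-noise criterion above can be sketched as follows. This is a minimal illustration, not the PAMpredict implementation: the function names are hypothetical, information content is computed against a uniform background without any small-sample correction, and the `floor` guard against division by zero is an added assumption.

```python
import math
from collections import Counter
from typing import List, Sequence

NUCLEOTIDES = "ACGT"


def position_information(column: Sequence[str]) -> float:
    """Information content (bits) of one flank position versus a uniform background."""
    counts = Counter(base for base in column if base in NUCLEOTIDES)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 - entropy  # log2(4) = 2 bits is the maximum for DNA


def max_information(flanks: List[str]) -> float:
    """Maximum per-position information across equal-length flanking sequences."""
    return max(position_information(column) for column in zip(*flanks))


def signal_to_noise(pam_side: List[str], other_side: List[str], floor: float = 1e-6) -> float:
    """Ratio of the strongest position on the PAM side to the strongest on the other side.

    For Type II systems pam_side holds downstream flanks; for Type I and V, upstream flanks.
    """
    return max_information(pam_side) / max(max_information(other_side), floor)


# Toy 3-nt flanks (the text uses 10-nt flanks): a conserved G on the PAM side.
downstream = ["AGG", "CGG", "TGG", "GGG"]
upstream = ["AAC", "AGT", "CTA", "CCG"]
snr = signal_to_noise(downstream, upstream)
```

In this toy case the strongest PAM-side position carries 2 bits and the strongest opposite-side position carries 1 bit, so the ratio is 2.0, which would not pass a strict "greater than 2.0" threshold.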
[0273] Each Cas protein was associated with multiple PAM predictions due to the varying MMseqs2 clustering thresholds. Clustering at lower identity thresholds increases the number of CRISPR spacers linked to a protein, improving the likelihood of PAM detection, but also increasing the chances of pooling Cas variants with different PAM specificities.
[0274] To mitigate this, the PAM prediction was selected at the highest clustering threshold (% identity) that met the prediction quality criteria.
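The selection rule above can be written as a short sketch, assuming each clustering threshold yields one PAM call summarized by its protospacer count and signal-to-noise ratio. The `PamCall` structure and function names are hypothetical; the thresholds (at least 10 protospacers, ratio greater than 2.0) are the quality criteria stated earlier in this section.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class PamCall(NamedTuple):
    pam: str
    n_protospacers: int
    signal_to_noise: float


def is_high_confidence(call: PamCall, min_protospacers: int = 10, min_snr: float = 2.0) -> bool:
    """Apply the two quality criteria: enough unique protospacers and a strong motif."""
    return call.n_protospacers >= min_protospacers and call.signal_to_noise > min_snr


def select_pam(calls_by_identity: Dict[int, PamCall]) -> Optional[Tuple[int, PamCall]]:
    """Return the call from the highest % identity threshold that passes, or None."""
    for identity in sorted(calls_by_identity, reverse=True):
        call = calls_by_identity[identity]
        if is_high_confidence(call):
            return identity, call
    return None


# Toy example: the 100% cluster has too few protospacers, so the 99% cluster is used.
calls = {
    100: PamCall("NGG", 4, 3.1),
    99: PamCall("NGG", 12, 2.6),
    98: PamCall("NGG", 40, 2.9),
}
chosen = select_pam(calls)
```

Preferring the strictest passing threshold keeps the spacer pool as close as possible to the protein of interest while still meeting the quality criteria.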
[0275] The PAM dataset generated herein was compared to two previously published studies. In the study by Ciciani et al., PAMs were bioinformatically quantified for Cas9 proteins clustered at 98% amino acid identity. Using this threshold, Ciciani et al. identified PAMs for 2,546 Cas9 protein clusters with at least 10 mapped spacers, whereas the PAM dataset herein had PAMs for 7,229 Cas9s clustered at 98% identity (2.8x increase). Similarly, Gasiunas et al. experimentally characterized PAMs for 79 unique Cas9 proteins, compared to the 15,731 unique Cas9 proteins with bioinformatically characterized PAMs herein (199x increase).
[0276] Training the Protein2PAM model
[0277] Both the Protein2PAM and PAM confidence models consist of a 650 million parameter transformer encoder with an MLP head, which has one hidden layer with embedding dimension 1280 (matching that of the transformer encoder). In all cases, the models were evaluated using 10-fold cross validation, ensuring that the validation data came from different 90% identity clusters than the training data.
[0278] For Protein2PAM, the MLP head takes as input the [CLS] embedding vector from the transformer encoder, and has an output dimension of 40. The output is reshaped into a 10x4 matrix and transformed into a sequence of probability distributions over nucleotides with a softmax (FIG. 2A). The transformer encoder was initialized with the pretrained ESM-2 model, but its weights received gradient updates during training. Each model was trained to maximize the sum of the negative cross entropy and PAM similarity between true and predicted PAMs, using PyTorch Distributed Data Parallel on machines with 2 A100 GPUs. Each training batch contained up to 2500 tokens, and the gradient was accumulated for 4 steps before updating model weights. The Adam optimizer with a learning rate of 0.0001 (all other hyperparameters set to PyTorch defaults) was used. Training was stopped when the validation loss did not improve for 5000 steps, and the checkpoints with the best validation loss were used.
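A cluster-aware split of the kind described above can be sketched as follows. This is an illustrative approach, not the procedure used in the disclosure: whole clusters are assigned greedily to the currently smallest fold so that no 90% identity cluster is shared between training and validation data. The function name and the protein-to-cluster mapping are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def cluster_kfold(cluster_of: Dict[str, str], k: int = 10) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, val_ids) pairs in which no cluster spans both sets."""
    members: Dict[str, List[str]] = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    # Assign whole clusters, largest first, to whichever fold is currently smallest.
    folds: List[List[str]] = [[] for _ in range(k)]
    for cluster_id in sorted(members, key=lambda c: (-len(members[c]), c)):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[cluster_id])

    for i in range(k):
        validation = folds[i]
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, validation


# Toy example: 103 proteins spread over 25 clusters.
toy_clusters = {f"p{i}": f"c{i % 25}" for i in range(103)}
splits = list(cluster_kfold(toy_clusters, k=10))
```

Splitting by cluster rather than by individual sequence prevents near-identical homologs from appearing on both sides of a fold, which would otherwise inflate validation performance.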
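The reshaping of the 40-dimensional head output into per-position nucleotide distributions can be illustrated with a small, dependency-free sketch. The row-major layout (four consecutive values per PAM position) and the function name are assumptions; the disclosure specifies only that the output is reshaped to 10x4 and passed through a softmax.

```python
import math
from typing import List


def logits_to_pam(logits: List[float], n_positions: int = 10, n_nucleotides: int = 4) -> List[List[float]]:
    """Reshape a flat logit vector to (positions x nucleotides) and softmax each row."""
    if len(logits) != n_positions * n_nucleotides:
        raise ValueError("expected n_positions * n_nucleotides logits")
    pam = []
    for i in range(n_positions):
        row = logits[i * n_nucleotides:(i + 1) * n_nucleotides]
        shift = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - shift) for x in row]
        total = sum(exps)
        pam.append([e / total for e in exps])
    return pam


# Toy example: all-zero logits give a uniform distribution (an "N") at every position.
uniform_pam = logits_to_pam([0.0] * 40)
```

Each row of the result sums to 1 and can be read as the predicted probability of A, C, G, and T at that PAM position (the nucleotide ordering is likewise an assumption here).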
[0279] Estimating Protein2PAM prediction confidence
[0280] For Protein2PAM confidence estimation, the percent identity was first calculated between the input sequence and its 10 nearest neighbors in the training data.
[0281] These 10 percent identities were encoded into a 200-dimensional vector using piecewise linear embeddings (PLE), which was then projected into a 1280-dimensional space via a linear layer. This vector was added to the 1280-dimensional [CLS] embedding from the transformer encoder before passing the combined vector through a 2-layer MLP. A sigmoid activation was applied to the MLP output, constraining it to the [0, 1] range, where it could be interpreted as the predicted PAM similarity (FIG. 2B).
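The encoding of 10 percent identities into a 200-dimensional vector can be sketched with a piecewise linear encoding. The choice of 20 uniform bins over 0 to 100% per identity value (10 x 20 = 200) is an assumption made so the dimensions match the text; the actual bin count and bin edges are not stated in the disclosure, and the function names are hypothetical.

```python
from typing import List, Sequence


def ple_encode(value: float, edges: Sequence[float]) -> List[float]:
    """Piecewise linear encoding of one scalar.

    Bins entirely below the value are 1, bins entirely above are 0,
    and the bin containing the value holds its fractional position.
    """
    encoded = []
    for low, high in zip(edges[:-1], edges[1:]):
        fraction = (value - low) / (high - low)
        encoded.append(min(1.0, max(0.0, fraction)))
    return encoded


def encode_identities(identities: Sequence[float], n_bins: int = 20) -> List[float]:
    """Concatenate the encodings of each percent identity (0 to 100) into one vector."""
    edges = [100.0 * i / n_bins for i in range(n_bins + 1)]
    vector: List[float] = []
    for percent_identity in identities:
        vector.extend(ple_encode(percent_identity, edges))
    return vector


# Toy example: identities to the 10 nearest training neighbors.
neighbors = [98.0, 95.5, 93.0, 90.0, 88.2, 85.0, 80.0, 76.4, 70.0, 65.0]
features = encode_identities(neighbors)
```

In the described architecture this vector would then pass through a learned linear layer to reach 1280 dimensions; that projection is not shown here.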
[0282] For each CRISPR-Cas type, a CasEncoder, a 650-million parameter transformer initialized from a pretrained ESM-2 checkpoint, was first trained. The CasEncoder was fine-tuned using the masked language modeling loss on proteins from the CRISPR-Cas database to learn a consistent representation of the relevant protein family. Once trained, the CasEncoder weights were frozen, and the proteins were encoded using their summary token embeddings. The percent identities between each sequence and the 10 most similar sequences in the training dataset were computed, embedded using PLE, and combined with the CasEncoder embeddings.
[0283] The combined embeddings were passed through a 2-layer MLP, which was trained by minimizing the mean squared error between the predicted PAM similarity and the accuracy of Protein2PAM’s prediction. The Adam optimizer with a learning rate of 0.0003 and a batch size of 1024 was used. The best-performing confidence model was selected based on the checkpoint with the lowest validation loss.
[0284] Quantifying PAM similarity
[0285] The similarity between two PAMs was quantified based on their information content rather than raw probability distributions. The information content of a probability distribution is measured using the relative entropy between P and a background distribution Q, where Q is uniformly distributed across A, C, G, and T. Specifically, the information content of nucleotide n at position i is calculated as:
[0286]
[0287] Given two 10x4 PAM information matrices, I(1) and I(2), the cosine similarity between their vectorized forms provides a natural similarity metric.
[0288] However, this metric fails to distinguish between positions where one PAM has low information (denoted as N) and the other has high information.
[0289] To address this, each position in the information matrix was augmented with the information content of a fictitious N nucleotide. This N content is high when the original PAM has low information at that position but the comparison PAM has high information, and low when both PAMs have either high or low information.
[0290]
[0291]
[0292]
[0293] Finally, the cosine similarity between the vectorized forms of the augmented information matrices, J(1) and J(2), was computed to obtain the PAM similarity. This augmented similarity metric is referenced in the disclosed Examples.
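The augmented similarity metric can be illustrated with a sketch built on two explicitly assumed formulas, since the corresponding equations are not reproduced in this text. Assumption 1: the information content of nucleotide n at position i is its probability multiplied by the relative entropy of that position against a uniform background. Assumption 2: the fictitious N content of a PAM at a position is the amount, floored at zero, by which the comparison PAM's positional information exceeds its own. Both forms satisfy the qualitative description above but may differ from the formulas actually used.

```python
import math
from typing import List

Matrix = List[List[float]]


def positional_information(p_row: List[float], background: float = 0.25) -> float:
    """Relative entropy (bits) of one position against a uniform background."""
    return sum(p * math.log2(p / background) for p in p_row if p > 0)


def information_matrix(pam: Matrix) -> Matrix:
    """ASSUMED form: I[i][n] = P[i][n] * relative entropy of position i."""
    return [[p * positional_information(row) for p in row] for row in pam]


def augment(own: Matrix, other: Matrix) -> Matrix:
    """ASSUMED form: append N content = max(0, other's information - own information)."""
    return [row + [max(0.0, sum(other_row) - sum(row))] for row, other_row in zip(own, other)]


def cosine(u: List[float], v: List[float]) -> float:
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for fully uninformative PAMs
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def pam_similarity(pam_a: Matrix, pam_b: Matrix) -> float:
    """Cosine similarity between the flattened augmented information matrices."""
    info_a, info_b = information_matrix(pam_a), information_matrix(pam_b)
    aug_a, aug_b = augment(info_a, info_b), augment(info_b, info_a)
    return cosine([x for row in aug_a for x in row], [x for row in aug_b for x in row])


# Toy 3-position PAMs with columns ordered A, C, G, T.
N = [0.25, 0.25, 0.25, 0.25]
G = [0.0, 0.0, 1.0, 0.0]
ngg = [N, G, G]
nng = [N, N, G]
same = pam_similarity(ngg, ngg)
different = pam_similarity(ngg, nng)
```

Under these assumptions, NGG compared with itself scores 1.0, while NGG compared with NNG scores 0.5. The plain cosine similarity of the unaugmented matrices for the same pair would be about 0.71, so the N column penalizes the mismatch between an informative and an uninformative position more strongly.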
[0294] Benchmarking on experimental datasets
[0295] Protein2PAM models were evaluated on experimentally determined PAMs for diverse CRISPR systems. For Type I systems, Protein2PAM was applied to 14 Cas8 proteins with characterized PAMs. For Type II systems, Protein2PAM was applied to three different datasets. First, Protein2PAM was applied to 79 Cas9 proteins spanning the phylogeny. Second, Protein2PAM was applied to 23 Cas9 proteins from closely related Type II-C systems. Lastly, Protein2PAM was applied to 10 Cas9 proteins used as genome editors, including: SpCas9 (Karvelis et al. 2015 Genome Biology 16 (November):253), Nme1Cas9 (Liu et al. 2022 Communications Biology 5 (1): 1-7), Nme2Cas9 (Liu et al. 2022 Communications Biology 5 (1): 1-7), St1Cas9 (Karvelis et al. 2015 Genome Biology 16 (November):253), St3Cas9 (Karvelis et al. 2015 Genome Biology 16 (November):253), AceCas9 (Hand et al. 2019 Methods in Enzymology, 616:265-88), FnCas9 (Hisato Hirano, et al., 2016 Cell 164 (5): 950-61), FrCas9 (Cui et al. 2022 Nature Communications 13 (1): 1-12), CjCas9 (Kim et al. 2017 Nature Communications 8 (1): 1-12), and CdCas9 (Hirano et al. 2019 Nature Communications 10 (1): 1-11).
[0296] For Type V systems, Protein2PAM was applied to 45 Cas12s with experimentally characterized PAMs, including: Cas12a (Zetsche et al. 2015 Cell 163 (3): 759-71 and Zetsche et al. 2020 Keio Journal of Medicine 69 (3): 59-65), Cas12b (Strecker, et al. 2019 Nature Communications 10 (1): 1-8), Cas12d (Harrington et al. 2020 Molecular Cell 79 (3): 416-24; Burstein et al. 2016 Nature 542 (7640): 237-41), Cas12f (Karvelis et al. 2020 Nucleic Acids Research 48 (9): 5016), Cas12h (Yan et al. 2019 Science, 363(6422), 88-91), Cas12i (Yan et al. 2019 Science, 363(6422), 88-91), Cas12j (Wang et al. 2023 Science Advances, 9(6), eabo6405), Cas12k (Strecker, et al. 2019 Science, 365(6448), 48-53), Cas12l (Urbaitis et al. 2022 EMBO Reports 23 (12): e55481), Cas12m (Wu et al. 2022 Molecular Cell 82 (23): 4487-4502.e7), and Cas-lambda (Al-Shayeb et al. 2022 Cell 185 (24): 4574-86.e16).
[0297] Protein2PAM was applied to 20 engineered proteins from the literature with altered PAM specificities, including variants of: SpCas9 (Nishimasu et al. 2018 Science 361 (6408): 1259-62; Walton et al. 2020 Science 368 (6488): 290-96; Kleinstiver, et al. 2015 Nature 523 (7561): 481-85), SaCas9 (Kleinstiver, et al. 2015 Nature Biotechnology 33 (12): 1293-98), St1Cas9 (Zhang et al. 2020 Nature Catalysis 3 (10): 813-23), Nme2Cas9 (T. P. Huang et al. 2023 Nature Biotechnology 41 (1): 96-107), CjCas9 (Schmidheini et al. 2024 Nature Chemical Biology 20 (3): 333-43), and Cas12a (Kleinstiver et al. 2019 Nature Biotechnology 37 (3): 276-82; Tran et al. 2021 Nucleic Acids 24 (June):40-53; Gao et al. 2017 Nature Biotechnology 35 (8): 789-92).
[0298] MCMC-based design of PAM-customized Nme1 variants
[0299] To design proteins that targeted specific PAMs, the Gibbs with Gradients Markov Chain Monte Carlo (MCMC) algorithm (Grathwohl et al. 2021 International Conference on Machine Learning, 3831-41. PMLR) was leveraged. MCMC provides a stochastic method that iteratively introduces in silico mutations to a protein sequence that are expected to improve its score according to an oracle model. Two components were averaged to compute a score for a protein sequence: the Protein2PAM loss between the sequence’s predicted PAM and a target PAM, and the language modeling loss of ProGen2 (Nijkamp et al. 2023 Cell Systems 14 (11): 968-78.e3) fine-tuned on the CRISPR-Cas Atlas (Ruffolo et al. 2025 Nature, 645(8080), 518-525).
[0300] To increase sensitivity, a variant of the Protein2PAM model was trained and utilized where NmeCas9 orthologs were upweighted in the training data. To preserve enzymatic function, only candidate mutations in the PI domain were sampled from a multiple sequence alignment of NmeCas9 orthologs that had at least 50% identity to Nme1Cas9. All MCMC trajectories were run for 2000 steps. For each target PAM, variants were selected from any point along the trajectories, choosing those that individually minimized the Protein2PAM loss, the fine-tuned ProGen2 model’s loss, or the aggregate score.
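The overall shape of such a design loop can be illustrated with a deliberately simplified sampler. This sketch uses a plain Metropolis acceptance rule with uniform proposals, not the gradient-informed proposals of Gibbs with Gradients, and replaces both the Protein2PAM loss and the ProGen2 loss with a toy stand-in. All names, the temperature, and the toy loss are hypothetical; only the averaging of two loss terms, the restriction of proposals to alignment-observed residues, and the retention of every point on the trajectory follow the text.

```python
import math
import random
from typing import Callable, Dict, List, Tuple


def toy_loss(seq: str, wild_type: str = "QHN", target: str = "AHG") -> float:
    """Stand-in oracle: average of a 'PAM' term and a 'language model' term."""
    pam_loss = sum(a != b for a, b in zip(seq, target))           # distance to the design target
    lm_loss = 0.1 * sum(a != b for a, b in zip(seq, wild_type))   # penalty per mutation
    return 0.5 * (pam_loss + lm_loss)


def metropolis_design(
    seq: str,
    allowed: Dict[int, str],
    loss: Callable[[str], float],
    steps: int = 2000,
    temperature: float = 0.1,
    seed: int = 0,
) -> List[Tuple[str, float]]:
    """Run a Metropolis chain over point mutations and return every visited state.

    `allowed` maps 0-based positions to residues observed in an alignment;
    each set should include the wild-type residue so proposals stay symmetric.
    """
    rng = random.Random(seed)
    current, current_loss = seq, loss(seq)
    trajectory = [(current, current_loss)]
    positions = sorted(allowed)
    for _ in range(steps):
        pos = rng.choice(positions)
        residue = rng.choice(allowed[pos])
        if residue != current[pos]:
            proposal = current[:pos] + residue + current[pos + 1:]
            proposal_loss = loss(proposal)
            delta = proposal_loss - current_loss
            # Always accept improvements; accept worse moves with Boltzmann probability.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_loss = proposal, proposal_loss
        trajectory.append((current, current_loss))
    return trajectory


# Toy run: three residues, two of them mutable.
trajectory = metropolis_design("QHN", {0: "QAR", 2: "NGA"}, toy_loss, steps=500, seed=1)
best_sequence, best_loss = min(trajectory, key=lambda state: state[1])
```

Keeping the whole trajectory, rather than only the final state, allows variants to be picked afterwards by different criteria, as described above for the Protein2PAM loss, the language model loss, and the aggregate score.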
[0301] To design proteins targeting diverse PAMs, a combinatorial mutagenesis approach was adopted. A minimal set of 102 PSMs that shifted Nme1Cas9’s PAM preference by at least 0.25 bits was identified.
[0302] Notably, 73 of these mutations were concentrated at seven key sites, including three (Q981, H1024, N1029) that form hydrogen bonds with PAM DNA in Nme1Cas9’s WT structure. Pairwise combinations of these 102 mutations yielded 9,464 double, 132,838 triple, 2,530,861 quadruple, and 34,777,000 quintuple mutants which were predicted to collectively target 177 alternative PAMs. Enzymes were selected to maximize PAM diversity, minimize mutation count, and maximize pLM log likelihoods as measured by ProGen2 fine-tuned on the CRISPR-Cas Atlas.
[0303] Cloning Nme1 variants
[0304] pCMV-Nme1Cas9-P2A-EGFP was synthesized by Twist Biosciences and designed to harbor the wild-type Nme1Cas9 sequence and serve as the expression plasmid. Computationally designed PID variants were synthesized by Twist Biosciences and ordered as an arrayed, lyophilized, double-stranded DNA fragment library. PID variant fragments contained two flanking regions with complementary overlap to the wild-type WED domain (57 bp complementarity) and the pCMV plasmid (30 bp complementarity) for downstream cloning. To generate the arrayed plasmid variant library, the expression plasmid was first linearized with inverse PCR to remove the wild-type PID with the following recipe: 25 µL 2x Platinum SuperFi II PCR master mix (Invitrogen), 10 µM forward primer, 10 µM reverse primer, 10 ng of template, and nuclease-free water to a final volume of 25 µL per reaction. The following PCR parameters were then applied: initial denaturation at 98 °C for 30 s, followed by 15 cycles of denaturation at 98 °C for 10 s, annealing at 60 °C for 30 s, extension at 72 °C for 4.2 min, and a final extension at 72 °C for 5 min. The PCR linearized backbone was then incubated with DpnI (NEB) at 37 °C for 1 hr to digest the residual template.
[0305] PID variant fragments were resuspended to a concentration of 20 ng/µL (50 µL) in 10 mM Tris-Cl, pH 8.5, then introduced into the linearized expression plasmid through Gibson HiFi assembly (NEB) with a 10:1 insert to vector ratio. Reactions were incubated at 50 °C for 20 min, then transformed into NEB® Turbo Competent E. coli cells. Colony PCR was performed to screen clones for proper assembly (REDTaq® Ready Mix, Sigma-Aldrich), and passing clones were mini-prepped (Qiagen) and confirmed with whole plasmid sequencing.
[0306] Screening on-target activity at human genomic loci
[0307] Indel rates for two highly active PAM-altered Nme1Cas9 variants, NmeN4G.1 (SEQ ID NO: 18) and NmeN4C.1 (SEQ ID NO: 10), as well as the eNme2-C experimentally-evolved literature variant (Tony P Huang, et al., Nat. Biotechnol., 41(1):96-107, January 2023) were characterized using three PAMs for each enzyme (N4GCAA, N4GTAA, N4GAAA for NmeN4G.1; N4CCTA, N4CCTT, N4CCTG for NmeN4C.1).
[0308] eNme2-C was screened at compatible targets selected for NmeN4C.1. Because these PAMs are incompatible with the Nme1Cas9 wild type (N4GATT), direct indel-rate comparisons at shared targets were not possible. Instead, editing efficiency was benchmarked against SpCas9, whose distinct NGG PAM enabled selection of overlapping chromosomal targets recognized by both SpCas9 and the Nme variants. Target sites were drawn from dbGuide, yielding 24 loci (12 per Nme variant), and guides were designed with a 20-nt SpCas9 or 24-nt Nme1Cas9 sgRNA scaffold as appropriate.
[0309] Codon-optimized nuclease and sgRNA expression plasmids (Twist Biosciences) were transfected separately into HEK293T cells (ATCC, authenticated and mycoplasma-free). Cas9 constructs were driven by a CMV promoter and carried a C-terminal 1x SV40 NLS and P2A-GFP tag for expression monitoring; sgRNAs were expressed from a human U6 promoter and encoded an RFP transfection reporter. Cells were maintained at 37 °C and 5% CO2 in high-glucose DMEM with 4 mM L-glutamine, 1 mM sodium pyruvate, and 10% FBS. Twenty-four hours before transfection, 1.6 x 10^4 cells/well were plated in 96-well plates. Each well received 50 ng sgRNA and 50 ng nuclease plasmid mixed with 0.4 µL TransIT®-VirusGen (Mirus) in Opti-MEM and incubated 72 h.
[0310] Cells were lysed in 25 µL buffer (100 mM Tris-HCl pH 7.5, 0.05% SDS, 25 µg/mL Proteinase K) for 1 h at 37 °C, followed by 25 µL nuclease-free water and 15 min at 98 °C. Five microliters of lysate (~Ez104cells) served as template for a two-step PCR: first with locus-specific primers bearing Illumina adapter sequences (Invitrogen Platinum SuperFi II), then with outer primers appending unique 10-nt indices (Q5® Hot Start, NEB; xGen UDI plates, IDT).
[0311] Indexed amplicons were pooled and sequenced (NovaSeq X, 2 x 151 bp, Seqmatic) alongside reference, non-edited controls. Reads were quality-trimmed with fastp and analyzed with CRISPResso2 v2.2.24 using default parameters to quantify indel rates from processed reads, amplicon sequences, and spacer sequences.
[0312] Table 1. Sequences
Claims
We claim:
1. A method for predicting a Protospacer Adjacent Motif (PAM) sequence for a Cas protein, the method comprising determining the probability distribution for each nucleotide at two or more PAM positions for a Cas protein of interest using a machine learning model trained on a database of Cas protein sequences and corresponding PAM sequences.
2. The method of claim 1, wherein structural information of the Cas protein of interest is not input into the machine learning model.
3. The method of claim 1 or 2, wherein the machine learning model comprises a trained transformer encoder and a 2-layer multi-layer perceptron (MLP) head.
4. The method of any of claims 1-3, wherein Cas protein sequences comprise PAM-interacting domain (PID) sequences.
5. The method of any of claims 1-4, wherein the machine learning model is weighted with sequences similar to sequence of the Cas protein of interest.
6. The method of any of claims 1-5, further comprising generating one or more variants of the Cas protein of interest and predicting the PAM sequence for each of the one or more variants.
7. The method of claim 6, wherein the one or more variants comprise an amino acid sequence having one or more amino acid substitutions, deletions, or additions as compared to the Cas protein of interest amino acid sequence.
8. The method of claim 6 or 7, wherein the one or more variant sequences are generated iteratively over multiple rounds of generating and predicting, wherein predicted PAMs from each round are leveraged to target a desired PAM sequence.
9. The method of any of claims 1-8, further comprising distinguishing high-confidence and low-confidence PAM predictions.
10. The method of claim 9, wherein the distinguishing comprises quantifying confidence based on protein language model (pLM) embeddings and sequence identity as compared to sequences used to train the machine learning model.
11. The method of any of claims 1-10, wherein the method further comprises identifying PAM-specifying mutations (PSMs) in the Cas protein of interest.
12. The method of any of claims 1-11, wherein the method further comprises predicting PAM-specifying mutations in the Cas protein of interest.
13. Non-transitory computer readable medium containing instructions that, when executed by at least one processor, perform the method of any of claims 1-12.
14. A system comprising:at least one processor; andat least one non-transitory computer readable medium of claim 13.
15. An engineered Cas protein having an amino acid sequence with at least 70% identity to SEQ ID NO: 1, wherein the engineered Cas protein comprises one or more amino acid substitution, deletion, and / or addition as compared to SEQ ID NO: 1, wherein the one or more amino acid substitution, deletion, and / or addition modify Protospacer Adjacent Motif (PAM) recognition and / or specificity.
16. The engineered Cas protein of claim 15, wherein the engineered Cas protein comprises one or more amino acid substitutions, deletions, or additions in the PAM-interacting domain.
17. The engineered Cas protein of claim 15 or 16, wherein the engineered Cas protein comprises one or more amino acid substitutions at positions: 938, 940, 954, 956, 957, 960, 966, 979, 980, 981, 982, 987, 989, 990, 995, 996, 1000, 1003, 1007, 1008, 1010, 1011, 1014, 1015, 1016, 1017, 1018, 1020, 1021, 1022, 1023, 1024, 1026, 1027, 1029, 1030, 1031, 1032, 1034, 1037, 1038, 1039, 1040, 1041, 1044, 1046, 1048, 1050, 1053, 1055, and 1056, relative to SEQ ID NO: 1.
18. The engineered Cas protein of any of claims 15-17, wherein the engineered Cas protein comprises one or more amino acid substitutions selected from: H938D or H938G; G940A; E954C; G956A; D957G; Y960H; S966A; V979I or V979C; V980K or V980I; Q981A, Q981T, Q981R, or Q981G; G982F or G982Y; D987N; Q989T; L990V; F995Y; N996Q, N996E, or N996K; S1000V; P1003S; V1007I; E1008K; I1010K or I1010T; T1011A; A1014N, A1014D, or A1014K; R1015S; M1016F, M1016K, M1016R, or M1016I; F1017L; G1018A; F1020Y; A1021S, A1021V, or A1021I; S1022G or S1022N; C1023L or C1023F; H1024D or H1024E; G1026E, G1026S, or G1026A; T1027N or T1027D; N1029A, N1029G, N1029H, or N1029R; I1030F; N1031S or N1031D; I1032L; I1034T; L1037K; D1038E; H1039K or H1039N; K1040T or K1040S; I1041K or I1041V; N1044D; I1046V; E1048R or E1048Q; I1050V; K1053Q; A1055L; and L1056V, relative to SEQ ID NO: 1.
19. The engineered Cas protein of any of claims 15-18, wherein the engineered Cas protein comprises one or more amino acid substitutions at positions: 981, 1024, and 1029, relative to SEQ ID NO: 1.
20. The engineered Cas protein of any of claims 15-19, wherein the engineered Cas protein comprises one or more amino acid substitutions selected from: Q981A; H1024D or H1024E; and N1029A or N1029G, relative to SEQ ID NO: 1.
21. The engineered Cas protein of claims 19 or 20, wherein the engineered Cas protein further comprises one or more amino acid substitutions at positions: 957, 982, 1026, and 1048.
22. The engineered Cas protein of claim 21, wherein the engineered Cas protein further comprises one or more amino acid substitutions selected from: D957G; G982F or G982Y; G1026S or G1026A; and E1048R or E1048Q.
23. The engineered Cas protein of any of claims 15-22, wherein the engineered Cas protein comprises an amino acid sequence with at least 70% identity to any of SEQ ID NOs: 2-31.
24. The engineered Cas protein of any of claims 15-23, wherein the engineered Cas protein comprises an amino acid sequence with at least 90% identity to any of SEQ ID NOs: 2-31.
25. The engineered Cas protein of any of claims 15-24, wherein the engineered Cas protein comprises any of SEQ ID NOs: 2-31.
26. A fusion protein comprising an engineered Cas protein of any of claims 15-25 and one or more effector domains.
27. A nucleic acid encoding the engineered Cas protein of any of claims 15-25 or a fusion protein thereof.
28. A system comprising an engineered Cas protein of any of claims 15-25, and / or a fusion protein thereof, and at least one guide RNA, or one or more nucleic acids encoding thereof.
29. A method of modifying a target nucleic acid comprising contacting the target nucleic acid with an engineered Cas protein of any of claims 15-25, and / or a fusion protein thereof, and at least one guide RNA, or one or more nucleic acids encoding thereof.