Novel DNA-binding polypeptides containing tandem repeats

By developing novel TALE-like peptides fused with nucleases to form a recombinant gene editing system, the problems of insufficient specificity and binding affinity in existing technologies have been solved, enabling more efficient genome modification and transcriptional regulation.

CN122249559A · Pending · Publication Date: 2026-06-19 · INST OF ZOOLOGY CHINESE ACAD OF SCI +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: INST OF ZOOLOGY CHINESE ACAD OF SCI
Filing Date: 2024-11-21
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing gene editing technologies such as ZFN and TALEN have shortcomings in terms of specificity and binding affinity, and there is a need to develop improved TALE-like peptides to improve the precision and efficiency of genome modification.

Method used

A novel TALE-like polypeptide is provided, containing specific amino acid sequences and domains, capable of binding to DNA with sequence specificity and fusing with nucleases to form a recombinant gene editing system for introducing double-strand breaks and modifying genomic sequences in target polynucleotides.

Benefits of technology

It improves the specificity and efficiency of gene editing, enabling more precise genome modification and transcriptional regulation, and enhances the function of gene editing systems.

✦ Generated by Eureka AI based on patent content.


Abstract

This document provides a novel tandem repeat-containing protein (TRP) comprising an N-terminal region, two or more tandem repeats, and a C-terminal region. It also provides a fusion polypeptide comprising the TRP of the present application fused with a fusion partner, a gene editing system comprising that fusion polypeptide, and a method for gene editing using that fusion polypeptide.

Description

Technical Field

[0001] This application relates to molecular biology. Specifically, this application provides novel TR-containing proteins (TRPs) and their applications in molecular biology, such as gene editing and artificial transcription factor (TF) libraries.

Background Technology

[0002] Site-specific systems have made it possible to modify genomes at predetermined locations. Genome editing technologies, such as meganucleases, designed zinc finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and the CRISPR/Cas system, can be used to generate targeted genome modifications.

[0003] At the heart of gene editing technology lies a DNA-binding protein that couples to functional domains such as nucleases, deaminases, or reverse transcriptases. Of these four technologies, three are based on protein-DNA recognition, with ZNF and TALE belonging to the tandem repeat (TR) protein category. Of the vast number of TRs existing in nature, only a small fraction have been experimentally studied, and even fewer have been developed into biotechnological tools.

[0004] Each of these technologies has its own set of components and mechanisms. While the CRISPR/Cas system has gained popularity recently, each technology has its advantages and disadvantages, making it suitable for different applications. Therefore, there remains a need to develop improved TALENs, for example by searching for novel TALE-like peptides with improved specificity and/or binding affinity.

Summary of the Invention

[0005] To meet the above requirements, the inventors identified many TR-containing proteins (TRPs) that can bind to DNA with sequence specificity. Among them, some TRPs from the STAR family exhibit TALE-like structures and are therefore referred to below as TALE-like peptides. The TALE-like peptides identified in this disclosure provide higher affinity than classic TALE peptides, which have smaller size and fewer TRs.

[0006] In a first aspect, this disclosure provides a TRP capable of binding to DNA with sequence specificity. In some embodiments, the TRP is a programmed / programmable TALE-like polypeptide comprising an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises an amino acid sequence selected from formulas I to VII and XXII to XXIV:
FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I)
where X1 is G or S, X2 and X3 form the repeat variable diresidue (RVD), X4 is G or S, X5 is A or Q, X6 is H or Q, X7 is A or T, and X8 is K or R;
FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II)
where X1 is T or K, X2 is Q, E or R, X3 is A or G, X4 and X5 form the RVD, X6 is N or D, X7 is T or K, X8 is A or V, X9 is T, R or K, X10 is H or Y, X11 is A, P, or Q, X12 is T or R, X13 is A, D, or T, and X14 is A or V;
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III)
where X1X2 is the RVD;
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV)
where X1X2 is the RVD;
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V)
where X1X2 is the RVD;
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI)
where X1X2 is the RVD;
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII)
where X1X2 is the RVD;
FX1X2DNLX3KVAAX4X5GGX6QALLDKX7PX8LRX9AG (XXII)
where X1 is G or S, X2 is N or P, X3 is V or I, X4 and X5 form the RVD, X6 is A or Q, X7 is G or S, X8 is A or T, and X9 is Q or N;
GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII)
where X1X2 is the RVD; and
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV)
where X1X2 is the RVD.
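As an illustration only, formula I can be read as a sequence pattern: fixed residues are literal characters, and each variable position is a small set of allowed residues. The Python sketch below encodes formula I as a regular expression and reports the RVD of each matching repeat. The function name `find_repeats`, the assumption that any two standard amino acids may occupy the RVD positions, and the two-repeat example sequence are all hypothetical and are not taken from the sequence listing.

```python
import re

# The 20 standard amino acids. Formula I leaves the RVD positions (X2X3) open,
# so allowing any standard residue there is an assumption of this sketch.
AA = "ACDEFGHIKLMNPQRSTVWY"

# Formula I: FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG
FORMULA_I = re.compile(
    "F[GS]"                    # F, then X1 = G or S
    "NDNLVKVAA"                # fixed scaffold residues
    f"(?P<rvd>[{AA}]{{2}})"    # X2X3 = RVD (two residues)
    "G[GS][AQ][HQ]"            # G, then X4, X5, X6
    "ALQ[AT]"                  # ALQ, then X7 = A or T
    "LLD[KR]"                  # LLD, then X8 = K or R
    "GPALRQAG"                 # fixed scaffold residues
)


def find_repeats(protein: str):
    """Return (start, rvd) for every formula I repeat found in `protein`."""
    return [(m.start(), m.group("rvd")) for m in FORMULA_I.finditer(protein)]


# Hypothetical sequence of two repeats, constructed to satisfy formula I.
example = (
    "FGNDNLVKVAAHDGGAHALQALLDKGPALRQAG"
    "FSNDNLVKVAANGGSQQALQTLLDRGPALRQAG"
)
print(find_repeats(example))
```

The other formulas could be encoded the same way by swapping in their fixed residues and allowed sets.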

[0007] In some embodiments, the TRP is a DNA-binding polypeptide comprising an amino acid sequence selected from SEQ ID NO: 3-6 and 9-22.

[0008] In a second aspect, this disclosure provides a fusion polypeptide comprising a TRP capable of binding DNA in a sequence-specific manner. In some embodiments, the TRP is a programmed / programmable TALE-like or DNA-binding polypeptide of this disclosure fused with a fusion partner.

[0009] In a third aspect, this disclosure provides a recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising the TRP of this disclosure fused with a fusion partner, wherein the fusion partner is a polypeptide providing nuclease activity.

[0010] In a fourth aspect, this disclosure provides a composition comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising the TRP of this disclosure fused with a fusion partner.

[0011] In a fifth aspect, this disclosure provides a method for introducing double-strand breaks in a target polynucleotide, comprising the step of contacting the polynucleotide with a recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a TRP of this disclosure fused with a fusion partner, wherein the fusion partner is a polypeptide providing nuclease activity.

[0012] This disclosure also provides a method for modifying a genome sequence in a cell, including the step of introducing a recombinant gene editing system into the cell, the recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a TRP of this disclosure fused with a fusion partner, wherein the fusion partner is a polypeptide providing nuclease activity.

[0013] In a sixth aspect, this disclosure provides a randomized library of artificial transcription factors (TFs) comprising multiple cells, each cell carrying a vector containing a nucleotide sequence encoding a polypeptide of this disclosure, wherein tandem repeats are randomized among the cells.

[0014] In a seventh aspect, this disclosure provides a method for screening DNA-binding TRPs, comprising: i) retrieving proteins of TR families from a database; ii) predicting TRs in the proteins and screening for TRPs; and iii) building a model and using the model to analyze the screened TRPs in order to obtain predicted TRPs that can bind to DNA.

Brief description of the attached figures

[0015] Figure 1 demonstrates the identification and characterization of tandem repeat proteins.

[0016] A) Computational flow for identifying known and novel TRs in a non-redundant protein database. B) Schematic diagram of TR-related terminology and parameters. C) Unit number and period distribution for all TRs with period ≤ 80 and repeat units ≤ 80. Dashed boxes indicate the proportion of short TRs with period ≤ 10 and repeat units ≤ 10. The upper box shows the categorical distribution of different periods. Arrows indicate the location of ZNFs. D) Count distribution of known and novel TRs at the protein and cluster levels. Protein and cluster counts are shown on a log10 scale. TRs of different period lengths are color-coded. E) Cluster size distribution of known and novel TRs. Each color represents a different cluster size range. The absolute number of clusters is shown on the pie chart. F) The left plot shows the cluster size distribution and genus-level assembly counts. Specifically, for TRs within each cluster size range, we retrieved genus-level assembly counts available on NCBI for each protein-derived species and then plotted the corresponding distribution. The right plot shows the Pearson correlation analysis between cluster size and the median genus-level assembly count.

[0017] Figure 2 demonstrates the DNA binding prediction model and candidate prioritization strategy.

[0018] A) The training dataset was generated, and the PLM-DBPPred architecture integrates three NLP models (ProteinBERT, ProtTrans, and ESM) to predict DBP. B) PLM-DBPPred was compared with other DBP classifiers using ROC-AUC analysis on the PDB600 test dataset. The AUC value of each classifier is shown. C) Evaluation results of various DBP prediction tools. Each column is standardized individually, and performance is presented using color keys. D) Strategies for selecting TR clusters and candidates. A Venn diagram shows the list of enriched clusters generated by different functional annotation methods, where DBP, DRD, and DBGO represent DNA-binding proteins, DNA-related domains, and DNA-binding-related GO annotations predicted by PLM-DBPPred, respectively.

[0019] Figure 3 demonstrates the experimental screening and validation design.

[0020] A) Experimental screening process. For in vivo B1H screening, all 100 candidate genes were cloned and inserted into B1H protein expression vectors, followed by screening. For in vitro screening, the expression levels of all 100 candidate genes in *E. coli* were first assessed. Highly expressed proteins were then purified. The purified proteins were then screened using BLI to test their DNA-binding activity. Candidates exhibiting DNA-binding activity were further subjected to SELEX, and the enriched libraries were sequenced to obtain enriched binding motifs. The numbers in each step indicate the number of candidates identified and selected in that screening process. A Venn diagram illustrates the positive candidates identified by different screening methods. Light blue and dark blue represent positive candidates with DNA-binding activity and those with specific DNA-binding activity, respectively. B) Positive candidates identified by the B1H platform. For each candidate, we provide information including the protein accession number (each given a custom name, shown in parentheses), the species of origin, the repeating weblogo, and the enriched binding motif. C) Positive candidates identified by the SELEX platform. D) Experimental design for validation and characterization of positive candidates. We established four independent validation methods to verify the binding activity of proteins to enriched motifs. These methods included an in vitro platform (EMSA), two GFP-based platforms in *E. coli*, and a CUT&Tag platform in 293T cells (Materials and Methods). Proteins validated in two or more independent assays were considered true positives. Analysis of positive candidates was then extended to three further analyses: sequence and structural characterization, family-level characterization, and prediction of natural biological functions.

[0021] Figure 4 shows the characteristics of the STAR family.

[0022] A) EMSA validation results for Pq STAR1 and Asp STAR1. Protein concentration varied, while probe concentration remained constant at 40 nM. The abbreviation "C" indicates a competing probe, used at a concentration 100 times that of the specific probe. B) GFP activation results for Pq STAR1 and Asp STAR1. MutODD, a non-DNA-binding protein, was used as a negative control. "P" and "R" represent the protein and reporter plasmid, respectively. Two-tailed Student's t-test, n=3 biologically independent samples. Data are expressed as mean ± standard deviation. ***, p value < 0.001. C) BLI binding results for Pq STAR1 and Asp STAR1. Pq STAR1 protein concentrations were 0.25 nM, 0.5 nM, 1 nM, 2 nM, and 4 nM. Asp STAR1 protein concentrations were 3.125 nM, 6.25 nM, 12.5 nM, 25 nM, and 50 nM. D) Negative-staining electron micrographs of the Pq STAR1-DNA complex and the Asp STAR1-DNA complex, with their representative 2D classification averages. The top scale bar is 50 nm, and the bottom scale bar is 10 nm. E) Protein architecture of Pq STAR1, Asp STAR1, and AvrBs3. TS, T3SS. NLS, nuclear localization signal. TAD, transcriptional activation domain. The predicted gene encoding Pq STAR1 lacks a start codon; therefore, we placed a "?" at the N-terminus of the protein. F) Sequence comparison of Pq STAR1, Asp STAR1, and AvrBs3. G) Tertiary structures of the Pq STAR1, Asp STAR1, and AvrBs3 proteins. The structures of Pq STAR1 and Asp STAR1 were predicted by AlphaFold2. The structure of AvrBs3 was obtained from the PDB database (2YPF). N and C represent the N-terminus and C-terminus, respectively. Repeat regions and RVDs are colored blue and red, respectively. Enlarged dashed boxes show the structure of a single repeat unit. H) Multiple sequence alignment and phylogenetic tree of Pq STAR homologs and Asp STAR homologs. The phylogenetic tree was constructed using full-length protein sequences. In each alignment column, the degree of conservation is indicated by the background color of the amino acid. Bootstrapping confidence scores are indicated by the size of the circles. Different colored squares in the middle represent different taxa. Color and identity coordinates are shown on the right. I) Distribution of unit number in classical TALE, Pq STAR, and Asp STAR. Classical TALE represents TALE from the genus Xanthomonas. Pq STAR represents STAR from the genus Pseudomonas. Asp STAR represents STAR from the genus *Tetranychus*. Statistical analysis was performed using one-way ANOVA. *, p value < 0.05; ****, p value < 0.0001. J) Reprogramming of the DNA-binding specificity of Pq STAR1 and Asp STAR1. For Pq STAR1, WT, 23m, and 67m represent the wild-type protein, the RVD2- and RVD3-modified variant, and the RVD6- and RVD7-modified variant, respectively. For Asp STAR1, WT, 3m, and 678m represent the wild-type protein, the RVD3-modified variant, and the RVD6-, RVD7-, and RVD8-modified variant, respectively. RVD changes are highlighted in red. The corresponding nucleotide bases under each RVD are deduced from the binding code of classical TALE. In EMSA assays, probe 1 corresponds to the target sequence of the wild-type protein, while probes 2 and 3 represent the target sequences of the two variants. K) GFP activation results of artificial STAR. MutODD is a non-DNA-binding protein used as a negative control. "P" and "R" represent the protein and reporter plasmid, respectively. Two-tailed Student's t-test, n=3 biologically independent samples. Data are expressed as mean ± standard deviation. **, p value < 0.01; ***, p value < 0.001; ****, p value < 0.0001. L) CUT&Tag results for Pq STAR1 and Asp STAR1 in 293T cells. Two biologically independent samples were used.

[0023] Figure 5 shows the characterization of STAR-based transcriptional regulators.

[0024] A) Schematic diagram of STAR-based transcriptional regulators. B) Reported TF-binding motifs and the RVD design of STAR. C) Principal component analysis (PCA) of RNA-seq samples. PCA was performed using rlog-transformed count data from DESeq2. D) Transcripts significantly upregulated (red) and downregulated (blue) after STAR-based ATF transduction (p value < 0.05 and |log2 fold change| > 1). E) Motifs enriched in the promoter regions of upregulated genes. F) Gene set enrichment analysis (GSEA) plots of two previously characterized NF-κB-regulated gene sets. Gene sets 1 and 2 are from Cormier et al. (2023, NF-κB signaling activation and roles in thyroid cancers: implication of MAP3K14/NIK. Oncogenesis, 12, 55) and the TFLink database (Liska et al., 2022, TFLink: an integrated gateway to access transcription factor–target gene interactions for multiple species. Database, 2022, baac083). NES, normalized enrichment score. G) Heatmap of normalized read counts of genes listed in F (combination of gene sets 1 and 2). Statistical significance analysis was performed using the Wilcoxon signed-rank test. *, p value < 0.05; **, p value < 0.01. H) GSEA plots of two previously characterized SMAD4-regulated gene sets. Gene sets 1 and 2 were obtained from the MSigDB database (SMAD4_Q6) (Liberzon et al., 2015, The molecular signatures database hallmark gene set collection. Cell Systems, 1, 417-425) and the TFLink database. I) Heatmap of normalized read counts of genes listed in H (combination of gene sets 1 and 2). Statistical significance was analyzed using the Wilcoxon signed-rank test. **, p value < 0.01; ***, p value < 0.001.

[0025] Figure 6 shows the characteristics of the MOON family.

[0026] A) EMSA results for Sp MOON1. GC-rich sequences were used as controls. Protein concentration varied, while probe concentration remained constant at 40 nM. B) GFP activation verification of Sp MOON1. MutODD, a non-DNA-binding protein, was used as a negative control. "P" and "R" represent the protein and reporter plasmid, respectively. Two-tailed Student's t-test, n=3 biologically independent samples. Data are expressed as mean ± standard deviation. *, p value < 0.05; **, p value < 0.01. C) BLI results for Sp MOON1. Protein concentrations were set as follows: 9.8 nM, 14.8 nM, 22.2 nM, 33.3 nM, and 50 nM. D) Negative-staining electron micrographs of the Sp MOON1-DNA complex and their representative 2D classification averages. The top image scale bar is 50 nm, and the bottom image scale bar is 10 nm. E) Protein architectures of Sp MOON1, KI67_HUMAN, and KI67_MOUSE. The architectures of KI67_HUMAN and KI67_MOUSE are derived from the literature (Sobecki et al., 2016, The cell proliferation antigen Ki-67 organizes heterochromatin. eLife, 5, e13722). F) Sequence comparison of Sp MOON1, KI67_HUMAN, and KI67_MOUSE. G) Predicted tertiary structure of the Sp MOON1 protein. The FHA domain, the PP1_bind domain, and the repeat regions are colored yellow, orange, and blue, respectively. Enlarged dashed boxes show the structure of a single repeat unit. H) Domain architecture, repeats, and secondary structure weblogo of the MOON family. Functional domains FHA and PP1_bind are shown. The C-terminal region (indicated by a red rectangle) contains repeat units. The number of units within the MOON family ranges from 6 to 36. I) Multiple sequence alignment and phylogenetic tree of proteins in the MOON family. The phylogenetic tree was constructed using full-length protein sequences. The names of experimentally tested Sp MOON1 homologs are labeled. In each alignment column, the degree of conservation is indicated by the background color of the amino acid. Bootstrapping confidence scores are indicated by the size of the circles. Different colored squares in the middle represent different taxa. Color and identity coordinates are shown on the right. J) Species, repeat weblogo, and enriched motifs of Sp MOON2. K) BLI assessment of Sp MOON2 binding to AT/GC-rich probes. Protein concentrations were set as follows: 9.8 nM, 14.8 nM, 22.2 nM, 33.3 nM, and 50 nM.

[0027] Figure 7 shows the characterization of the pTERF family.

[0028] A) EMSA results for pTERF1. Protein concentration varied, while probe concentration remained constant at 40 nM. B) GFP activation verification of pTERF1. MutODD, a non-DNA-binding protein, was used as a negative control. "P" and "R" represent the protein and reporter plasmid, respectively. Two-tailed Student's t-test, n=3 biologically independent samples. Data are expressed as mean ± standard deviation. ***, p value < 0.001. C) BLI results for pTERF1. Protein concentrations were set as follows: 1.48 nM, 2.22 nM, 3.33 nM, 5 nM, and 7.5 nM. D) Electron micrographs of negatively stained pTERF1-DNA complexes and their representative 2D classification averages. The scale bar for the top image is 50 nm, and the scale bar for the bottom image is 10 nm. E) Architecture of the pTERF1, MTEF1_HUMAN, and MTTF_DROME proteins. The architectures of MTEF1_HUMAN and MTTF_DROME are derived from the literature (Roberti et al., 2009, The MTERF family proteins: mitochondrial transcription regulators and beyond. Biochimica et Biophysica Acta (BBA)-Bioenergetics). F) Sequence comparisons of pTERF1, MTEF1_HUMAN, and MTTF_DROME. G) Tertiary structures of pTERF1 and MTEF1_HUMAN (PDB: 3MVA). Enlarged boxes show the structure of a single repeat unit or mTERF motif. H) Domain architecture, repeats, and secondary structure weblogos of the pTERF family. Red rectangles represent repeat units within the pTERF protein, ranging from 3 to 11. I) Multiple sequence alignments and phylogenetic trees of the pTERF family. The names of experimentally tested pTERF1 homologs are labeled. In each alignment column, the degree of conservation is indicated by the background color of the amino acid. Bootstrapping confidence scores are indicated by the size of the circles. Color and identity coordinates are shown on the right. J) Species, repeat weblogos, and enriched motifs of pTERF2. K) EMSA results for pTERF2. Protein concentration varied, while probe concentration remained constant at 40 nM. L) BLI results for pTERF2. Protein concentrations were set as follows: 0.625 nM, 1.25 nM, 2.5 nM, 5 nM, and 10 nM.

[0029] Figure 8 shows the characterization of Pq STAR4.

[0030] A) Species, repeat weblogo, and enriched motifs of Pq STAR4. B) GFP activation validation of Pq STAR4. MutODD, a non-DNA-binding protein, was used as a negative control. "P" and "R" represent the protein and reporter plasmid, respectively. Two-tailed Student's t-test, n=3 biologically independent samples. Data are expressed as mean ± standard deviation. ***, p value < 0.001.

[0031] Figure 9 shows the characterization of the TRP shown in SEQ ID NO:10.

[0032] A) Species, repeat weblogo, and enriched motif of SEQ ID NO:10 (A0A662FLR2). B) GFP activation verification of SEQ ID NO:10 (A0A662FLR2). MutODD, a non-DNA-binding protein, was used as a negative control. "P" and "R" represent the protein and reporter plasmid, respectively. Two-tailed Student's t-test, n=3 biologically independent samples. Data are expressed as mean ± standard deviation. *, p value < 0.05. C) EMSA and BLI results for SEQ ID NO:10 (A0A662FLR2). For EMSA assays, protein concentration varied while probe concentration remained constant at 40 nM. For BLI assays, protein concentrations were set as follows: 62.5 nM, 125 nM, 250 nM, 500 nM, and 1000 nM.

Detailed Description of the Invention

1. Definitions

[0033] This application relates to proteins / peptides containing tandem repeats (TRPs) that bind to DNA, preferably with sequence specificity. As used herein, "proteins / peptides containing tandem repeats" and "TRP" refer to peptides comprising an N-terminal region, a C-terminal region, and an intermediate region consisting of tandem repeats (TRs). For example, the classic TALE (such as the TALE derived from AvrBs3 (UniProt number P14727)) is one of the known DNA-binding TRPs, in which each TR contains a repeat variable diresidue (RVD) that recognizes and binds to a nucleotide. Other residues in the repeat do not participate in DNA binding and can be considered a non-RVD scaffold. For the classic TALE, the RVDs are selected from: HD (recognizing C), NN (recognizing A and G), NG (recognizing T), NS / NI (recognizing A), NK (recognizing G), KS / KG (recognizing T), and HI (recognizing A and G).
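The RVD code listed above for the classic TALE amounts to a lookup table from diresidue to recognized base(s). The Python sketch below is a minimal illustration under stated assumptions: the table copies only the pairings given in the text, while the helper names `binds` and `design_rvds` and the default preference order are hypothetical. Whether a given TALE-like polypeptide of this disclosure follows exactly the same code is not asserted here.

```python
# RVD-to-nucleotide code for the classic TALE, copied from the definition above.
RVD_CODE = {
    "HD": {"C"},
    "NN": {"A", "G"},
    "NG": {"T"},
    "NS": {"A"},
    "NI": {"A"},
    "NK": {"G"},
    "KS": {"T"},
    "KG": {"T"},
    "HI": {"A", "G"},
}


def binds(rvds, dna):
    """True if every RVD recognizes the base at the same position in `dna`."""
    if len(rvds) != len(dna):
        return False
    return all(base in RVD_CODE[rvd] for rvd, base in zip(rvds, dna.upper()))


def design_rvds(dna, preferred=("HD", "NI", "NG", "NK")):
    """Choose one RVD per base, taking the first RVD in `preferred` that fits.

    The preference order is an illustrative assumption, not taken from the source.
    Raises StopIteration if a base has no matching RVD in `preferred`.
    """
    return [next(r for r in preferred if base in RVD_CODE[r]) for base in dna.upper()]


target = "CATG"
rvds = design_rvds(target)
print(rvds, binds(rvds, target))
```

Reprogramming a repeat array, in this simplified view, means replacing the RVD at a position while leaving the non-RVD scaffold untouched.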

[0034] The term "TALE-like polypeptide" refers to a polypeptide having a structure similar to that of a known TALE, i.e., containing TR, as identified in this disclosure.

[0035] In the context of this application, the TR in a programmed / programmable TALE-like polypeptide that is “derived from” the natural TR in the TRP or “derived from” the TRP refers to the TR obtained by modifying the RVD, preferably without modifying other residues in the natural TR.

[0036] The terms “polypeptide” and “protein” are used interchangeably in this document, referring to a polymer of amino acids and including full-length proteins as well as fragments thereof.

[0037] Endonucleases are enzymes that cleave phosphodiester bonds within polynucleotide chains, including restriction endonucleases that cleave DNA at specific sites without damaging bases. Examples of endonucleases include, but are not limited to, restriction endonucleases, meganucleases, TAL effector nucleases (TALENs), zinc finger nucleases, and Cas (CRISPR-associated) effector endonucleases.

[0038] As used herein, “nucleic acid” refers to a polynucleotide and includes single- or double-stranded polymers composed of deoxyribonucleotide or ribonucleotide bases. Nucleic acids may also include fragments and modified nucleotides. Therefore, the terms “polynucleotide,” “nucleic acid sequence,” “nucleotide sequence,” and “nucleic acid fragment” are used interchangeably to refer to single- or double-stranded RNA and / or DNA and / or RNA-DNA polymers, which optionally contain synthetic, non-natural, or modified nucleotide bases. Nucleotides (usually present as 5'-monophosphates) are represented by their single-letter names as follows: “A” represents adenosine or deoxyadenosine (corresponding to RNA or DNA, respectively), “C” represents cytidine or deoxycytidine, “G” represents guanosine or deoxyguanosine, “U” represents uridine, “T” represents deoxythymidine, “R” represents purine (A or G), “Y” represents pyrimidine (C or T), “K” represents G or T, “H” represents A, C, or T, “I” represents inosine, and “N” represents any nucleotide.
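To make the degenerate single-letter codes concrete, the sketch below expands a motif written with the codes named in this paragraph into a regular expression and tests a DNA sequence against it. It is an illustration only: the function names `motif_to_regex` and `matches` are hypothetical, only a DNA alphabet is assumed, and "U" and "I" are left out for simplicity.

```python
import re

# Degenerate codes named in the paragraph above (DNA alphabet assumed).
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "K": "[GT]",
    "H": "[ACT]",
    "N": "[ACGT]",  # any nucleotide
}


def motif_to_regex(motif):
    """Translate a motif with degenerate codes into a compiled regex."""
    return re.compile("".join(CODES[c] for c in motif.upper()))


def matches(motif, dna):
    """True if `dna` matches `motif` over its full length."""
    return motif_to_regex(motif).fullmatch(dna.upper()) is not None


print(matches("RYKN", "ACGT"))
```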

[0039] The term "genome," when applied to prokaryotic and eukaryotic cells or somatic cells, includes not only chromosomal DNA within the cell nucleus but also organelle DNA within subcellular components of the cell (such as mitochondria or plastids). The term "genome" refers to all genetic material (genes and non-coding sequences) present in every cell of an organism, virus, or organelle; and / or the complete set of chromosomes inherited as (haploid) units from one parent.

[0040] The term "homology" refers to similar DNA sequences. For example, a "region homologous to a genomic region" present on donor DNA is a DNA region with a sequence similar to a given "genomic region" in the genome of a cell or organism. The length of a homologous region can be any length sufficient to promote homologous recombination at the target site of cleavage. For example, homologous regions may contain 5 to 3000 or more bases, such as at least 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900 or 3000 bases, to achieve homologous recombination with the corresponding genomic regions.

[0041] As used herein, a “genomic region” refers to a segment of DNA on a cell chromosome or organelle that is located upstream or downstream of a target site, or that also contains a portion of the target site (located at the 5' or 3' end). Genomic regions may contain 5 to 3000 or more bases, for example, at least 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2100, 2200, 2300, 2400, 2500, 2600, 2700, 2800, 2900 or 3000 bases in length, to enable homologous recombination with corresponding homologous regions.

[0042] The term "homologous recombination" (HR) refers to the exchange of DNA fragments between two DNA molecules at homologous sites. The frequency of homologous recombination is influenced by a variety of factors. The amount of homologous recombination and the ratio of homologous to non-homologous recombination vary among different organisms. Generally, the length of the homologous region affects the frequency of homologous recombination events: the longer the homologous region, the higher the frequency. Furthermore, homologous recombination requires a certain length of homologous region, which varies from species to species. See, for example, Singer et al., (1982) Cell 31:25-33; Shen and Huang, (1986) Genetics 112:441-57; Watt et al., (1985) Proc. Natl. Acad. Sci. USA 82:4768-72; Sugawara and Haber, (1992) Mol Cell Biol 12:563-75; Rubnitz and Subramani, (1984) Mol Cell Biol 4:2253-8; Ayares et al., (1986) Proc. Natl. Acad. Sci. USA 83:5199-203; Liskay et al., (1987) Genetics 115:161-7.

[0043] In the context of nucleotide or amino acid sequences, "sequence identity" or "identity" refers to the same nucleotide bases or amino acid residues in two sequences when performing maximum correspondence alignment within a specified comparison window.

[0044] "Sequence identity percentage" refers to a value determined by comparing two optimally aligned sequences within a comparison window. Within this window, the portion of the nucleotide or amino acid sequence may include additions or deletions (i.e., gaps) relative to a reference sequence (which contains no additions or deletions) to achieve optimal alignment. The sequence identity percentage is calculated by dividing the number of matching positions (i.e., positions where nucleotide bases or amino acid residues are identical in both sequences) by the total number of positions in the comparison window and multiplying the result by 100. For example, if two sequences are optimally aligned within a comparison window of 1000 positions and have 950 identical positions, the sequences are 95% identical.
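The calculation described in this paragraph can be sketched in a few lines of Python. The sketch assumes the two sequences have already been aligned to equal length, with "-" marking gaps. The function name `percent_identity` is hypothetical, and performing the alignment itself is outside the scope of the example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two pre-aligned, equal-length sequences ('-' = gap).

    Matching positions are divided by the total number of positions in the
    comparison window (here, the full alignment length) and multiplied by 100.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# The worked example from the text: 950 identical positions in a 1000-position window.
seq_a = "A" * 1000
seq_b = "A" * 950 + "C" * 50
print(percent_identity(seq_a, seq_b))
```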

[0045] Various comparison methods have been designed for sequence alignment and calculation of identity or similarity percentages, including but not limited to the MegAlign™ program in the LASERGENE Bioinformatics Computing Suite (DNASTAR, Madison, Wisconsin). Throughout this application, it should be understood that when using sequence analysis software, unless otherwise stated, the analysis results will be based on the “default values” of the cited program. As used herein, “default values” refers to any set of values or parameters loaded when the software is first initialized.

[0046] BLAST is a search algorithm provided by NCBI used to find similar regions between biological sequences. The program compares nucleotide or protein sequences to a sequence database and calculates the statistical significance of matches to identify sequences that are sufficiently similar to the query sequence, ensuring that such similarity is not random. BLAST reports the identified sequences and their local alignment results with the query sequence. Those skilled in the art will understand that many different levels of sequence identity are very useful when identifying polypeptides from other species or those that are naturally or synthetically modified, as these polypeptides have the same or similar functions or activities. Useful examples of percentage representations include, but are not limited to, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, or 95%, or any percentage between 50% and 100%. In fact, any amino acid identity between 50% and 100% may be useful in describing the contents of this disclosure, such as 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99%.

[0047] "Isolated" polynucleotides, peptides, or proteins are those that are substantially or entirely free of components that normally accompany or interact with them in their natural environment. Therefore, isolated polynucleotides, peptides, or proteins are substantially free of other cellular material or culture media (when produced using recombinant technologies), or substantially free of chemical precursors or other chemicals (when chemically synthesized). Preferably, "isolated" polynucleotides are free of naturally occurring flanking sequences (i.e., sequences located at the 5' and 3' ends of the polynucleotide) in the genomic DNA of their source organism. Isolated polynucleotides and peptides can be purified from the cells in which they are naturally present. Methods for isolating or purifying polynucleotides or peptides are known.

[0048] The term "fragment" refers to a sequence of consecutive nucleotides or amino acids. In one embodiment, a fragment comprises 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more consecutive nucleotides. In one embodiment, a fragment comprises 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more consecutive amino acids. A fragment may or may not exhibit the function of a sequence that shares a certain percentage of sequence identity with it over the length of the fragment.

[0049] A "functional fragment" refers to a portion of an isolated polynucleotide or polypeptide that has the same activity or function as the longer or full-length sequence from which the fragment is derived.

[0050] A “gene” is a segment of nucleic acid that expresses a functional molecule (such as, but not limited to, a specific protein), including the regulatory sequence preceding the coding sequence (5' non-coding sequence) and the regulatory sequence following the coding sequence (3' non-coding sequence).

[0051] "Endogenous" refers to sequences or other molecules that naturally exist in cells or organisms. Endogenous polynucleotides are usually found in the genome of cells, i.e., they are not heterologous.

[0052] "Heterologous" refers to a difference between the original environment, location, or composition of a particular polynucleotide or polypeptide and its current environment, location, or composition. Non-limiting examples include differences in taxonomic origin (e.g., a polynucleotide obtained from species A is heterologous if it is inserted into the genome of species B, or into the genome of a different variety or cultivar of species A; or a polynucleotide obtained from bacteria is introduced into plant or animal cells), or sequence differences (e.g., a polynucleotide obtained from species A is isolated, modified, and then reintroduced into a plant of species A).

[0053] A "coding sequence" refers to a nucleotide sequence that encodes a specific amino acid sequence. A "regulatory sequence" refers to a nucleotide sequence located upstream of (5' non-coding sequence), within, or downstream of (3' non-coding sequence) a coding sequence. Regulatory sequences affect the transcription, RNA processing or stability, or translation of the associated coding sequence. Regulatory sequences include, but are not limited to, promoters, translation leader sequences, 5' untranslated regions, 3' untranslated regions, introns, polyadenylation signal sequences, RNA processing sites, effector binding sites, and stem-loop structures.

[0054] A "mutated gene" is a gene that has been altered through human intervention. The sequence of a mutated gene differs from that of the corresponding non-mutated gene by at least one nucleotide addition, deletion, insertion, or substitution. In this disclosure, a mutated gene comprises alterations caused by the fusion polypeptides disclosed herein, such as fusion polypeptides containing a fusion partner that provides nuclease activity. A mutant organism is an organism that contains a mutated gene.

[0055] As used herein, “targeted mutation” means a mutation produced in a target sequence within a gene using any method known to those skilled in the art, including methods using a TALEN comprising a TALE-like peptide disclosed herein and fused with a nuclease. The term “gene knockout” refers to the partial or complete inactivation of a DNA sequence in a cell, for example, by targeted knockout using a TALEN comprising a TALE-like peptide disclosed herein and fused with a nuclease; for example, the DNA sequence prior to knockout may encode an amino acid sequence or have a regulatory function (e.g., a promoter).

[0056] The term "gene knock-in" refers to the replacement or insertion of a DNA sequence at a specific site in the cell's genome through TALEN targeting (e.g., through homologous recombination (HR), in which a suitable donor DNA polynucleotide is also used). Gene knock-in can be a specific insertion of a heterologous nucleotide sequence encoding an amino acid sequence or functional RNA, or a specific insertion of a transcriptional regulatory element.

[0057] The term "domain" refers to a continuous nucleotide sequence (which can be RNA, DNA, and / or RNA-DNA combination sequences) or a continuous or discontinuous amino acid sequence.

[0058] A "conserved domain" or "motif" refers to a group of nucleotides or amino acids that are conserved at specific positions in the aligned sequences of evolutionarily related genes or proteins. While homologous genes or proteins may differ in nucleotides or amino acids at other positions, highly conserved nucleotides or amino acids at specific positions indicate that these residues are essential for the structure, stability, or function of the polynucleotides or proteins.

[0059] "Codon-optimized" nucleotide sequences refer to nucleotide sequences whose codon usage frequencies are designed to mimic the preferred codon usage frequencies of the host cell. "Optimized" polynucleotides contain nucleotide sequences optimized for improved expression in specific heterologous host cells.

[0060] A promoter is a nucleotide sequence that participates in the recognition and binding of RNA polymerases and other proteins to initiate transcription. Promoters can be derived entirely from natural genes, or consist of different elements derived from different promoters found in nature, and / or contain synthetic DNA fragments. Those skilled in the art will understand that different promoters can direct gene expression in different tissues or cell types, or at different developmental stages, or in response to different environmental conditions. It is further recognized that, since the exact boundaries of regulatory sequences are not fully defined in most cases, some variant DNA fragments may possess the same promoter activity.

[0061] Promoters that cause genes to be expressed in most tissues or cell types most of the time are generally called "constitutive promoters." The terms "inducible promoter" and "regulated promoter" refer to promoters that selectively express coding sequences or functional RNA in response to the presence of endogenous or exogenous stimuli, such as compounds (chemical inducers), or in response to environmental, hormonal, chemical, and / or developmental signals. Inducible or regulated promoters include promoters that are induced or regulated, for example, by light, heat, stress, flooding or drought, salt stress, osmotic stress, plant hormones, wounding, or chemical substances (such as ethanol, abscisic acid (ABA), jasmonic acid, salicylic acid, or safeners).

[0062] An "enhancer" is a nucleotide sequence that can stimulate promoter activity and can be an intrinsic element of the promoter or a heterologous element inserted to enhance promoter activity or tissue specificity.

[0063] The term "leader sequence" refers to the nucleotide sequence located between the promoter sequence of a gene and the coding sequence. Leader sequences are present in mRNA upstream of the start codon. Leader sequences can influence the processing of primary transcripts into mRNA, mRNA stability, or translation efficiency.

[0064] The term "3' non-coding sequence" (used interchangeably with "transcription terminator" or "termination sequence") refers to a nucleotide sequence located downstream of a coding sequence, including polyadenylation signal sequences and other sequences encoding regulatory signals that can affect mRNA processing or gene expression. Polyadenylation signals are typically characterized by their effect on the addition of polyadenylate tracts to the 3' end of the precursor mRNA.

[0065] The term "host" refers to an organism or cell into which a heterologous component (polynucleotide, polypeptide, other molecule, cell) has been introduced. As used herein, "host cell" refers to an in vivo or isolated eukaryotic cell, prokaryotic cell (e.g., bacterial or archaeal cell), or a cell derived from a multicellular organism (e.g., a cell line) and cultured as a single-cell entity, into which a heterologous polynucleotide or polypeptide has been introduced. Cells are selected from: archaeal cells, bacterial cells, eukaryotic cells, eukaryotic single-celled organisms, somatic cells, germ cells, stem cells, plant cells, algal cells, and animal cells, such as invertebrate cells, vertebrate cells, fish cells, frog cells, avian cells, insect cells, mammalian cells, pig cells, bovine cells, goat cells, sheep cells, rodent cells, rat cells, mouse cells, non-human primate cells, and human cells. In some cases, the cells are isolated. In some cases, the cells are in vivo.

[0066] The term "recombinant" refers to the artificial combination of two originally separate sequence segments, such as through chemical synthesis or genetic engineering techniques.

[0067] The terms "plasmid" and "vector" refer to linear or circular extrachromosomal elements, typically in the form of double-stranded DNA, that usually carry genes that do not participate in the cell's central metabolism. Such elements can be autonomously replicating sequences, genome-integrated sequences, bacteriophages, or nucleotide sequences, and can be linear or circular, single-stranded or double-stranded DNA or RNA, derived from any source, in which multiple nucleotide sequences have been linked or recombined into a unique construct capable of introducing target polynucleotides into the cell.

[0068] The term "construct," when referring to nucleic acid molecules, refers to an artificial combination of nucleic acid sequences, such as regulatory and coding sequences that are not all found together in nature. When a nucleic acid construct contains the control sequences required to express a coding sequence of this application, the term is synonymous with the term "expression cassette." For example, a nucleic acid construct may contain regulatory and coding sequences from different sources, or regulatory and coding sequences from the same source but arranged in a manner different from that found in nature. Such constructs may be used alone or in combination with a vector. If a vector is used, the choice of vector depends on the method to be used to introduce the vector into a host cell, as is well known to those skilled in the art. A vector used to express a coding sequence (e.g., containing an expression construct) is called an "expression vector."

[0069] As used herein, the term "expression" refers to the production of functional end products (such as mRNA, guide RNA, or protein) in their precursor or mature form.

[0070] As used herein, an "effector" or "effector protein" is a protein that has an activity including recognizing, binding, and / or cleaving or nicking a polynucleotide target. An effector or effector protein may also be an endonuclease, such as the TALE-like peptide or fusion peptide disclosed herein.

[0071] The term "functional fragment" of a TALE-like peptide refers to a portion of the TALE-like peptide disclosed herein that retains the ability to recognize and bind to target sites. The term "functional variant" of a TALE-like peptide refers to a variant of the TALE-like peptide disclosed herein that retains the ability to recognize and bind to target sequences.

[0072] The terms "targeting domain" and "targeting region" are used interchangeably herein and include a nucleotide sequence capable of hybridizing with (being complementary to) one strand (nucleotide sequence) of a double-stranded DNA target site. The percentage of complementarity between the targeting region and the target sequence can be at least 50%, 51%, 52%, 53%, 54%, 55%, 56%, 57%, 58%, 59%, 60%, 61%, 62%, 63%, 64%, 65%, 66%, 67%, 68%, 69%, 70%, 71%, 72%, 73%, 74%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 100%. The length of the variable targeting region can be at least 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, or 30 nucleotides. In some embodiments, the variable targeting domain comprises a continuous segment of 12 to 30 nucleotides. The targeting domain can consist of a DNA sequence, an RNA sequence, a modified DNA sequence, a modified RNA sequence, or any combination thereof.

[0073] The terms "target site," "target sequence," and "target region" are used interchangeably herein to refer to a nucleotide sequence on a chromosome, episome, locus, or any other DNA molecule in the cellular genome (including chromosomal, chloroplast, and mitochondrial DNA, and plasmid DNA) that the TALE-like polypeptide of this disclosure can recognize and bind to. The target site can be an endogenous site in the cellular genome, or it can be heterologous to the cell and therefore not naturally present in the cellular genome, or it can be found at a heterologous genomic location relative to its natural genomic location.

[0074] The terms “altered target site,” “altered target sequence,” “modified target site,” and “modified target sequence” are used interchangeably herein to refer to the target sequence disclosed herein that contains at least one alteration compared to the unaltered target sequence. Such an “alteration” includes, for example: (i) substitution of at least one nucleotide, (ii) deletion of at least one nucleotide, (iii) insertion of at least one nucleotide, (iv) chemical alteration of at least one nucleotide, or (v) any combination of (i)-(iv).

[0075] "Modified nucleotide" or "edited nucleotide" refers to a target nucleotide sequence that contains at least one alteration compared to its unmodified nucleotide sequence. Such an "alteration" includes, for example: (i) substitution of at least one nucleotide, (ii) deletion of at least one nucleotide, (iii) insertion of at least one nucleotide, (iv) chemical change of at least one nucleotide, or (v) any combination of (i)-(iv).

[0076] The terms "modifying target sites" and "altering target sites" are used interchangeably herein to refer to methods used to generate altered target sites.

[0077] As used herein, “donor DNA” is a DNA construct containing a target polynucleotide to be inserted into the target site of the TALE-like polypeptide disclosed herein.

[0078] The term "polynucleotide modification template" includes polynucleotides that, when compared to the nucleotide sequence to be edited, contain at least one nucleotide modification. A nucleotide modification can be a substitution, addition, insertion, or deletion of at least one nucleotide. Optionally, the polynucleotide modification template may also contain a homologous nucleotide sequence flanking the at least one nucleotide modification, wherein the flanking homologous nucleotide sequence provides sufficient homology to the desired nucleotide sequence to be edited.

[0079] As used herein, the term "before" in relation to sequence position refers to a sequence appearing upstream of another sequence (5' end for nucleotide sequences, N-terminus for amino acid sequences). The term "after" in relation to sequence position refers to a sequence appearing downstream of another sequence (3' end for nucleotide sequences, C-terminus for amino acid sequences).

2. TR-containing polypeptides capable of binding to DNA with sequence specificity

[0080] The inventors have identified several TRPs capable of binding to DNA with sequence specificity. Through sequence analysis, the inventors classified these TRPs into the STAR (short TALE-like repeat protein), MOON (marine-derived DNA-binding protein), and pTERF (prokaryotic mTERF-like protein) families. The STAR family includes two types of TRPs, Pq STAR1 (SEQ ID NO: 1) and Asp STAR1 (SEQ ID NO: 2), each of which contains nine TRs, exhibits a structure similar to that of the classic TALE (AvrBs3), and uses the same TALE code (recognizing nucleotides via the RVD).

[0081] Therefore, this disclosure provides a TRP capable of binding DNA in a sequence-specific manner. In some embodiments, the TRP is a programmed / programmable TALE-like polypeptide derived from the STAR family of TRPs, wherein the TR domain is programmed / programmable by inserting, deleting, and / or rearranging the TR, and / or substituting the RVD in the TR.

[0082] In some embodiments, the programmed / programmable TALE-like polypeptide comprises an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises the following amino acid sequence:
FX1X2DNLX3X4X5X6X7X8X9GX10X11X12X13LX14X15LLX16X17X18PX19LX20X21X22G
wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 is K or R, X5 is V or I, X6 is A or G, X7 is A or G, X8 and X9 are the repeat variable diresidue (RVD), X10 is G, S, or A, X11 is A, Q, or K, X12 is Q, H, or K, X13 is A or T, X14 is Q or D, X15 is A or T, X16 is D or Q, X17 is K, V, or R, X18 is G, Y, or S, X19 is A, K, Q, R, or T, X20 is R, A, or T, X21 is Q or N, and X22 is A or G.
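The consensus in paragraph [0082] can be read as a pattern with one set of permitted residues per variable position. The following is a minimal sketch of that reading as a regular expression; the names `TR_CONSENSUS` and `match_repeat` are hypothetical, and leaving X8X9 unconstrained (any two residues) is an assumption made for illustration only.

```python
import re

# One character class per variable position of the consensus.
# X8X9 (the RVD) is captured as a named group and not otherwise constrained.
TR_CONSENSUS = re.compile(
    r"F[GS][NP]DNL[VI][KR][VI][AG][AG]"  # F X1 X2 D N L X3 X4 X5 X6 X7
    r"(?P<rvd>[A-Z]{2})"                 # X8 X9
    r"G[GSA][AQK][QHK][AT]"              # G X10 X11 X12 X13
    r"L[QD][AT]LL[DQ][KVR][GYS]"         # L X14 X15 L L X16 X17 X18
    r"P[AKQRT]L[RAT][QN][AG]G"           # P X19 L X20 X21 X22 G
)


def match_repeat(repeat: str):
    """Return the RVD if the whole repeat fits the consensus, otherwise None."""
    m = TR_CONSENSUS.fullmatch(repeat)
    return m.group("rvd") if m else None


# A repeat of the FGNDNLVKVAA... type with HD placed at the RVD position.
print(match_repeat("FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG"))  # HD
# A repeat of the ILSGQETNRIK... type does not start with F, so it does not fit.
print(match_repeat("ILSGQETNRIKHDGGAKALETLSEKAEALHRAG"))  # None
```

A check like this only tests conformity with the written consensus; it says nothing about whether a given repeat actually binds DNA.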

[0083] In some embodiments, the programmed / programmable TALE-like polypeptide comprises an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises an amino acid sequence selected from formulas I to VII and XXII to XXIV:
FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I), wherein X1 is G or S, X2 and X3 are the repeat variable diresidue (RVD), X4 is G or S, X5 is A or Q, X6 is H or Q, X7 is A or T, and X8 is K or R;
FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II), wherein X1 is T or K, X2 is Q, E, or R, X3 is A or G, X4 and X5 are the RVD, X6 is N or D, X7 is T or K, X8 is A or V, X9 is T, R, or K, X10 is H or Y, X11 is A, P, or Q, X12 is T or R, X13 is A, D, or T, and X14 is A or V;
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III), wherein X1X2 is the RVD;
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV), wherein X1X2 is the RVD;
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V), wherein X1X2 is the RVD;
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI), wherein X1X2 is the RVD;
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII), wherein X1X2 is the RVD;
FX1X2DNLX3KVAAX4X5GGX6QALLDKX7PX8LRX9AG (XXII), wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 and X5 are the RVD, X6 is A or Q, X7 is G or S, X8 is A or T, and X9 is Q or N;
GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII), wherein X1X2 is the RVD; and
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV), wherein X1X2 is the RVD.

[0084] In some embodiments, the RVD is selected from: HD (recognizing C), NN (recognizing A and G), NG (recognizing T), NS / NI (recognizing A), NK (recognizing G), KS / KG (recognizing T), and HI (recognizing A and G).
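The RVD assignments listed in paragraph [0084] amount to a lookup table from each repeat's RVD to the base or bases it recognizes. The sketch below encodes that table; it assumes that "NS / NI" and "KS / KG" each denote two separate RVDs with the stated base, and it writes "A and G" with the standard IUPAC ambiguity code R. The names `RVD_CODE` and `predicted_target` are hypothetical.

```python
# RVD-to-base assignments as listed in the paragraph above.
RVD_CODE = {
    "HD": "C",
    "NN": "AG",
    "NG": "T",
    "NS": "A",
    "NI": "A",
    "NK": "G",
    "KS": "T",
    "KG": "T",
    "HI": "AG",
}

# Standard IUPAC nucleotide codes; R denotes a purine (A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "AG": "R"}


def predicted_target(rvds):
    """Map an ordered list of RVDs to the base(s) each repeat is stated to recognize."""
    try:
        return "".join(IUPAC[RVD_CODE[rvd]] for rvd in rvds)
    except KeyError as exc:
        raise ValueError(f"RVD not in the listed set: {exc.args[0]}") from None


print(predicted_target(["HD", "NG", "NN", "NI", "NK"]))  # CTRAG
```

The output has one symbol per repeat, which reflects the one-repeat-per-nucleotide reading of the code.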

[0085] In some embodiments, each tandem repeat comprises an amino acid sequence selected from formulas III to XXI and XXIII to XXXV:
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III)
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV)
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V)
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI)
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII)
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (VIII)
FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG (IX)
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (X)
FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG (XI)
FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG (XII)
FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG (XIII)
FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG (XIV)
FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG (XV)
FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG (XVI)
FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG (XVII)
FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG (XVIII)
FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG (XIX)
FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG (XX)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXI)
GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII)
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXV)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXVI)
FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG (XXVII)
FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG (XXVIII)
FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXIX)
FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG (XXX)
FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXI)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXII)
FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG (XXXIII)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG (XXXIV)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXV)
wherein X1X2 is the RVD.

[0086] The N-terminal and C-terminal regions can be those from TALEs known in the art (such as AvrBs3). The N-terminal and C-terminal regions can be those from TRPs in the STAR family of this disclosure.

[0087] In some embodiments, the N-terminal region comprises the amino acid sequence of positions 1-53 of SEQ ID NO: 1 (optionally including an additional N-terminal M if desired), the amino acid sequence of positions 1-97 of SEQ ID NO: 2, or the amino acid sequence of positions 1-39 of SEQ ID NO: 27. In some embodiments, the N-terminal region can be truncated by removing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more residues from the N-terminus. For example, the N-terminal region comprises the amino acid sequence of positions 5-53, 10-53 or 15-53 of SEQ ID NO: 1; the N-terminal region comprises the amino acid sequence of positions 5-97, 10-97, 15-97 or 20-97 of SEQ ID NO: 2; or the N-terminal region comprises the amino acid sequence of positions 5-39, 10-39 or 15-39 of SEQ ID NO: 27.

[0088] In some embodiments, the C-terminal region comprises the amino acid sequence of positions 351-382 of SEQ ID NO: 1, the amino acid sequence of positions 392-762 of SEQ ID NO: 2, or the amino acid sequence of positions 469-500 of SEQ ID NO: 27. In some embodiments, the C-terminal region can be truncated by removing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more residues from the C-terminus. For example, the C-terminal region comprises the amino acid sequence of positions 351-380, 351-375 or 351-370 of SEQ ID NO: 1; the C-terminal region comprises the amino acid sequence of positions 392-750, 392-700 or 392-650 of SEQ ID NO: 2; or the C-terminal region comprises the amino acid sequence of positions 469-495, 469-490 or 469-485 of SEQ ID NO: 27.
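The position ranges quoted in the preceding paragraphs use 1-based, inclusive residue numbering, which differs from the 0-based, end-exclusive slicing of most programming languages. The helper below is a minimal sketch of that conversion; the name `region` is hypothetical. The sample input is the 53-residue segment that precedes the first repeat in SEQ ID NO: 25 as written in this application; treating it as identical to positions 1-53 of SEQ ID NO: 1 is an assumption made for illustration.

```python
def region(seq: str, start: int, end: int) -> str:
    """Return residues start..end of seq, 1-based and inclusive."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError(f"positions {start}-{end} fall outside 1-{len(seq)}")
    return seq[start - 1:end]


# Sample N-terminal segment (53 residues); see the assumption stated above.
N_TERM = "YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL"

full = region(N_TERM, 1, 53)        # untruncated N-terminal region
minus_4 = region(N_TERM, 5, 53)     # four residues removed from the N-terminus
minus_14 = region(N_TERM, 15, 53)   # fourteen residues removed from the N-terminus

print(len(full), len(minus_4), len(minus_14))  # 53 49 39
```

Truncation from the C-terminus works the same way, with the start position held fixed and the end position reduced.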

[0089] In some embodiments, the N-terminal region comprises an amino acid sequence having at least 60%, 70%, 80%, 90%, 92%, 94%, 96%, or 98% identity with the amino acid sequence of positions 1-53 of SEQ ID NO: 1 (optionally including an additional N-terminal M if desired), an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identity with the amino acid sequence of positions 1-97 of SEQ ID NO: 2, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, or 97% identity with the amino acid sequence of positions 1-39 of SEQ ID NO: 27.

[0090] In some embodiments, the C-terminal region contains an amino acid sequence that has at least 60%, 70%, 80%, 90%, 93%, or 96% identity with the amino acid sequence of positions 351-382 of SEQ ID NO: 1, at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5% identity with the amino acid sequence of positions 392-762 of SEQ ID NO: 2, or at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, or 96% identity with the amino acid sequence of positions 469-500 of SEQ ID NO: 27.

[0091] Surprisingly, the inventors discovered that STAR achieves higher affinity than the classic TALE with fewer TRs. Specifically, Pq STAR1 and Asp STAR1 each contain nine TRs and recognize and bind to sequences of nine nucleotides, whereas AvrBs3 contains eighteen TRs; nevertheless, Pq STAR1 and Asp STAR1 have a DNA binding affinity higher than or comparable to that of AvrBs3. Therefore, the target sequence of TALE-like peptides can be shorter than that of classic TALE peptides, providing greater flexibility in TALE-like peptide design.

[0092] In some embodiments, the programmed / programmable TALE-like peptide comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more TRs. Preferably, the programmed / programmable TALE-like peptide comprises 3-17, 4-17, 5-17, 6-17, 7-17, 8-17 or 9-17 TRs.

[0093] In some embodiments, the TRs are identical to one another except for the RVD. In some embodiments, the TRs differ from one another at residues other than the RVD.

[0094] In some embodiments, the TRs are derived from the native TRs of a single TRP. In some embodiments, the TRs are derived from SEQ ID NO: 1. In some embodiments, the TRs are derived from SEQ ID NO: 2. In some embodiments, the TRs are derived from SEQ ID NO: 27.

[0095] In some embodiments, each TR contains an amino acid sequence selected from formulas I, III, and IV. In some embodiments, each TR contains an amino acid sequence selected from formulas II and V through VII. In some embodiments, each TR contains an amino acid sequence selected from formulas XXII through XXIV.

[0096] In some embodiments, each TR contains an amino acid sequence selected from formulas III, IV, and VIII to XIV. In some embodiments, each TR contains an amino acid sequence selected from formulas V to VII and XV to XX. In some embodiments, each TR contains an amino acid sequence selected from formulas XXIII to XXXV.

[0097] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 25, 26, or 28:

SEQ ID NO: 25
YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG
FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG
FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG
FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG
VSQAEILTLATKYRGASVALQSKLKGLTAAGR

SEQ ID NO: 26
MDIRSLLNPLPSPGPGERAPGKRASDATPRALPSSLPDFGLPQGKRRKTTVGSSPGGRPRQDLSTLSAFFQRARVSEDAHPASATVEQSGPLGATNW
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG
FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG
FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG
FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG
FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG
FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG
FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH
SKEDIVKAGAKQRGAAAHVKQMANACRIKQESAAQSPRPMPTVLVERPIDQARTAFIPELQHCDLTGGTPIWSLDEASRVVLRHPMDPIEGNNDLFPLRDLTRPLDRVYERYADKNGKCHPNVKLTNIDLASGYKKYFNELCRDSRVGLSPSETANVRGRLLTNARTEFERLIREEAAPERPCKVRQLDHGGLLEHERMLAGQYGLFLAPAHSPQDQCTLRNGRILGFYMGMFAANEQQINAIEAQHPDYESYAMDAMRPGGKLTVYSALGCANDLAFANTALCADTPEPAYDRERLNAEFIPFEVKLTDRHGKPARETVVAMVALDNAIGKEIRVDYGDAFLRQFTTPRDRARSEEDAVVVKMEVDD

SEQ ID NO: 28
MNTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL
GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG
FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG
FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG
FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG
VSQDGILTLATKHRGASGALQSKLSELTAAGR

wherein X1X2 are RVDs selected from the following group: HD (recognizes C), NN (recognizes A and G), NG (recognizes T), NS / NI (recognizes A), NK (recognizes G), KS / KG (recognizes T), and HI (recognizes A and G).
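SEQ ID NO: 25 in paragraph [0097] can be read as a template: a fixed N-terminal segment, nine repeat scaffolds each carrying an X1X2 placeholder, and a fixed C-terminal segment. The sketch below fills the placeholders with a chosen list of RVDs. It is an illustration under stated assumptions, not the inventors' procedure: the function name `program_polypeptide` is hypothetical, the RVD list is arbitrary sample input, and "NS / NI" and "KS / KG" are read as two separate RVDs each.

```python
# Segments copied from SEQ ID NO: 25 as written above.
N_TERM = "YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL"
REPEATS = [
    "GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG",
    "FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG",
    "FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG",
    "FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG",
    "FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG",
]
C_TERM = "VSQAEILTLATKYRGASVALQSKLKGLTAAGR"

# RVDs named in the paragraph above.
ALLOWED_RVDS = {"HD", "NN", "NG", "NS", "NI", "NK", "KS", "KG", "HI"}


def program_polypeptide(rvds):
    """Fill the X1X2 placeholder of each repeat, in order, with the given RVDs."""
    if len(rvds) != len(REPEATS):
        raise ValueError(f"expected {len(REPEATS)} RVDs, got {len(rvds)}")
    for rvd in rvds:
        if rvd not in ALLOWED_RVDS:
            raise ValueError(f"RVD not in the listed set: {rvd}")
    filled = [tpl.replace("X1X2", rvd, 1) for tpl, rvd in zip(REPEATS, rvds)]
    return N_TERM + "".join(filled) + C_TERM


# Arbitrary sample input: one RVD per repeat, first repeat first.
protein = program_polypeptide(["HD", "NG", "NN", "NI", "NK", "HD", "NG", "NG", "NI"])
print(len(protein))  # 382
```

The result is a plain amino acid string with no placeholders left; which nucleotide sequence it would bind, and how well, is an experimental question that this sketch does not address.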

[0098] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 1, 2 or 27, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 99.5% identity with any of SEQ ID NO: 1, 2 or 27, for example, having a different RVD.

[0099] The inventors also identified DNA-binding TRPs of SEQ ID NO: 3-6 and 9-22. Therefore, this disclosure also provides a TRP comprising an amino acid sequence selected from SEQ ID NO: 3-6 and 9-22, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5% identity with any of SEQ ID NO: 3-6 and 9-22, preferably with TR unchanged.

3. Fusion peptides

[0100] This disclosure provides a fusion polypeptide comprising a TRP fused to a fusion partner, the TRP being capable of binding DNA in a sequence-specific manner.

[0101] In some embodiments, the TRP is a programmed / programmable TALE-like polypeptide. In some embodiments, the programmed / programmable TALE-like polypeptide comprises an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises the following amino acid sequence:
FX1X2DNLX3X4X5X6X7X8X9GX10X11X12X13LX14X15LLX16X17X18PX19LX20X21X22G
wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 is K or R, X5 is V or I, X6 is A or G, X7 is A or G, X8 and X9 are the repeat variable diresidue (RVD), X10 is G, S, or A, X11 is A, Q, or K, X12 is Q, H, or K, X13 is A or T, X14 is Q or D, X15 is A or T, X16 is D or Q, X17 is K, V, or R, X18 is G, Y, or S, X19 is A, K, Q, R, or T, X20 is R, A, or T, X21 is Q or N, and X22 is A or G.

[0102] In some embodiments, the programmed / programmable TALE-like polypeptide comprises an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises an amino acid sequence selected from formulas I to VII and XXII to XXIV:
FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I), wherein X1 is G or S, X2 and X3 are the repeat variable diresidue (RVD), X4 is G or S, X5 is A or Q, X6 is H or Q, X7 is A or T, and X8 is K or R;
FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II), wherein X1 is T or K, X2 is Q, E, or R, X3 is A or G, X4 and X5 are the RVD, X6 is N or D, X7 is T or K, X8 is A or V, X9 is T, R, or K, X10 is H or Y, X11 is A, P, or Q, X12 is T or R, X13 is A, D, or T, and X14 is A or V;
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III), wherein X1X2 is the RVD;
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV), wherein X1X2 is the RVD;
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V), wherein X1X2 is the RVD;
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI), wherein X1X2 is the RVD;
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII), wherein X1X2 is the RVD;
FX1X2DNLX3KVAAX4X5GGX6QALLDKX7PX8LRX9AG (XXII), wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 and X5 are the RVD, X6 is A or Q, X7 is G or S, X8 is A or T, and X9 is Q or N;
GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII), wherein X1X2 is the RVD; and
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV), wherein X1X2 is the RVD.

[0103] In some embodiments, the RVD is selected from: HD (recognizing C), NN (recognizing A and G), NG (recognizing T), NS / NI (recognizing A), NK (recognizing G), KS / KG (recognizing T), and HI (recognizing A and G).

[0104] In some embodiments, each tandem repeat comprises an amino acid sequence selected from formulas III to XXI and XXIII to XXXV:
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III)
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV)
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V)
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI)
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII)
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (VIII)
FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG (IX)
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (X)
FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG (XI)
FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG (XII)
FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG (XIII)
FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG (XIV)
FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG (XV)
FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG (XVI)
FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG (XVII)
FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG (XVIII)
FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG (XIX)
FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG (XX)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXI)
GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII)
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXV)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXVI)
FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG (XXVII)
FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG (XXVIII)
FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXIX)
FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG (XXX)
FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXI)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXII)
FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG (XXXIII)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG (XXXIV)
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXV)
wherein X1X2 is the RVD.

[0105] The N-terminal and C-terminal regions can be those from TALEs known in the art (e.g., AvrBs3). The N-terminal and C-terminal regions can be those from TRPs in the STAR family of this disclosure.

[0106] In some embodiments, the N-terminal region comprises the amino acid sequence of positions 1-53 of SEQ ID NO: 1 (optionally including an additional N-terminal M if desired), the amino acid sequence of positions 1-97 of SEQ ID NO: 2, or the amino acid sequence of positions 1-39 of SEQ ID NO: 27. In some embodiments, the N-terminal region can be truncated by removing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more residues from the N-terminus. For example, the N-terminal region comprises the amino acid sequence of positions 5-53, 10-53 or 15-53 of SEQ ID NO: 1; the N-terminal region comprises the amino acid sequence of positions 5-97, 10-97, 15-97 or 20-97 of SEQ ID NO: 2; or the N-terminal region comprises the amino acid sequence of positions 5-39, 10-39 or 15-39 of SEQ ID NO: 27.
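The position ranges above are 1-based and inclusive, which is easy to get wrong when slicing a sequence in code. The sketch below shows the conversion. As a stand-in it uses the 53-residue segment that precedes the first repeat in SEQ ID NO: 25 of this document; treating that segment as equivalent to positions 1-53 of SEQ ID NO: 1 is an assumption, and the helper name `region` is hypothetical.

```python
def region(seq, start, end):
    """Return residues start..end of seq, using 1-based inclusive positions."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("positions out of range")
    return seq[start - 1:end]


# Assumed stand-in: the segment preceding the first repeat in SEQ ID NO: 25.
N_TERM = "YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL"

full = region(N_TERM, 1, 53)        # untruncated N-terminal region
truncated = region(N_TERM, 5, 53)   # four residues removed from the N-terminus

print(len(full), len(truncated))
```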

[0107] In some embodiments, the C-terminal region comprises the amino acid sequence of positions 351-382 of SEQ ID NO: 1, the amino acid sequence of positions 392-762 of SEQ ID NO: 2, or the amino acid sequence of positions 469-500 of SEQ ID NO: 27. In some embodiments, the C-terminal region can be truncated by removing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more residues from the C-terminus. For example, the C-terminal region comprises the amino acid sequence of positions 351-380, 351-375 or 351-370 of SEQ ID NO: 1; the C-terminal region comprises the amino acid sequence of positions 392-750, 392-700 or 392-650 of SEQ ID NO: 2; or the C-terminal region comprises the amino acid sequence of positions 469-495, 469-490 or 469-485 of SEQ ID NO: 27.

[0108] In some embodiments, the N-terminal region comprises an amino acid sequence having at least 60%, 70%, 80%, 90%, 92%, 94%, 96%, or 98% identity with the amino acid sequence of positions 1-53 of SEQ ID NO: 1 (optionally including an additional N-terminal M if desired), an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identity with the amino acid sequence of positions 1-97 of SEQ ID NO: 2, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, or 97% identity with the amino acid sequence of positions 1-39 of SEQ ID NO: 27.

[0109] In some embodiments, the C-terminal region contains an amino acid sequence that has at least 60%, 70%, 80%, 90%, 93%, or 96% identity with the amino acid sequence of positions 351-382 of SEQ ID NO: 1, at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5% identity with the amino acid sequence of positions 392-762 of SEQ ID NO: 2, or at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, or 96% identity with the amino acid sequence of positions 469-500 of SEQ ID NO: 27.
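To make the identity thresholds concrete, the sketch below computes percent identity for two sequences of equal length by position-by-position comparison. This is a simplification: identity between sequences of different lengths is normally computed from a gapped alignment, which is not shown here. The two inputs are the 32-residue segments that follow the last repeat in SEQ ID NO: 25 and SEQ ID NO: 28 of this document; the function name is hypothetical.

```python
def percent_identity(a, b):
    """Ungapped percent identity between two equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Segments following the last repeat in SEQ ID NO: 25 and SEQ ID NO: 28.
C_TERM_25 = "VSQAEILTLATKYRGASVALQSKLKGLTAAGR"
C_TERM_28 = "VSQDGILTLATKHRGASGALQSKLSELTAAGR"

print(percent_identity(C_TERM_25, C_TERM_28))
```

For a 32-residue region, each substitution changes the identity by 3.125 percentage points, so a threshold such as 96% tolerates a single substitution.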

[0110] In some embodiments, the programmed / programmable TALE-like peptide comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more TRs. Preferably, the programmed / programmable TALE-like peptide comprises 3-17, 4-17, 5-17, 6-17, 7-17, 8-17 or 9-17 TRs.

[0111] In some implementations, the TRs are identical to one another except for the RVD. In some implementations, the TRs differ from one another at residues other than the RVD.

[0112] In some implementations, the TRs are derived from the natural TRs of a single TRP. In some implementations, the TRs are derived from SEQ ID NO: 1. In some implementations, the TRs are derived from SEQ ID NO: 2. In some implementations, the TRs are derived from SEQ ID NO: 27.

[0113] In some embodiments, each TR contains an amino acid sequence selected from formulas I, III, and IV. In some embodiments, each TR contains an amino acid sequence selected from formulas II and V through VII. In some embodiments, each TR contains an amino acid sequence selected from formulas XXII through XXIV.

[0114] In some embodiments, each TR contains an amino acid sequence selected from formulas III, IV, and VIII to XIV. In some embodiments, each TR contains an amino acid sequence selected from formulas V to VII and XV to XX. In some embodiments, each TR contains an amino acid sequence selected from formulas XXIII to XXXV.

[0115] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 25, 26, or 28. SEQ ID NO: 25 YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 26 MDIRSLLNPLPSPGPGERAPGKRASDATPRALPSSLPDFGLPQGKRRKTTVGSSPGGRPRQDLSTLSAFFQRARVSEDAHPASATVEQSGPLGATNW ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH SKEDIVKAGAKQRGAAAHVKQMANACRIKQESAAQSPRPMPTVLVERPIDQARTAFIPELQHCDLTGGTPIWSLDEASRVVLRHPMDPIEGNNDLFPLRDLTRPLDRVYERYADKNGKCHPNVKLTNIDLASGYKKYFNELCRDSRVGLSPSETANVRGRLLTNARTEFERLIREEAAPERPCKVRQLDHGGLLEHERMLAGQYGLFLAPAHSPQDQCTLRNGRILGFYMGMFAANEQQINAIEAQHPDYESYAMDAMRPGGKLTVYSALGCANDLAFANTALCADTPEPAYDRERLNAEFIPFEVKLTDRHGKPARETVVAMVALDNAIGKEIRVDYGDAFLRQFTTPRDRARSEEDAVVVKMEVDD, SEQ ID NO: 28 MNTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG VSQDGILTLATKHRGASGALQSKLSELTAAGR Where X1X2 are RVDs selected from the following groups: HD (recognizes 
C), NN (recognizes A and G), NG (recognizes T), NS / NI (recognizes A), NK (recognizes G), KS / KG (recognizes T), and HI (recognizes A and G).
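The scaffold sequences above carry an X1X2 placeholder in every repeat. A minimal sketch of programming such a scaffold, assuming the SEQ ID NO: 25 layout shown above (N-terminal segment, nine repeat templates, C-terminal segment), is given below. The function name `assemble` and the example RVD list are hypothetical and are not taken from this disclosure.

```python
# Segments of SEQ ID NO: 25 as written above.
N_TERM = "YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL"
REPEATS = [
    "GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG",
    "FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG",
    "FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG",
    "FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG",
    "FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG",
    "FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG",
]
C_TERM = "VSQAEILTLATKYRGASVALQSKLKGLTAAGR"


def assemble(n_term, repeats, rvds, c_term):
    """Fill each X1X2 placeholder with one RVD and join all segments."""
    if len(rvds) != len(repeats):
        raise ValueError("need exactly one RVD per repeat")
    filled = []
    for template, rvd in zip(repeats, rvds):
        if len(rvd) != 2 or template.count("X1X2") != 1:
            raise ValueError("bad RVD or repeat template")
        filled.append(template.replace("X1X2", rvd))
    return n_term + "".join(filled) + c_term


# Hypothetical RVD choice, one per repeat, in N-to-C order.
rvds = ["NI", "HD", "NK", "NG", "NI", "HD", "NK", "NG", "NI"]
protein = assemble(N_TERM, REPEATS, rvds, C_TERM)
print(len(protein))
```

With these segments the assembled length is 53 + 9 × 33 + 32 residues, which agrees with the C-terminal region of SEQ ID NO: 1 being described as starting at position 351.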

[0116] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 1, 2 or 27, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 99.5% identity with any of SEQ ID NO: 1, 2 or 27, for example, having a different RVD.

[0117] In some embodiments, the TRP comprises an amino acid sequence selected from SEQ ID NO: 3-6 and 9-22, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5% identity with any of SEQ ID NO: 3-6 and 9-22, preferably with TR remaining unchanged.

[0118] In some embodiments, the fusion partner is a polypeptide that provides nuclease activity, including, for example, a fully active nuclease (generating double-strand breaks) or a partially active nuclease (a nickase). The nuclease may be an endonuclease. In some embodiments, the fusion partner is another polypeptide or domain, such as the Clo51 or FokI nuclease, that generates double-strand breaks (Guilinger et al., Nature Biotechnology, Volume 32, Issue 6, June 2014).

[0119] In some implementations, the fusion partner is a polypeptide that provides an indirect increase in transcriptional activity by acting directly on the target DNA or on a polypeptide associated with the target DNA (such as histones or other DNA-binding proteins).

[0120] In a further embodiment, the fusion partner is a polypeptide that provides methyltransferase activity, demethylase activity, acetyltransferase activity, deacetylase activity, kinase activity, phosphatase activity, ubiquitin ligase activity, deubiquitination activity, adenylation activity, deadenylation activity, SUMOylation activity, deSUMOylation activity, ribosylation activity, deribosylation activity, myristylation activity, or demyristylation activity.

[0121] In a further embodiment, the fusion partner is a polypeptide (e.g., a transcription activator or a fragment thereof, a protein or a fragment thereof that recruits a transcription activator, a small molecule / drug-responsive transcription regulator, etc.) that directly provides increased transcription of the target nucleic acid.

[0122] In some implementations, the fusion partner is a polypeptide that directs the editing of one or more bases in a polynucleotide sequence, such as a site-specific deaminase that can alter the identity of a nucleotide, for example, changing C•G to T•A or A•T to G•C (Gaudelli et al., 2017, Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage. Nature, 551(7681): 464-471; Nishida et al., 2016, Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science, 353(6305): 1248; Komor et al., 2016, Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature, 533(7603): 420-424).

[0123] The fusion peptide may contain, for example, a deaminase (such as, but not limited to, cytidine deaminase, adenine deaminase, APOBEC1, APOBEC3A, BE2, BE3, BE4, ABEs, etc.). In some embodiments, the fusion partner includes a base excision repair inhibitor and a glycosylase inhibitor (e.g., a uracil glycosylase inhibitor, which prevents uracil removal).

[0124] The fusion peptide may also contain a heterologous nuclear localization sequence (NLS). The heterologous NLS described herein may have sufficient strength to drive the fusion peptide to accumulate in a detectable amount in the eukaryotic nucleus. The NLS may contain one (monopartite) or more (e.g., bipartite) short sequences (e.g., 2 to 20 residues) of basic, positively charged residues (e.g., lysine and / or arginine). For example, the NLS may be present at the N-terminus or C-terminus of the fusion peptide. For example, two or more NLS sequences may be present at both the N-terminus and C-terminus of the fusion peptide.

4. Polynucleotides and constructs for expressing TRP or fusion peptides

[0125] The TRP or fusion peptide disclosed herein can be isolated from a recombinant source, wherein the host cell is genetically modified to express the nucleotide sequence encoding the peptide. Alternatively, the TRP or fusion peptide can be produced using a cell-free protein expression system or synthesized.

[0126] Therefore, this disclosure also provides an isolated polynucleotide comprising a nucleotide sequence encoding the TRP or fusion polypeptide of this disclosure.

[0127] TRP peptides or fusion peptides can be expressed in cells. Cells include, but are not limited to, human, non-human, animal, bacterial, fungal, insect, yeast, and plant cells.

[0128] The standard recombinant DNA and molecular cloning techniques used herein are well known in the art and are described more fully in Sambrook et al., Molecular Cloning: A Laboratory Manual; Cold Spring Harbor Laboratory: Cold Spring Harbor, NY (1989). Transformation methods are well known to those skilled in the art and are described below.

[0129] Vectors and constructs, including circular plasmids and linear polynucleotides, are also provided, containing the polynucleotide of interest and, optionally, other components including linkers, adaptors, and regulatory sequences.

[0130] In some implementations, the vector contains an expression cassette encoding a TRP or fusion peptide.

[0131] In some implementations, the expression of TRP or fusion peptides is driven by constitutive promoters, inducible promoters, or spatiotemporally specific promoters.

5. Gene editing using fusion peptides

[0132] This disclosure provides a recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a TRP fused to a fusion partner and capable of binding DNA in a sequence-specific manner.

[0133] This disclosure provides a composition comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a TRP fused to a fusion partner capable of binding DNA in a sequence-specific manner.

[0134] This disclosure provides a method for introducing a double-strand break in a target polynucleotide, comprising the step of contacting the polynucleotide with a recombinant gene editing system, the recombinant gene editing system comprising a fusion polypeptide comprising a TRP fused to a fusion partner and capable of binding DNA in a sequence-specific manner.

[0135] This disclosure provides a method for modifying a genomic sequence in a cell (such as a eukaryotic cell), comprising the step of introducing a recombinant gene editing system into the cell, the recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a TRP fused to a fusion partner and capable of binding DNA in a sequence-specific manner.

[0136] In some embodiments, the TRP is a programmed / programmable TALE-like polypeptide. In some embodiments, the programmed / programmable TALE-like polypeptide comprises an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises the following amino acid sequence: FX1X2DNLX3X4X5X6X7X8X9GX10X11X12X13LX14X15LLX16X17X18PX19LX20X21X22G, wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 is K or R, X5 is V or I, X6 is A or G, X7 is A or G, X8 and X9 are the repeat variable diresidue (RVD), X10 is G, S or A, X11 is A, Q or K, X12 is Q, H or K, X13 is A or T, X14 is Q or D, X15 is A or T, X16 is D or Q, X17 is K, V or R, X18 is G, Y or S, X19 is A, K, Q, R or T, X20 is R, A or T, X21 is Q or N, and X22 is A or G.
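A consensus formula of this kind maps directly onto a regular expression, with one character class per variable position. The sketch below is an illustration only: it accepts any two residues at the RVD positions, and the names `REPEAT_PATTERN` and `match_repeat` are hypothetical.

```python
import re

# One character class per variable position of the 33-residue formula above.
REPEAT_PATTERN = re.compile(
    r"F[GS][NP]DNL[VI][KR][VI][AG][AG]"  # F, X1-X2, DNL, X3-X7
    r"(?P<rvd>[A-Z]{2})"                 # X8-X9: the RVD (any two residues)
    r"G[GSA][AQK][QHK][AT]"              # G, X10-X13
    r"L[QD][AT]LL"                       # L, X14-X15, LL
    r"[DQ][KVR][GYS]P"                   # X16-X18, P
    r"[AKQRT]L[RAT][QN][AG]G"            # X19, L, X20-X22, G
)


def match_repeat(repeat):
    """Return the RVD if `repeat` fits the formula, otherwise None."""
    m = REPEAT_PATTERN.fullmatch(repeat)
    return m.group("rvd") if m else None


# Formula VIII with RVD "HD", formula IV with RVD "NG", formula VI with "NK".
print(match_repeat("FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG"))
print(match_repeat("FSNDNLVRIGGNGGAKKTLDTLLQVYPKLTQGG"))
print(match_repeat("FSKQEAVAIASNKGGSQALNTVLATHATLTAAG"))
```

The third example belongs to a different repeat family (formula VI) and therefore does not fit this particular formula.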

[0137] In some embodiments, the programmed / programmable TALE-like polypeptide comprises an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each tandem repeat comprises an amino acid sequence selected from formulas I to VII and XXII to XXIV: FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I), wherein X1 is G or S, X2 and X3 are the repeat variable diresidue (RVD), X4 is G or S, X5 is A or Q, X6 is H or Q, X7 is A or T, and X8 is K or R; FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II), wherein X1 is T or K, X2 is Q, E or R, X3 is A or G, X4 and X5 are the RVD, X6 is N or D, X7 is T or K, X8 is A or V, X9 is T, R or K, X10 is H or Y, X11 is A, P or Q, X12 is T or R, X13 is A, D or T, and X14 is A or V; GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III), wherein X1X2 is the RVD; FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV), wherein X1X2 is the RVD; ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V), wherein X1X2 is the RVD; FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI), wherein X1X2 is the RVD; FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII), wherein X1X2 is the RVD; FX1X2DNLX3KVAAX4X5GGX6QALLDKX7PX8LRX9AG (XXII), wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 and X5 are the RVD, X6 is A or Q, X7 is G or S, X8 is A or T, and X9 is Q or N; GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII), wherein X1X2 is the RVD; and FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV), wherein X1X2 is the RVD.

[0138] In some implementations, the RVD is selected from: HD (recognizing C), NN (recognizing A and G), NG (recognizing T), NS or NI (recognizing A), NK (recognizing G), KS or KG (recognizing T), and HI (recognizing A and G).

[0139] In some embodiments, each tandem repeat comprises an amino acid sequence selected from formulas III to XXI and XXIII to XXXV: GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III) FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV) ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V) FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI) FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII) FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (VIII) FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG (IX) FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (X) FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG (XI) FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG (XII) FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG (XIII) FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG (XIV) FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG (XV) FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG (XVI) FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG (XVII) FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG (XVIII) FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG (XIX) FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG (XX) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXI) GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII) FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXV) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXVI) FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG (XXVII) FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG (XXVIII) FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXIX) FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG (XXX) FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXI) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXII) FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG (XXXIII) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG (XXXIV) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXV) Where X1X2 is RVD.

[0140] The N-terminal and C-terminal regions can be those from TALEs known in the art (e.g., AvrBs3). The N-terminal and C-terminal regions can be those from TRPs in the STAR family of this disclosure.

[0141] In some embodiments, the N-terminal region comprises the amino acid sequence of positions 1-53 of SEQ ID NO: 1 (optionally including an additional N-terminal M if desired), the amino acid sequence of positions 1-97 of SEQ ID NO: 2, or the amino acid sequence of positions 1-39 of SEQ ID NO: 27. In some embodiments, the N-terminal region can be truncated by removing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more residues from the N-terminus. For example, the N-terminal region comprises the amino acid sequence of positions 5-53, 10-53 or 15-53 of SEQ ID NO: 1; the N-terminal region comprises the amino acid sequence of positions 5-97, 10-97, 15-97 or 20-97 of SEQ ID NO: 2; or the N-terminal region comprises the amino acid sequence of positions 5-39, 10-39 or 15-39 of SEQ ID NO: 27.

[0142] In some embodiments, the C-terminal region comprises the amino acid sequence of positions 351-382 of SEQ ID NO: 1, the amino acid sequence of positions 392-762 of SEQ ID NO: 2, or the amino acid sequence of positions 469-500 of SEQ ID NO: 27. In some embodiments, the C-terminal region can be truncated by removing 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more residues from the C-terminus. For example, the C-terminal region comprises the amino acid sequence of positions 351-380, 351-375 or 351-370 of SEQ ID NO: 1; the C-terminal region comprises the amino acid sequence of positions 392-750, 392-700 or 392-650 of SEQ ID NO: 2; or the C-terminal region comprises the amino acid sequence of positions 469-495, 469-490 or 469-485 of SEQ ID NO: 27.

[0143] In some embodiments, the N-terminal region comprises an amino acid sequence having at least 60%, 70%, 80%, 90%, 92%, 94%, 96%, or 98% identity with the amino acid sequence of positions 1-53 of SEQ ID NO: 1 (optionally including an additional N-terminal M if desired), an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, or 98% identity with the amino acid sequence of positions 1-97 of SEQ ID NO: 2, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, or 97% identity with the amino acid sequence of positions 1-39 of SEQ ID NO: 27.

[0144] In some embodiments, the C-terminal region contains an amino acid sequence that has at least 60%, 70%, 80%, 90%, 93%, or 96% identity with the amino acid sequence of positions 351-382 of SEQ ID NO: 1, at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5% identity with the amino acid sequence of positions 392-762 of SEQ ID NO: 2, or at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, or 96% identity with the amino acid sequence of positions 469-500 of SEQ ID NO: 27.

[0145] In some embodiments, the programmed / programmable TALE-like peptide comprises at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more TRs. Preferably, the programmed / programmable TALE-like peptide comprises 3-17, 4-17, 5-17, 6-17, 7-17, 8-17 or 9-17 TRs.

[0146] In some implementations, the TRs are identical to one another except for the RVD. In some implementations, the TRs differ from one another at residues other than the RVD.

[0147] In some implementations, the TRs are derived from the natural TRs of a single TRP. In some implementations, the TRs are derived from SEQ ID NO: 1. In some implementations, the TRs are derived from SEQ ID NO: 2. In some implementations, the TRs are derived from SEQ ID NO: 27.

[0148] In some embodiments, each TR contains an amino acid sequence selected from formulas I, III, and IV. In some embodiments, each TR contains an amino acid sequence selected from formulas II and V through VII. In some embodiments, each TR contains an amino acid sequence selected from formulas XXII through XXIV.

[0149] In some embodiments, each TR contains an amino acid sequence selected from formulas III, IV, and VIII to XIV. In some embodiments, each TR contains an amino acid sequence selected from formulas V to VII and XV to XX. In some embodiments, each TR contains an amino acid sequence selected from formulas XXIII to XXXV.

[0150] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 25, 26, or 28. SEQ ID NO: 25 YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 26 MDIRSLLNPLPSPGPGERAPGKRASDATPRALPSSLPDFGLPQGKRRKTTVGSSPGGRPRQDLSTLSAFFQRARVSEDAHPASATVEQSGPLGATNW ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH SKEDIVKAGAKQRGAAAHVKQMANACRIKQESAAQSPRPMPTVLVERPIDQARTAFIPELQHCDLTGGTPIWSLDEASRVVLRHPMDPIEGNNDLFPLRDLTRPLDRVYERYADKNGKCHPNVKLTNIDLASGYKKYFNELCRDSRVGLSPSETANVRGRLLTNARTEFERLIREEAAPERPCKVRQLDHGGLLEHERMLAGQYGLFLAPAHSPQDQCTLRNGRILGFYMGMFAANEQQINAIEAQHPDYESYAMDAMRPGGKLTVYSALGCANDLAFANTALCADTPEPAYDRERLNAEFIPFEVKLTDRHGKPARETVVAMVALDNAIGKEIRVDYGDAFLRQFTTPRDRARSEEDAVVVKMEVDD, SEQ ID NO: 28 MNTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG VSQDGILTLATKHRGASGALQSKLSELTAAGR Where X1X2 are RVDs selected from the following groups: HD (recognizes 
C), NN (recognizes A and G), NG (recognizes T), NS / NI (recognizes A), NK (recognizes G), KS / KG (recognizes T), and HI (recognizes A and G).

[0151] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 1, 2 or 27, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99% or 99.5% identity with any of SEQ ID NO: 1, 2 or 27, for example, having a different RVD.

[0152] In some embodiments, the TRP comprises an amino acid sequence selected from SEQ ID NOs: 3-6 and 9-22, or an amino acid sequence having at least 60%, 70%, 80%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, or 99.5% identity with any of SEQ ID NOs: 3-6 and 9-22, preferably with TR remaining unchanged.

[0153] In some embodiments, the fusion partner is a polypeptide that provides nuclease activity, including, for example, a fully active nuclease (generating double-strand breaks) or a partially active nuclease (a nickase). The nuclease may be an endonuclease. In some embodiments, the fusion partner is another polypeptide or domain, such as the Clo51 or FokI nuclease, that generates double-strand breaks (Guilinger et al., Nature Biotechnology, Volume 32, Issue 6, June 2014).

[0154] In some implementations, the fusion partner is a polypeptide that provides an indirect increase in transcriptional activity by acting directly on the target DNA or on a polypeptide associated with the target DNA (such as histones or other DNA-binding proteins).

[0155] In a further embodiment, the fusion partner is a polypeptide that provides methyltransferase activity, demethylase activity, acetyltransferase activity, deacetylase activity, kinase activity, phosphatase activity, ubiquitin ligase activity, deubiquitination activity, adenylation activity, deadenylation activity, SUMOylation activity, deSUMOylation activity, ribosylation activity, deribosylation activity, myristylation activity, or demyristylation activity.

[0156] In a further embodiment, the fusion partner is a polypeptide (e.g., a transcription activator or a fragment thereof, a protein or a fragment thereof that recruits a transcription activator, a small molecule / drug-responsive transcription regulator, etc.) that directly provides increased transcription of the target nucleic acid.

[0157] In some implementations, the fusion partner is a polypeptide that directs the editing of single or multiple bases in a polynucleotide sequence, such as a site-specific deaminase that can alter the identity of a nucleotide, for example, from C•G to T•A or from A•T to G•C (Gaudelli et al., Programmable base editing of A•T to G•C in genomic DNA without DNA cleavage. Nature (2017); Nishida et al., Targeted nucleotide editing using hybrid prokaryotic and vertebrate adaptive immune systems. Science 353(6305) (2016); Komor et al., Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533(7603) (2016): 420-424).

[0158] The fusion peptide may contain, for example, a deaminase (such as, but not limited to, cytidine deaminase, adenine deaminase, APOBEC1, APOBEC3A, BE2, BE3, BE4, ABEs, etc.). In some embodiments, the fusion partner includes a base excision repair inhibitor and a glycosylase inhibitor (e.g., a uracil glycosylase inhibitor, which prevents uracil removal).

[0159] The fusion peptide may also contain a heterologous nuclear localization sequence (NLS). The heterologous NLS described herein may have sufficient strength to drive the fusion peptide to accumulate in a detectable amount in the eukaryotic nucleus. The NLS may contain one (monopartite) or more (e.g., bipartite) short sequences (e.g., 2 to 20 residues) of basic, positively charged residues (e.g., lysine and / or arginine). For example, the NLS may be present at the N-terminus or C-terminus of the fusion peptide. For example, two or more NLS sequences may be present at both the N-terminus and C-terminus of the fusion peptide.

[0160] In some implementations, the recombinant gene editing system also includes heterologous polynucleotides, such as expression cassettes, transgenes, donor DNA, or polynucleotide modification templates.

[0161] Methods for introducing polynucleotides, peptides, or polynucleotide-protein complexes into cells or organisms are known in the art, including but not limited to microinjection, electroporation, stable transformation methods, transient transformation methods, ballistic particle acceleration (particle bombardment), whisker-mediated transformation, Agrobacterium-mediated transformation, direct gene transfer, virus-mediated introduction, transfection, transduction, cell-penetrating peptides, direct protein delivery mediated by mesoporous silica nanoparticles (MSN), topical application, sexual crossing, sexual breeding, and any combination thereof.

6. Artificial transcription factors (TFs)

[0162] This disclosure provides a randomized library of artificial TFs comprising a plurality of cells, each cell carrying a vector containing a nucleotide sequence encoding a polypeptide comprising an N-terminal region, two or more tandem repeats, and a C-terminal region.

[0163] In some implementations, each TR contains the following amino acid sequence: FX1X2DNLX3X4X5X6X7X8X9GX10X11X12X13LX14X15LLX16X17X18PX19LX20X21X22G, wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 is K or R, X5 is V or I, X6 is A or G, X7 is A or G, X8 and X9 are the repeat variable diresidue (RVD), X10 is G, S or A, X11 is A, Q or K, X12 is Q, H or K, X13 is A or T, X14 is Q or D, X15 is A or T, X16 is D or Q, X17 is K, V or R, X18 is G, Y or S, X19 is A, K, Q, R or T, X20 is R, A or T, X21 is Q or N, and X22 is A or G.

[0164] In some embodiments, each TR contains an amino acid sequence selected from formulas I to VII and XXII to XXIV: FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I), wherein X1 is G or S, X2 and X3 are the repeat variable diresidue (RVD), X4 is G or S, X5 is A or Q, X6 is H or Q, X7 is A or T, and X8 is K or R; FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II), wherein X1 is T or K, X2 is Q, E or R, X3 is A or G, X4 and X5 are the RVD, X6 is N or D, X7 is T or K, X8 is A or V, X9 is T, R or K, X10 is H or Y, X11 is A, P or Q, X12 is T or R, X13 is A, D or T, and X14 is A or V; GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III), wherein X1X2 is the RVD; FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV), wherein X1X2 is the RVD; ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V), wherein X1X2 is the RVD; FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI), wherein X1X2 is the RVD; FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII), wherein X1X2 is the RVD; FX1X2DNLX3KVAAX4X5GGX6QALLDKX7PX8LRX9AG (XXII), wherein X1 is G or S, X2 is N or P, X3 is V or I, X4 and X5 are the RVD, X6 is A or Q, X7 is G or S, X8 is A or T, and X9 is Q or N; GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII), wherein X1X2 is the RVD; and FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV), wherein X1X2 is the RVD. The tandem repeats are randomized between cells.

[0165] In some embodiments, each TR contains an amino acid sequence selected from formulas III to XXI and XXIII to XXXV: GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III) FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV) ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V) FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI) FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII) FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (VIII) FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG (IX) FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (X) FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG (XI) FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG (XII) FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG (XIII) FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG (XIV) FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG (XV) FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG (XVI) FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG (XVII) FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG (XVIII) FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG (XIX) FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG (XX) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXI) GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG (XXIII) FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG (XXIV) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXV) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXVI) FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG (XXVII) FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG (XXVIII) FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXIX) FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG (XXX) FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXI) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXII) FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG (XXXIII) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG (XXXIV) FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXXV) Where X1X2 is RVD.

[0166] In some implementations, each tandem repeat contains an amino acid sequence selected from formulas I, III, and IV.

[0167] In some embodiments, each tandem repeat comprises an amino acid sequence selected from formulas II and V through VII.

[0168] In some implementations, each TR contains an amino acid sequence selected from formulas XXII to XXIV.

[0169] In some implementations, each TR contains an amino acid sequence selected from formulas III, IV, and VIII through XIV.

[0170] In some implementations, each TR contains an amino acid sequence selected from formulas V to VII and XV to XX.

[0171] In some implementations, each TR contains an amino acid sequence selected from formulas XXIII to XXXV.

[0172] In some embodiments, the polypeptide contains at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more TRs. Preferably, the polypeptide contains 6-8 TRs.

[0173] In some embodiments, the programmed / programmable TALE-like polypeptide comprises the amino acid sequence of SEQ ID NO: 25, 26, or 28. SEQ ID NO: 25 YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 26 MDIRSLLNPLPSPGPGERAPGKRASDATPRALPSSLPDFGLPQGKRRKTTVGSSPGGRPRQDLSTLSAFFQRARVSEDAHPASATVEQSGPLGATNW ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH SKEDIVKAGAKQRGAAAHVKQMANACRIKQESAAQSPRPMPTVLVERPIDQARTAFIPELQHCDLTGGTPIWSLDEASRVVLRHPMDPIEGNNDLFPLRDLTRPLDRVYERYADKNGKCHPNVKLTNIDLASGYKKYFNELCRDSRVGLSPSETANVRGRLLTNARTEFERLIREEAAPERPCKVRQLDHGGLLEHERMLAGQYGLFLAPAHSPQDQCTLRNGRILGFYMGMFAANEQQINAIEAQHPDYESYAMDAMRPGGKLTVYSALGCANDLAFANTALCADTPEPAYDRERLNAEFIPFEVKLTDRHGKPARETVVAMVALDNAIGKEIRVDYGDAFLRQFTTPRDRARSEEDAVVVKMEVDD, SEQ ID NO: 28 MNTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GSREQVIKIAAX1X2GGQQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSPDNLIKVAAX1X2GGAQALQALLDKSPALRQAG FGPDNLVKVAAX1X2GGQQALQALLDKGPALRQAG FGPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGPDNLVKVAAX1X2GGAQALQALLDKGPTLRQAG FSPDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FGNDNLVKVAAX1X2GGQQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRNAG FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG FSNDNLVRIGGX1X2GAKKTLDTLLQVYPQLTQGG VSQDGILTLATKHRGASGALQSKLSELTAAGR Where X1X2 are RVDs selected from the following groups: HD (recognizes 
C), NN (recognizes A and G), NG (recognizes T), NS / NI (recognizes A), NK (recognizes G), KS / KG (recognizes T), and HI (recognizes A and G).
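The RVD-to-base correspondence listed above can be sketched as a lookup table. This is a minimal illustration, not part of the disclosed method: the function names and the preferred-RVD choice per base are hypothetical, and the IUPAC code R is used here to denote "A or G".

```python
# RVD-to-base table as listed in the paragraph above.
# "R" is the IUPAC code for "A or G" (used for NN and HI).
RVD_TO_BASE = {
    "HD": "C",
    "NN": "R",
    "NG": "T",
    "NS": "A",
    "NI": "A",
    "NK": "G",
    "KS": "T",
    "KG": "T",
    "HI": "R",
}


def predicted_target(rvds):
    """Return the DNA sequence (IUPAC codes) predicted for an ordered list of RVDs."""
    unknown = [r for r in rvds if r not in RVD_TO_BASE]
    if unknown:
        raise ValueError(f"RVDs not in the table: {unknown}")
    return "".join(RVD_TO_BASE[r] for r in rvds)


def rvds_for_target(dna):
    """Pick one RVD per base.

    The preference below (HD, NI, NK, NG) is an assumption made for
    illustration; the text does not rank the alternatives.
    """
    preferred = {"C": "HD", "A": "NI", "G": "NK", "T": "NG"}
    return [preferred[base] for base in dna.upper()]


if __name__ == "__main__":
    print(predicted_target(["HD", "NN", "NG", "NI"]))
    print(rvds_for_target("CATG"))
```

Each X1X2 position in SEQ ID NO: 25, 26, or 28 would take one entry of such a list, in repeat order.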

[0174] 7. Methods for screening DNA-binding TRPs
This disclosure provides a method for screening DNA-binding TRPs, comprising: i) retrieving proteins of TR families from a database; ii) predicting TRs in the proteins and screening for TRPs; and iii) building a model and using the model to analyze the screened TRPs in order to obtain TRPs predicted to bind DNA.

[0175] The database in step i) can be UniProtKB / Swiss-Prot. The prediction in step ii) can be performed using XSTREAM, with the repeating region sequence as the query to comprehensively search for all repeating regions of the TR. TR clusters can be obtained using MMseqs2.

[0176] In step iii), a model can be built by integrating two or more pre-trained transformer models (such as ProteinBERT, ProtTrans, and ESM), and the integrated model can then be trained using a dataset generated from a database (such as UniProtKB/Swiss-Prot). For example, the positive dataset is generated as follows: 1) all DNA-binding proteins (DBPs) are retrieved from UniProtKB/Swiss-Prot by searching for the keyword "DNA binding"; 2) proteins without the GO term "DNA binding" are filtered out; 3) sequences with fewer than 60 amino acids (aa) or more than 4,000 aa are removed; 4) sequences containing the characters "X|x" and "J|j" are removed; 5) the remaining sequences with sequence similarity ≥50% are grouped using MMseqs2. The non-DNA-binding protein (NDBP) dataset is generated by the following steps: 1) sequences with sequence similarity <25% to any sequence in the positive dataset are retrieved; 2) any proteins whose descriptions contain "DNA binding" are filtered out; 3) sequences containing the characters "X|x" and "J|j" are removed; 4) the remaining sequences with sequence similarity ≥50% are grouped using MMseqs2. A total of 12,989 DBPs and 121,455 NDBPs were generated. To match the number of DBPs, 12,989 NDBPs were randomly selected for training. The resulting dataset was named the UniSwiss25978 dataset.

[0177] The method may further include the step of generating a test dataset for testing the established model.

[0178] In some implementations, the method further includes: iv) Test the binding of the obtained TRP to DNA.

[0179] In some implementations, tests can be performed using conventional assays, such as those using the B1H system, BLI system, and / or SELEX system.

[0180] This disclosure further provides an apparatus for performing the method of screening DNA-binding TRPs disclosed herein, comprising: i) a unit for retrieving proteins of TR families from a database; ii) a unit for predicting TRs in the proteins and screening for TRPs; and iii) a unit for building a model and using the model to analyze the screened TRPs in order to obtain TRPs predicted to bind DNA.

[0181] This disclosure further provides an apparatus for performing the method of screening DNA-binding TRPs disclosed herein, comprising: a) Processor, b) Memory coupled to the processor, and c) Computer programs stored in memory. The processor executes a computer program to implement the method of screening DNA-binding TRPs disclosed herein.

[0182] This disclosure provides a computer-readable storage medium storing executable instructions that cause a processor to perform the method of screening DNA-binding TRPs disclosed herein.

[0183] This disclosure provides a computer program product comprising a computer program executed by a processor to implement the method of screening DNA-binding TRPs disclosed herein.

[0184] Examples
Example 1. Materials and Methods
1.1. Computational Flowchart for Characterizing Repetitive Proteins
All unique proteins were downloaded from the UniRef100 and NCBI-nr databases (frozen in August 2022). A three-step process was implemented to characterize proteins with tandem repeats (TRs).

[0185] First, XSTREAM was used to identify proteins containing periodic repeats. XSTREAM uses a short seed extension method to detect protein TRs (Newman and Cooper, 2007, XSTREAM: a practical algorithm for identification and architecture modeling of tandem repeats in protein sequences. BMC Bioinformatics, 8, 382).

[0186] Secondly, in order to classify novel and known TRs, we first collected well-known TR families from the literature and then searched for curated proteins within these families in the UniProtKB/Swiss-Prot database (Bairoch and Apweiler, 2000, The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic acids research, 28, 45-48; Chakrabarty and Parekh, 2022, DbStRiPs: Database of structural repeats in proteins. Protein Science, 31, 23-36; Andrade et al., 2001, Protein repeats: structures, functions, and evolution. Journal of structural biology, 134, 117-131; and Kamel et al., 2021, REP2: A Web Server to Detect Common Tandem Repeats in Protein Sequences. Journal of Molecular Biology, 433, 166895).

[0187] Next, we extracted the repetitive region sequences from these known TR proteins. These sequences were then used as queries in a comprehensive search against all repetitive regions of the TRs identified in the first step. Hit sequences showing at least 30% identity and 70% coverage were designated as putative known TRs and further validated by domain annotation. Finally, MMseqs2 (Steinegger and Söding, 2017, MMseqs2 enables sensitive protein sequence searching for the analysis of massive datasets. Nature Biotechnology, 35, 1026-1028) was used with the following parameters to obtain known and novel TR clusters: -c 0.7 --min-seq-id 0.3 --cov-mode 1 --cluster-mode 1. Because repeats can occur in non-integer multiples and their boundaries often do not coincide, two copies of the consensus sequence were used for each TR in the clustering process.
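The thresholds and clustering parameters above can be sketched in a few helper functions. This is an illustrative sketch only: the helper names and file paths are hypothetical, and the choice of the `easy-cluster` workflow is an assumption, since the text gives the parameters but not the MMseqs2 subcommand.

```python
def double_consensus(consensus):
    """Concatenate two copies of a repeat consensus, so that repeats with
    shifted unit boundaries can still align over a full period."""
    return consensus + consensus


def is_putative_known_tr(identity, coverage):
    """Apply the hit thresholds from the text: >=30% identity and >=70% coverage.
    Both arguments are fractions in [0, 1]."""
    return identity >= 0.30 and coverage >= 0.70


def mmseqs_cluster_command(fasta, out_prefix, tmp_dir):
    """Build an MMseqs2 command line with the parameters quoted in the text.

    Assumption: the `easy-cluster` workflow is used; paths are placeholders.
    """
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "-c", "0.7",
        "--min-seq-id", "0.3",
        "--cov-mode", "1",
        "--cluster-mode", "1",
    ]


if __name__ == "__main__":
    print(double_consensus("FSNDNLVKVAA"))
    print(" ".join(mmseqs_cluster_command("tr_consensus.fasta", "tr_clusters", "tmp")))
```

The command list can be passed to `subprocess.run(..., check=True)` on a machine where MMseqs2 is installed.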

[0188] To investigate the underlying reasons for the significantly higher incidence of orphan TRs among the novel TRs, we conducted the following analysis. First, we analyzed the genus-level assembly count through a two-step process: 1) identifying the species of origin and its corresponding genus; 2) retrieving the genus-level assembly count using the Entrez Direct E-utilities esearch command (Kans, 2023, ... Entrez programming utilities help [Internet], National Center for Biotechnology Information (US)). Secondly, we assessed the completeness of the genome assemblies to which the TRs belong. Specifically, for the TRs within each cluster size range, we retrieved the corresponding genome assemblies and then randomly selected 2,000 of them for completeness assessment using BUSCO (Simão et al., 2015, BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 31, 3210-3212).

[0189] 1.2. Generation of protein identifiers
We developed a hierarchical naming scheme for the "Target TR" dataset, which includes three classification levels: cycle length, cluster size, and number of units. Specifically, 1) all TRs are first divided into 21 major groups based on their cycle length (15, 16, ..., 35). 2) Within each cycle length group, proteins are further sorted by cluster size (C1 to C100). 3) When two clusters have the same cycle length and cluster size, their ranking is determined by the median number of units. For example, in the unique identifier "TR_15_C1_1", TR represents tandem repeat, 15 indicates a cycle length of 15, C1 represents "cluster 1", and 1 represents the position of the TR within its cluster, determined by sorting the number of units in ascending order.
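The identifier layout described above can be illustrated with a small parser. This is a sketch under the stated naming rules; the class and function names are hypothetical, and the range check on cycle length (15 to 35) follows the 21 groups mentioned in the text.

```python
import re
from typing import NamedTuple


class TRIdentifier(NamedTuple):
    cycle_length: int  # repeat cycle length, 15..35
    cluster: int       # cluster rank within the cycle-length group (the N in "CN")
    position: int      # rank of the TR within its cluster, by ascending unit count


_PATTERN = re.compile(r"^TR_(\d+)_C(\d+)_(\d+)$")


def parse_tr_identifier(identifier):
    """Split an identifier such as 'TR_15_C1_1' into its three levels."""
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a TR identifier: {identifier!r}")
    cycle_length, cluster, position = (int(g) for g in match.groups())
    if not 15 <= cycle_length <= 35:
        raise ValueError(f"cycle length out of range 15-35: {cycle_length}")
    return TRIdentifier(cycle_length, cluster, position)


def format_tr_identifier(cycle_length, cluster, position):
    """Inverse of parse_tr_identifier."""
    return f"TR_{cycle_length}_C{cluster}_{position}"


if __name__ == "__main__":
    print(parse_tr_identifier("TR_15_C1_1"))
```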

[0190] 1.3. Correlation Analysis
The correlation analysis in this study used Pearson correlation coefficients and was performed in the RStudio environment (Allaire, 2012, RStudio: integrated development environment for R. Boston, MA, 770, 165-171).

[0191] 1.4. Sequence Complexity Evaluation
Sequence complexity within repeats is represented by the normalized Shannon entropy score (NSS) (Sander and Schneider, 1991, Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56-68; and Shannon, 1948, A mathematical theory of communication. The Bell system technical journal, 27, 379-423). The Shannon score is calculated from the consensus sequence of the TR. Specifically, the Shannon entropy score is defined as the negative of the sum, over amino acids, of the product of each amino acid frequency (Pi) in the consensus repeat sequence and the binary logarithm of that frequency (log2(Pi)). The calculated Shannon score is then normalized by the length of the consensus sequence.
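The score defined above can be written out directly. The sketch below is a literal reading of the definition (entropy in bits divided by consensus length); the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """H = -sum_i P_i * log2(P_i), where P_i is the frequency of residue i."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = len(sequence)
    counts = Counter(sequence)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def normalized_shannon_score(consensus):
    """Shannon entropy of the consensus repeat divided by its length."""
    return shannon_entropy(consensus) / len(consensus)


if __name__ == "__main__":
    for repeat in ("AAAAAAAA", "ACDEFGHI", "FSNDNLVKVAA"):
        print(repeat, round(normalized_shannon_score(repeat), 4))
```

A homopolymeric repeat scores 0, while a repeat in which every residue is different reaches the maximum entropy for its length before normalization.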

[0192] 1.5. Selection of Identity Threshold
To determine an effective identity threshold for distinguishing different TRP families, we collected 100 members from eight well-known TRP families (i.e., ZNF, ANK, ARM, LRR, PPR, TPR, WD40, and TALE). Inter-family repeats and intra-family repeats refer to repeats between different TRP families and within the same TRP family, respectively. Identity values for inter-family and intra-family repeats were calculated using the EMBOSS needle program with default parameters (Rice et al., 2000, EMBOSS: the European Molecular Biology Open Software Suite. Trends in genetics, 16, 276-277). Two copies of the consensus sequence were used as input.

[0193] 1.6. Training Dataset Generation
The training dataset used to train the proposed predictor was constructed from the UniProtKB/Swiss-Prot database (Bairoch and Apweiler, 2000, The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic acids research, 28, 45-48). The positive dataset was generated as follows: 1) All DBPs were retrieved from UniProtKB/Swiss-Prot by searching for the keyword "DNA binding"; 2) Proteins without the GO term "DNA binding" were filtered out; 3) Sequences <60 amino acids (aa) or >4,000 aa were removed; 4) Sequences containing the characters "X|x" and "J|j" were removed; 5) The remaining sequences with sequence similarity ≥50% were grouped using MMseqs2. The non-DNA-binding protein (NDBP) dataset was generated by the following steps: 1) Sequences with sequence similarity <25% to any sequence in the positive dataset were retrieved; 2) Any proteins whose descriptions contained "DNA binding" were filtered out; 3) Sequences containing the characters "X|x" and "J|j" were removed; 4) The remaining sequences with sequence similarity ≥50% were grouped using MMseqs2. A total of 12,989 DBPs and 121,455 NDBPs were generated. To match the number of DBPs, 12,989 NDBPs were randomly selected for training, and the resulting dataset was named the UniSwiss25978 dataset.
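The length filter, character filter, and class balancing described above can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the keyword/GO filtering and MMseqs2 grouping steps are omitted, and a sequence is assumed to be dropped if it contains any one of X, x, J, or j.

```python
import random

# Assumption: any single occurrence of X/x or J/j disqualifies a sequence.
FORBIDDEN = set("XxJj")


def keep_sequence(seq, min_len=60, max_len=4000):
    """Keep sequences of 60..4000 aa that contain none of the forbidden characters."""
    return min_len <= len(seq) <= max_len and not (FORBIDDEN & set(seq))


def filter_sequences(records):
    """records: dict mapping protein id -> sequence; returns the retained subset."""
    return {pid: seq for pid, seq in records.items() if keep_sequence(seq)}


def balance_negatives(positives, negatives, seed=0):
    """Randomly draw as many negatives as there are positives (no replacement)."""
    if len(negatives) < len(positives):
        raise ValueError("not enough negatives to balance the dataset")
    rng = random.Random(seed)
    return rng.sample(list(negatives), k=len(positives))


if __name__ == "__main__":
    toy = {"p1": "M" * 80, "p2": "M" * 30, "p3": "M" * 79 + "X"}
    print(sorted(filter_sequences(toy)))
```

With 12,989 positives and 121,455 negatives, `balance_negatives` would return 12,989 negatives, giving the 25,978 sequences reflected in the dataset name.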

[0194] 1.7. Generation of Test Dataset
An independent test dataset was obtained from the Protein Data Bank (PDB) (Sussman et al., 1998, Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules. Acta Crystallographica Section D: Biological Crystallography, 54, 1078-1084) using criteria similar to those used in generating the training dataset, with the following additions: 1) Sequences with ≥25% sequence identity to any sequence in the training dataset used by the evaluated tool were filtered out; 2) The remaining sequences with ≥25% sequence identity were grouped using MMseqs2. After filtering, a total of 359 representative DBPs and 364 representative NDBPs were obtained. Subsequently, 300 DBPs and 300 NDBPs were randomly selected to obtain the PDB600 dataset. The performance of each tool was evaluated using several parameters, including accuracy, specificity, recall, precision, F1 score, and area under the curve (AUC). Receiver operating characteristic (ROC) curves were plotted using the R "pROC" package (Robin et al., 2011, pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC bioinformatics, 12, 1-8).
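The threshold-based metrics named above follow from the confusion matrix. The sketch below uses the standard definitions and a hypothetical function name; AUC is left out because it requires the full ranking of scores rather than binary calls.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, specificity, recall, precision and F1 for 0/1 labels.

    Label 1 = DBP, label 0 = NDBP. Ratios with a zero denominator are
    reported as 0.0 (a convention chosen for this sketch).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "specificity": ratio(tn, tn + fp),
        "recall": recall,
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```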

[0195] 1.8. Construction of DNA-binding protein prediction model
A DNA-binding protein prediction model, PLM-DBPPred, was constructed by integrating pre-trained transformer models from ProteinBERT, ProtTrans, and ESM (Brandes et al., 2022, ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 38, 2102-2110; Elnaggar et al., 2021, Prottrans: Toward understanding the language of life through self-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 44, 7112-7127; and Rives et al., 2021, Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118, e2016239118). More specifically, the prediction results of these three models are obtained independently, and the average of the three probabilities is taken as the final result. The threshold is set to 0.5.
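The averaging step described above can be sketched in a few lines. The function names are hypothetical, the three input probabilities stand in for the outputs of the ProteinBERT-, ProtTrans-, and ESM-based models, and treating a mean exactly equal to 0.5 as positive is an assumption, since the text does not say how ties are handled.

```python
def ensemble_probability(p_proteinbert, p_prottrans, p_esm):
    """Unweighted mean of the three per-model DNA-binding probabilities."""
    probs = (p_proteinbert, p_prottrans, p_esm)
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")
    return sum(probs) / len(probs)


def predict_dna_binding(p_proteinbert, p_prottrans, p_esm, threshold=0.5):
    """Return True when the averaged probability reaches the threshold.

    Assumption: '>=' is used at the boundary.
    """
    return ensemble_probability(p_proteinbert, p_prottrans, p_esm) >= threshold


if __name__ == "__main__":
    print(ensemble_probability(0.9, 0.6, 0.3), predict_dna_binding(0.9, 0.6, 0.3))
```

Averaging lets one confident model be outvoted by two sceptical ones, which is the practical difference from taking the maximum of the three scores.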

[0196] ProteinBERT
ProteinBERT is pre-trained in a self-supervised manner using approximately 106 million protein sequences from the UniRef90 database and their corresponding functional annotations from a gene ontology database. First, the model takes protein sequences as input and processes them through a feature encoding step. The resulting embedding vectors then undergo attention computation in a dedicated attention layer. To reduce feature dimensionality, the output of the attention layer is transformed by a fully connected layer and then passed through a sigmoid activation function. Additional attention layers and fully connected layers are stacked sequentially after the initial attention layer. Besides the double-layer attention classifier, a classifier built from fully connected layers is also used for comparison. The final output of the model is a probability score indicating whether a given protein exhibits DNA-binding ability.

[0197] During model training, all layers of the pre-trained model were initially frozen, except for the attention and fully connected layers, which were trained for a maximum of 10 epochs. Subsequently, all layers were unfrozen and trained for an additional three epochs. To ensure optimal learning, the dynamic learning rate adjustment technique ReduceLROnPlateau was employed. The loss was calculated using binary cross-entropy, and an early stopping strategy was applied to prevent overfitting. Using the pre-trained model, the entire training process was completed on a single Tesla P100-PCIE-12GB GPU, with the learning rate and batch size set to 0.0001 and 32, respectively.

[0198] ProtTrans
For the ProtTrans-based architecture, six models were trained by combining two ProtTrans architectures (ProtT5-XL-UniRef50 and ProtBERT-BFD) with three different classifiers. ProtT5-XL-UniRef50 is a variant of the T5 model (Raffel et al., 2020, Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21, 5485-5551) and is designed as an encoder-only architecture. It was initially trained on the BFD dataset and then fine-tuned on the UniRef50 dataset. ProtBERT-BFD is based on the BERT architecture and trained on the BFD dataset. After the protein language model (PLM), three classifiers are used to process the embedding information. The first is a regular MLP, which serves as the baseline for classification performance. The second is a Light Attention (LA) classifier, which uses a pair of 1D convolutions to extract key information (Stärk et al., 2021, Light attention predicts protein location from the language of life. Bioinformatics Advances). The third classifier is biLSTM_TextCNN, which is known for its effectiveness in sentiment classification (Jiang et al., 2022, Research on sentiment classification for netizens based on the BERT-BiLSTM-TextCNN model. PeerJ Computer Science, 8, e1005). Ultimately, each model yields a probability representing whether a given protein has DNA-binding ability.

[0199] During model training, all layers of the pre-trained ProtT5-XL-UniRef50 model were frozen. Classifier training lasted 10 epochs. The batch size was set to 64. The learning rate was initialized to 0.00005, and the Adam optimizer was used with weight decay at 0.001. Similar to the operation in ProteinBERT, ReduceLROnPlateau was used to dynamically adjust the learning rate based on validation set performance. The entire fine-tuning process was completed using a single Tesla P100-PCIE-12GB GPU.

[0200] ESM
For ESM-based models, we selected the esm2_t30_150M_UR50D and esm2_t33_650M_UR50D models for encoding. After the protein sequence embeddings were obtained, they were passed to the MLP, LA, and biLSTM_TextCNN classifiers, as outlined for ProtTrans.

[0201] During model training, we fine-tuned the model by unfreezing the pre-embedding norms, post-embedding norms, and Roberta heads used for masked language modeling layers in the pre-trained ESM model, and combined it with three different classifiers. Training spanned 10 epochs with a batch size of 120 and an initial learning rate of 0.01. Optimization was performed using AdamW with a weight decay of 0.0001. We also used ReduceLROnPlateau to dynamically adjust the learning rate based on the validation set results. The entire fine-tuning process was completed on eight Tesla P100-PCIE-12GB GPUs.

[0202] 1.9. Gene Ontology (GO) Enrichment Analysis of DBP and NDBP
The dataset for DBP GO enrichment is generated in a similar manner to the test dataset used to evaluate the DBP prediction tools, except that sequences similar to any sequences in the training dataset used by the evaluated tool are not removed. PANNZER2 (Törönen et al., 2018, PANNZER2: a rapid functional annotation webserver. Nucleic acids research, 46, W84-W88) was used for GO term identification, and terms with a PPV greater than 0.6 were then submitted to clusterProfiler for GO enrichment (Wu et al., 2021, clusterProfiler 4.0: Universal enrichment tool for interpreting omics data. The Innovation, 2, 100141).

[0203] 1.10. DNA-binding domain (DBD) search and protein domain annotation
DNA-binding domains (DBDs) were obtained from a variety of sources, including databases such as AnimalTFDB 4.0 (Shen et al., 2023, AnimalTFDB 4.0: a comprehensive animal transcription factor database updated with variation and expression annotations. Nucleic Acids Research, 51, D39-D45), PlantTFDB 5.0 (Tian et al., 2020, PlantRegMap: charting functional regulatory maps in plants. Nucleic acids research, 48, D1104-D1113) and the DNA-binding domain (DBD) database (Wilson et al., 2008, DBD - taxonomically broad transcription factor predictions: new content and functionality. Nucleic acids research, 36, D88-D92). Furthermore, manual keyword searches were performed in the Pfam database for terms such as "DNA binding" and "transcription factor" to ensure comprehensive coverage. Companion domains other than DBDs were extracted based on domain architecture files from the Pfam FTP site. The top 10 companion domains enriched for each DBD were collected and subjected to GO enrichment using GO terms derived from Pfam2GO (http://www.geneontology.org/external2go/pfam2go). DNA-related domain (DRD) numbers were extracted and manually validated based on the following keywords: DNA binding, RNA binding, nucleic acid binding, transcription factor, nuclease, helicase, deaminase, integrase, ligase, transposase, polymerase, methylase, recombinase. Proteins annotated with any DNA-related domain were added to the DRD list.

[0204] All repetitive proteins were scanned against the functional domain entries of the Pfam database (v35.1) using hmmsearch (Finn et al., 2011, HMMER web server: interactive sequence similarity searching. Nucleic acids research, W29-W37) with the "--cut_ga" option.

[0205] 1.11. Enrichment Analysis of Different Functional Annotations
The enrichment analysis of different functional annotations follows the approach used for GO enrichment (Zheng and Wang, 2008, GOEAST: a web-based software toolkit for gene ontology enrichment analysis. Nucleic acids research, 36, W358-W363). Taking DBP enrichment as an example, let A be the DBP count in cluster X, B the total number of proteins in cluster X, C the DBP count outside cluster X, and D the total number of proteins outside cluster X. For a specific functional annotation, the odds ratio is calculated as (A / B) / (C / D). The log2 odds ratio (LR) is then taken as the enrichment score of that function for each cluster (Zheng and Wang, 2008, as above). A larger LR indicates that the function is more enriched within the cluster than in the overall sample, and vice versa.
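The enrichment score defined above reduces to one expression. The sketch below applies the formula exactly as stated, (A / B) / (C / D) followed by log2; the function name is hypothetical, and rejecting zero counts is a choice made here because the logarithm is undefined in that case.

```python
import math


def log2_enrichment(a, b, c, d):
    """LR = log2((A / B) / (C / D)).

    a: annotated proteins (e.g. DBPs) inside the cluster
    b: all proteins inside the cluster
    c: annotated proteins outside the cluster
    d: all proteins outside the cluster
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four counts must be positive")
    return math.log2((a / b) / (c / d))


if __name__ == "__main__":
    # 20 of 100 proteins in the cluster are DBPs, versus 50 of 1000 outside it.
    print(round(log2_enrichment(20, 100, 50, 1000), 3))
```

A positive LR means the annotation is over-represented in the cluster, zero means no difference, and a negative LR means depletion.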

[0206] 1.12. Protein Structure Prediction
Protein structure model prediction was performed using the ColabFold v1.5.2-patch platform with default parameters (https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) (Mirdita et al., 2022, ColabFold: making protein folding accessible to all. Nature methods, 19, 679-682).

[0207] 1.13. Construction of the phylogenetic tree
Phylogenetic trees for all TRP families were constructed using repeat region sequences, except for the MOON and STAR families, whose trees were generated using full-length sequences. All phylogenetic trees were constructed using the FastTree program (Price et al., 2010, FastTree 2 - approximately maximum-likelihood trees for large alignments. PloS one, 5, e9490).

[0208] 1.14. Prediction of Transcription Activation Domains
Transcription activation domains can be predicted using ADpred (Erijman et al., 2020, A high-throughput screen for transcription activation domains reveals their sequence features and permits prediction by deep learning. Molecular cell, 78, 890-902.e896).

[0209] 1.15. Prediction of Type III Secretion Signals
Type III secretion signals (T3SS) are predicted using the EffectiveDB software suite (http://effectivedb.org) (Eichinger et al., 2016, EffectiveDB - updates and novel features for a better annotation of bacterial secreted proteins and Type III, IV, VI secretion systems. Nucleic acids research, 44, D669-D674). The threshold is set to 0.99.

[0210] 1.16. Structural Comparison
Structural comparisons were performed using the R "Bio3d" package by calculating the root mean square deviation (RMSD) matrix (Grant et al., 2021, The Bio3D packages for structural bioinformatics. Protein Sci, 30, 20-30). The resulting matrix was then visualized using the heatmap.2 function in the R "gplots" package (Warnes et al., 2009, gplots: Various R programming tools for plotting data. R package version, 2, 1).

[0211] 1.17. Target Gene Analysis
The following steps were used to identify the probable target genes of DNA-binding TRPs: 1) Promoter sequences were extracted (from -1,000 bp to the transcription start site, TSS). 2) Enriched motifs were searched in the promoter regions using the FIMO program in the MEME Suite. 3) Probable target genes were extracted from the motif hits. 4) GO annotation of all proteins in the genome was performed using PANNZER2 (Törönen et al., 2018, as above). 5) GO enrichment of the probable target genes was performed using clusterProfiler (Wu et al., 2021, as above).

[0212] 1.18. Taxonomic breadth of TR clusters
To estimate the taxonomic breadth of each TR cluster, we calculated the Lowest Common Ancestor (LCA) of its members using TaxonKit v0.13.0 (Shen and Ren, 2021, TaxonKit: A practical and efficient NCBI taxonomy toolkit. Journal of genetics and genomics, 48, 844-850). Specifically, we first clustered all 4,575,091 identified TRs using MMseqs2 (Steinegger and Söding, 2017, ibid.) with the following parameters: -c 0.7 --min-seq-id 0.3 --cov-mode 1 --cluster-mode 1. Subsequently, the LCA was calculated for all TR clusters containing at least 10 members.
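The LCA idea can be illustrated on root-to-leaf lineages. This sketch is not TaxonKit and does not reproduce its interface: the function names and the toy lineages are hypothetical, and each member is assumed to come with a complete ordered lineage.

```python
def lowest_common_ancestor(lineages):
    """Return the longest shared prefix of root-to-leaf lineages.

    lineages: list of lists, each ordered from the root downwards.
    The last element of the result is the LCA taxon.
    """
    if not lineages:
        raise ValueError("at least one lineage is required")
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def lca_per_cluster(clusters, min_members=10):
    """Compute the LCA only for clusters with at least `min_members` members.

    clusters: dict mapping cluster id -> list of member lineages.
    """
    return {
        cid: lowest_common_ancestor(members)
        for cid, members in clusters.items()
        if len(members) >= min_members
    }


if __name__ == "__main__":
    toy = [
        ["Bacteria", "PhylumA", "GenusA1"],
        ["Bacteria", "PhylumA", "GenusA2"],
        ["Bacteria", "PhylumB", "GenusB1"],
    ]
    print(lowest_common_ancestor(toy))
```

A cluster whose LCA sits near the root spans a broad range of taxa, while an LCA at genus or species level indicates a narrowly distributed TR family.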

[0213] 1.19. Protein Expression and Purification
Proteins were expressed and purified in vitro using the pET28a vector carrying in-frame N-terminal His and SUMO tags. Expression constructs for all candidates were synthesized by BGI after codon optimization for E. coli. The assembled genes were cloned into the pET28a expression vector and then transformed into E. coli cells. For protein expression, 0.1 mM IPTG was added when the OD600 reached 0.6-0.8, followed by incubation at 16°C for 18 h. The cell pellet was resuspended in binding buffer (50 mM Tris-HCl, 500 mM NaCl) and then sonicated (200 W, 3 s on / 3 s off, on ice, for 10 min). The supernatant was loaded onto a HisTrap HP column (GE Healthcare) that had been pre-washed with 5 column volumes of binding buffer. Proteins were eluted with binding buffer containing 300 mM imidazole. Size-exclusion chromatography was then performed using Superose 6 or Superose 12 HR 16/50 to obtain high-purity proteins.

[0214] 1.20. Screening based on biolayer interferometry (BLI)
The screening method based on biolayer interferometry (BLI) builds on the finding by Marklund et al. that DNA-binding proteins rapidly bind to and dissociate from various sequences but can efficiently rebind to target sequences through searching, resulting in macroscopically specific binding (Marklund et al., 2022, Sequence specificity in DNA binding is mainly governed by association. Science, 375, 442-445). Binding of a DBP to a random dsDNA library is therefore possible, and the binding signal is reflected in the response captured by the BLI experiment.

[0215] The feasibility of BLI-based screening was first tested by comparing the responses of established DBPs and NDBPs. The DBPs include: 1) T_AAVS1, a TALE designed to target the AAVS1 locus (Hockemeyer et al., 2011, Genetic engineering of human pluripotent cells using TALE nucleases. Nature biotechnology, 29, 731-734), and 2) Zif268, a natural zinc finger protein (Christy and Nathans, 1989, DNA binding site of the growth factor-inducible protein Zif268. Proceedings of the National Academy of Sciences, 86, 8737-8741). The NDBPs include: 1) PUM1, human Pumilio homolog 1 protein (Spassov and Jurecic, 2002, Cloning and comparative sequence analysis of PUM1 and PUM2 genes, human members of the Pumilio family of RNA-binding proteins. Gene, 299, 195-204), 2) SUMO, a solubility tag (Kim et al., 2002, Versatile protein tag, SUMO: its enzymology and biological function. Journal of cellular physiology, 191, 257-268), and 3) ULP1, ubiquitin-like specific protease 1. Two different loading densities (1 or 5 nM) were tested. The experimental protocol was set up as follows: BLI-based screening was performed on an Octet RED system (Fortebio) at 25°C. The Ni-NTA (NTA) biosensor was immersed in BLI assay buffer (PBS, 0.02% Tween, pH 7.4) for 10 min before use. An 88 bp dsDNA library with a random 60 nucleotide region was annealed and diluted to 100 nM. The dsDNA product was purified by PAGE gel extraction. Three replicate series were used for each protein. The procedure steps were: baseline 60 sec, loading of His-tagged protein, baseline again 60 sec, binding 60 sec, dissociation 30 sec, regeneration 30 sec. Response values were collected and used for further data processing.

[0216] 1.21. SELEX Screening
SELEX assays were performed based on several previous studies (Bouvet, 2009, DNA-Protein Interactions. Springer, pp. 139-150; and Miller et al., 2011, A TALE nuclease architecture for efficient genome editing. Nature biotechnology, 29, 143-148). The dsDNA library contained a 20 nt random sequence. For each round of selection, 200 ng of purified protein was incubated with 3 μL of Dynabeads™ His-Tag Isolation and Pulldown magnetic beads (Thermo, #10103D) at 25°C for 30 min. After removing unbound protein, 2 μg of the dsDNA library was added to the protein-magnetic bead complex and incubated in 100 μL of SELEX buffer (50 mM Tris, 150 mM NaCl, 20 mM KCl, 2.5 mM MgCl2, 10 μM ZnCl2, 0.05% Tween20, 0.01% BSA, 20 μg/mL dIdC). After incubation at room temperature for 1 hour, the sample was washed five times with SELEX buffer to remove unbound dsDNA. The bound dsDNA was then amplified for the next round of selection. After five cycles, the recovered DNA fragments were cloned and sequenced.

[0217] 1.22. B1H Screening
Plasmids used in the B1H system were purchased from Addgene (#12609 as the reporter plasmid, #18039 for expressing the target protein). Reporter libraries carrying an 18 nt random sequence for B1H selection were constructed using T4 ligase, with sticky ends generated by NotI and EcoRI. For B1H assays, the reporter vector containing the 18 nt random sequence library and the protein expression vector were co-transformed into the US0 E. coli strain via electroporation. Successful binding events were enriched on plates containing 10 mM 3-AT. Enriched cells were scraped from the plates, followed by plasmid extraction and sequencing.

[0218] 1.23. GFP activation verification
A GFP activation verification system was constructed based on the B1H screening system. Specifically, the HIS3 marker gene was replaced with a GFP reporter gene (Figure 3D). Furthermore, the 18-nt random sequence upstream of the promoter was replaced with the target enriched motif. The binding affinity for the tested motif was then indicated by the GFP signal detected by flow cytometry.

[0219] 1.24. GFP Inhibition Verification The GFP inhibition system is a modified version of the PAM-SCANR system, which was originally developed to identify functional PAM diversity across CRISPR-Cas systems (Leenay et al., 2016, Identifying and visualizing functional PAM diversity across CRISPR-Cas systems. Molecular Cell, 62, 137-147). Specifically, lacI and the lacI promoter were deleted, and the PAM sequence on the reporter plasmid was replaced with the target enriched motif (Figure 3D). After co-transformation of the protein expression plasmid and the reporter plasmid, successful binding of the protein to the tested motif blocked GFP expression, resulting in a decrease in the GFP signal detected by flow cytometry.

[0220] 1.25. EMSA For EMSA assays, FAM-labeled and unlabeled probes were generated via oligonucleotide annealing. Different protein concentration series were designed for each reaction, with specific / non-specific unlabeled probes added to the last two lanes as binding competitors. Binding reactions were incubated at room temperature for 30 minutes and separated on 6% native PAGE gels at 80 V for 1 hour. The gels were visualized using a Bio-Rad UV transilluminator.

[0221] 1.26. Negative staining sample preparation, data collection, and 2D classification averaging The proteins were purified according to the procedure described above. All complexes were reconstituted by incubating the protein and DNA at a 1:10 ratio in a binding buffer containing 10 mM Tris-HCl (pH 8.0) and 150 mM NaCl on ice for 30 minutes.

[0222] For the preparation of negatively stained samples, all samples were diluted to a final concentration of 0.5 μM and negatively stained in 2% (w/v) uranyl acetate solution following a standard deep stain protocol on holey carbon-coated EM copper grids covered with a thin layer of continuous carbon (Liu and Wang, 2011, Single particle electron microscopy reconstruction of the exosome complex using the random conical tilt method. JoVE (Journal of Visualized Experiments), e2574). The negatively stained samples were then imaged using an FEI Tecnai F20 electron microscope operating at an accelerating voltage of 200 kV. Images were taken at a nominal magnification of 50,000x with a defocus range of 2.5 to 3.5 μm. The electron micrographs were recorded using a Gatan Ultrascan4000 4k × 4k CCD camera.

[0223] The acquired micrographs were then processed in cryoSPARC as negative staining data (Punjani et al., 2017, cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nature Methods, 14, 290-296). Individual particles were selected with the Manual Picker, and the average particle size was determined by reference-free 2D classification.

[0224] 1.27. BLI for Binding Kinetics BLI experiments were performed on an Octet RED system (Fortebio) at 25°C. Before use, the streptavidin (SA) sensor was immersed in assay buffer (PBS, 0.02% Tween, pH 7.4) for 10 min. The complementary oligonucleotide pair (5' biotin-labeled) was annealed in 10 × annealing buffer (Thermo, TECH TIP # 45). The dsDNA product was purified by PAGE gel extraction. A six-point concentration series was designed for each protein, with the last point consisting of PBST buffer only as a control. The program settings were as follows: baseline 60 s, biotin-conjugated dsDNA loading 120 s, baseline again 120 s, association 120 s, dissociation 120 s. The dissociation rate constant (k_dis) and association rate constant (k_on) were determined using Octet Data Analysis software by global fitting with a model that considered the full step times and assumed 1:1 binding.
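As an illustration of the 1:1 binding model mentioned above, the following minimal Python sketch shows how an equilibrium dissociation constant follows from the two fitted rate constants (K_D = k_dis / k_on) and what the ideal association and dissociation curves look like. It is not the Octet Data Analysis algorithm; the function names and the numerical rate constants are hypothetical and chosen only for demonstration.

```python
import math


def kd_from_rates(k_on, k_dis):
    """Equilibrium dissociation constant K_D (M) from k_on (1/(M*s)) and k_dis (1/s)."""
    return k_dis / k_on


def association_response(t, conc, k_on, k_dis, r_max):
    """Ideal 1:1 association signal (nm) after t seconds at analyte concentration conc (M)."""
    k_obs = k_on * conc + k_dis                      # observed rate constant
    r_eq = r_max * conc / (conc + k_dis / k_on)      # plateau response at equilibrium
    return r_eq * (1.0 - math.exp(-k_obs * t))


def dissociation_response(t, r0, k_dis):
    """Ideal 1:1 dissociation signal (nm) t seconds after the sensor leaves the analyte."""
    return r0 * math.exp(-k_dis * t)


if __name__ == "__main__":
    # Hypothetical rate constants, not values reported in this document.
    k_on, k_dis = 1.0e5, 3.8e-4
    print("K_D (M):", kd_from_rates(k_on, k_dis))
    # Signal at the end of a 120 s association step for a 50 nM sample.
    r_end = association_response(120.0, 50e-9, k_on, k_dis, r_max=1.0)
    print("Signal after 120 s association:", r_end)
    print("Signal after 120 s dissociation:", dissociation_response(120.0, r_end, k_dis))
```

In a global fit, a single pair of k_on and k_dis values is shared across all concentrations in the series, which is why a multi-point dilution series is useful.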

[0225] 1.28. Assembly of artificial STAR proteins The one-step construction of artificial STAR proteins followed the previously reported Golden Gate method with several improvements (Cermak et al., 2011, Efficient design and assembly of custom TALEN and other TAL effector-based constructs for DNA targeting. Nucleic Acids Research, 39, e82). Specifically, the repeat modules on pHD-1 were replaced with modules derived from PqSTAR1, each with a unique overhang. Furthermore, we modified the pFUS_A vector by inserting the N-terminal and C-terminal regions of PqSTAR1 at the two ends of LacZ. Two BsaI sites were placed inside LacZ at its 5' and 3' ends, so that enzymatic linearization of the vector generates suitable overhangs for the integration of repeat modules.

[0226] 1.29. Cell lines Human 293T cells were cultured in DMEM medium containing 10% fetal bovine serum (Invitrogen, Carlsbad, USA) and 1% penicillin / streptomycin (Millipore, TMS-AB2-C). All cells were incubated in a humidified incubator at 37°C with 5% carbon dioxide.

[0227] 1.30. CUT&Tag Experimental Procedure A plasmid vector for protein expression in human 293T cells was constructed, containing an EF1α promoter and a C-terminal 3x FLAG tag. The P2A-GFP sequence following the FLAG tag was used to assess plasmid transfection efficiency. 293T cells were seeded at a density of 1.2 × 10⁵ cells/well in 24-well plates. For transfection, 1 μg of plasmid and Lipofectamine 2000 (Invitrogen) were used. Transfected cells were cultured at 37°C for 24 hours, followed by Western blotting and CUT&Tag assays. CUT&Tag assays were performed using a CUT&Tag assay kit (TD903, Vazyme Biotech) following the manufacturer's instructions.

[0228] 1.31. RNA-seq Experimental Procedure A plasmid vector for expressing STAR-based ATFs in human 293T cells was constructed, containing a CMV promoter in pVAX1 (Snapgene). The P2A-GFP sequence following the VPR activation domain was used to assess plasmid transfection efficiency. 293T cells were seeded at a density of 8 × 10⁵ cells/well in 6-well plates. For transfection, 2.5 μg of plasmid and Lipofectamine 2000 (Invitrogen) were used. Transfected cells were cultured at 37°C for 48 hours, followed by RNA extraction. Total RNA was extracted using TRIzol reagent and sequenced by Annoroad Genetics (Beijing) Co., Ltd.

[0229] 1.32. Sequencing and Data Processing SELEX, B1H, and CUT&Tag samples were sequenced on an Illumina NovaSeq PE150 platform. RNA-seq samples were sequenced on an MGI DNBSEQ-T7 platform. Read trimming and filtering were performed using Trimmomatic version 0.33 (Bolger et al., 2014, Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30, 2114-2120).

[0230] For SELEX and B1H data, after separating samples by barcode, random DNA library regions of each sample were extracted by mapping fixed flanking sequences.
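The extraction of random library regions by their fixed flanking sequences can be sketched as follows. This is a minimal Python illustration, not the pipeline used in this document: the flank sequences in the usage note are hypothetical placeholders, matching is exact (no mismatches tolerated), and both read orientations are tried.

```python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_random_region(read, left_flank, right_flank, length):
    """Return the `length`-nt region between the two fixed flanks, or None if absent."""
    pattern = re.compile(
        re.escape(left_flank) + "([ACGT]{%d})" % length + re.escape(right_flank)
    )
    # Try the read as sequenced, then its reverse complement.
    for candidate in (read, reverse_complement(read)):
        match = pattern.search(candidate)
        if match:
            return match.group(1)
    return None


def count_regions(reads, left_flank, right_flank, length):
    """Count how often each extracted random region occurs in a list of reads."""
    counts = Counter()
    for read in reads:
        region = extract_random_region(read.upper(), left_flank, right_flank, length)
        if region is not None:
            counts[region] += 1
    return counts
```

With hypothetical flanks such as `GATCCTGCAG` and `GAATTCCAGT`, `length=20` would correspond to the SELEX library and `length=18` to the B1H library; the resulting counts could then be passed to a motif discovery tool.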

[0231] For CUT&Tag data, BWA-MEM was used to align paired-end reads to the hg19 genome assembly (Li, 2013, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997). MACS 2.0 was then used for peak calling with a p-value threshold of 0.01 (Zhang et al., 2008, Model-based analysis of ChIP-Seq (MACS). Genome Biology, 9, 1-9). A control library was generated by treating wild-type 293T cells with an anti-FLAG antibody. Overlap analysis was performed using bedtools (Quinlan and Hall, 2010, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26, 841-842). Enriched motifs for all data types were discovered and visualized using the HOMER program (Heinz et al., 2010, Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell, 38, 576-589).

[0232] For RNA-seq data, HISAT2 was used to align paired-end reads to the human hg19 genome (Kim et al., 2019, Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology, 37, 907-915). Gene counts were generated using StringTie and input into DESeq2 for differential expression analysis (Pertea et al., 2015, StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology, 33, 290-295; and Love et al., 2014, Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15, 1-21). Genes were considered differentially expressed when the adjusted p-value was <0.05 and the absolute log2 fold change was >1. Motif enrichment was performed on promoter regions spanning from 2,000 bp upstream to 500 bp downstream of the transcription start site (TSS). Gene set enrichment analysis was performed using the clusterProfiler R package in RStudio (Wu et al., 2021, ibid.).
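The differential expression thresholds and the promoter window described above can be written down compactly. The sketch below is illustrative Python only (the actual analysis used DESeq2 and related tools); the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the strand handling is an assumption about how "upstream" is interpreted.

```python
def is_deg(padj, log2fc, padj_cutoff=0.05, lfc_cutoff=1.0):
    """True if a gene passes the adjusted p-value and |log2 fold change| cutoffs.

    padj may be None when no adjusted p-value is available for the gene.
    """
    return padj is not None and padj < padj_cutoff and abs(log2fc) > lfc_cutoff


def promoter_window(tss, strand, upstream=2000, downstream=500):
    """Return (start, end) of the promoter region around a TSS (1-based, inclusive).

    On the minus strand, "upstream" lies at higher genomic coordinates.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end  # clip at the chromosome start
```

For example, a plus-strand gene with its TSS at position 10,000 yields the window (8000, 10500), whereas the same TSS on the minus strand yields (9500, 12000).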

[0233] 1.33. Quantitative and Statistical Analysis Datasets were evaluated using GraphPad Prism 8. All values are expressed as mean ± standard deviation. Statistical significance for the B1H validation analysis was determined using a two-tailed Student's t-test, while RNA-seq read counts were compared using the Wilcoxon signed-rank test. In all cases, a p value <0.05 was considered statistically significant, and a p value <0.01 was considered highly significant.

[0234] Example 2: Identification and Characterization of Tandem Repeat Proteins This example was carried out to identify and characterize proteins containing tandem repeats (TRs) that are capable of binding DNA in a sequence-specific manner.

[0235] As shown in Figure 1A, XSTREAM was used with default parameters (number of units ≥ 2 and period ≥ 3) to identify proteins containing periodic repeats by searching the UniRef100 and NCBI-nr databases, which identified a total of 43,648,764 TRs. By aligning all repeat units and assessing the conservation of each column, variable amino acid (VAA) values were derived (Figure 1B).

[0236] To classify known and novel TRs, we first collected well-known TR families from the literature and then searched the UniProtKB / Swiss-Prot database for curated proteins within these families.

[0237] TRs exhibiting the following characteristics were defined as low quality and excluded from further analysis: 1) incomplete TRs, 2) compound repeats, and 3) TRs consisting of only two repeat units. In total, 4,575,091 TRs passed this filtering procedure, and the unit number and period of these TRs were plotted (Figure 1C).

[0238] Next, we predicted TRs within these proteins using XSTREAM and used their repeat region sequences as queries to search for homologs among all TRs identified in the initial step. Hit sequences with at least 30% identity and 70% coverage were designated as putative known TRs and further validated by domain annotation. Finally, known and novel TR clusters were obtained using MMseqs2 with the following parameters: -c 0.7 --min-seq-id 0.3 --cov-mode 1 --cluster-mode 1. Considering that repeats can occur in non-integer multiples and that their boundaries often do not coincide, the clustering was performed on two consecutive copies of the consensus sequence.
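One reading of the last sentence above is that each consensus repeat unit is concatenated with itself before clustering, so that two repeats whose annotated boundaries are shifted relative to each other can still be matched. The Python sketch below illustrates why this helps; it is an interpretation under that assumption, not the clustering code of this document, and the peptide strings are made-up examples.

```python
def doubled(consensus):
    """Two consecutive copies of a consensus repeat unit."""
    return consensus + consensus


def phase_offset(unit_a, unit_b):
    """Offset at which unit_b occurs inside doubled(unit_a), or -1 if it does not.

    A non-negative result for equal-length units means unit_b is a cyclic
    rotation of unit_a, i.e. the same repeat annotated with shifted boundaries.
    """
    if len(unit_a) != len(unit_b):
        return -1
    return doubled(unit_a).find(unit_b)


if __name__ == "__main__":
    unit = "LTPEQVVAIAS"        # hypothetical consensus unit
    shifted = "QVVAIASLTPE"     # same unit, boundary shifted by 4 residues
    print(unit == shifted)                 # direct comparison fails
    print(phase_offset(unit, shifted))     # the doubled sequence recovers the match
```

Real clustering relies on alignment with identity and coverage thresholds rather than exact substring matching; the exact match here only demonstrates the boundary-shift effect.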

[0239] Impressively, of the 4,575,091 TRs, only 4.4% (199,240) were classified as known TRs, while the remaining 95.6% (4,375,851) were novel.

[0240] Next, we clustered known and novel TRs using a 30% identity threshold and a 70% coverage threshold (Materials and Methods), and the two groups showed different patterns. As shown in Figure 1D, known TRs grouped into a much smaller number of unique clusters, with their periods concentrated between 15 and 55 amino acids. In contrast, the numbers of proteins and clusters among the novel TRs were much closer to each other (Figure 1D). On average, the number of proteins per cluster was much smaller for novel TRs than for known TRs (Figure 1E). Notably, a large proportion (~71%) of the novel clusters consisted of only one protein member, whereas no orphan TRs were found among the known TR clusters (Figure 1E). The high prevalence of orphan TRs among the novel TRs may be attributed either to limited sequencing coverage of genomic diversity or to poor genome assembly quality. To determine the primary cause, we quantified two metrics at different cluster sizes: (1) the number of genus-level assemblies available from NCBI for the species of origin of each protein, and (2) the completeness of the genome assembly to which the TR belongs. The median number of genus-level assemblies increased with cluster size (Figure 1F, PCC=0.97, p-value <0.01). However, assembly completeness showed only a weak trend (PCC=0.66, p-value=0.2285). These findings suggest that the high proportion of orphan TRs is mainly due to insufficient genome sequencing data from neighboring species, and that a highly diverse range of novel TR families remains to be explored.

[0241] In summary, these analyses indicate that only a very small portion of TR diversity has been studied to date, and our current understanding of TR is heavily biased towards a few well-studied families. A comprehensive exploration of TR is needed.

[0242] Example 3: Development of a Model for Predicting DNA Binding This example was carried out to develop a model for predicting DNA binding by TR-containing proteins (TRPs).

[0243] First, we applied several filtering criteria to identify TRs that are potentially programmable for binding DNA sequences. Specifically, based on known nucleic acid binding families with important biological functions, we focused on repeats of 15–35 amino acids in length occurring 6–40 times within a protein (15 ≤ period ≤ 35, 6 ≤ unit number ≤ 40), which allow proper folding with a reasonably complex binding interface. To ensure diversity within clusters, clusters were excluded if all members had identical repeat regions. To assess diversity within repeats, a normalized Shannon entropy score (NSS) was introduced, calculated from the amino acid diversity of the consensus sequence (Materials and Methods). In total, 125,624 TRs with the defined period and unit number and an NSS greater than 0.6 passed the filtering procedure and were named "Target TRs". We developed a hierarchical naming scheme for "Target TR" proteins, incorporating multiple classification levels: period, cluster size, and unit number.
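The filtering criteria above can be illustrated with a short Python sketch. The exact definition of the NSS is given in the Materials and Methods and is not reproduced in this section, so the version below is an assumption: the Shannon entropy of the amino acid composition of a consensus repeat, divided by log2(20) so that it ranges from 0 to 1. Function names are hypothetical.

```python
import math
from collections import Counter


def normalized_shannon_entropy(consensus, alphabet_size=20):
    """Shannon entropy of residue frequencies, normalized by log2(alphabet_size).

    Assumed definition for illustration; 0 means a single residue type,
    1 means all 20 amino acids are used equally often.
    """
    n = len(consensus)
    counts = Counter(consensus)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(alphabet_size)


def is_target_tr(period, unit_number, nss):
    """Apply the thresholds stated in the text: period, unit number, and NSS."""
    return 15 <= period <= 35 and 6 <= unit_number <= 40 and nss > 0.6
```

Under this definition, a low-complexity consensus such as a poly-glutamine stretch scores near 0 and is filtered out, while a TALE-like 34-residue unit using many different residues scores well above 0.6.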

[0244] Various computational techniques have been developed for predicting DNA-binding proteins (DBPs) by integrating multiple information sources, such as sequence, structural features, and physicochemical properties. We conducted a comprehensive analysis of all available DBP prediction models developed since 2010 and found that 78% of the models were based on traditional machine learning (TML), while 22% were based on deep learning (DL). Furthermore, TML-based models tended to use smaller training datasets than DL-based models. Notably, approximately 67% of the TML models used the PDB1075 dataset (Liu et al., 2014, iDNA-Prot|dis: identifying DNA-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PLoS ONE, 9, e106691). Therefore, our work focused on two aspects: 1) generating high-quality datasets; 2) developing robust models.

[0245] First, we assembled a high-quality training dataset from the UniProtKB / Swiss-Prot database, containing 12,989 DBPs and an equal number of non-DBPs (NDBPs), collectively referred to as the UniSwiss25978 dataset (version 2022.10; Figure 2A). In addition, a separate test dataset, named PDB600, was generated, consisting of 300 DBPs and 300 NDBPs derived from the PDB database. We then investigated three well-known protein language models (PLMs): ProteinBERT (Brandes et al., 2022, ibid.), ProtTrans (Elnaggar et al., 2021, ibid.), and ESM (Rives et al., 2021, ibid.), each with a unique architecture and pre-trained on different datasets. We integrated these PLMs with different classification layers. For the ProteinBERT-based models, we evaluated two architectures: one with two attention layers and the other with only fully connected layers. For the ProtTrans-based and ESM-based models, we combined each model with three different classifiers: MLP, light attention (LA), and biLSTM_TextCNN. Finally, probabilities were obtained for each model via a feedforward network (Materials and Methods). We fine-tuned these models using the UniSwiss25978 dataset and then evaluated their performance using the PDB600 dataset.

[0246] Based on the test results, the optimal combination was selected for each language model: ProteinBERT with two attention layers, ProtT5 with an LA classifier, and ESM2_30 with an LA classifier. ROC-AUC analysis showed that all three models had robust predictive ability in recognizing DBPs, with the ProtTrans-based model producing the best result, an AUC of 0.88. To further improve performance, we employed an ensemble learning strategy to integrate all three models. Combining these models improved classification performance, achieving an AUC of 0.90 (Figure 2B). This newly developed transformer-based model, which combines three PLMs, was named PLM-DBPPred (Protein Language Model Enhanced DNA Binding Protein Prediction; Figure 2A). Next, we compared it with existing state-of-the-art methods. PLM-DBPPred demonstrated the best performance in AUC, accuracy, and F1 score, with values of 0.90, 0.82, and 0.80, respectively (Figure 2C).
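The text does not state how the three model outputs are combined, so the sketch below assumes the simplest ensemble strategy, averaging the predicted probabilities (soft voting), and pairs it with a plain-Python ROC-AUC computed as the fraction of positive/negative pairs ranked correctly. It illustrates the evaluation idea only and is not the PLM-DBPPred implementation; all names and numbers are hypothetical.

```python
def ensemble_probability(model_probs):
    """Average the DBP probabilities predicted by several models for one protein."""
    return sum(model_probs) / len(model_probs)


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical per-protein probabilities from three models (rows = proteins).
    per_model = [[0.9, 0.8, 0.7], [0.4, 0.6, 0.5], [0.5, 0.3, 0.4], [0.1, 0.2, 0.3]]
    labels = [1, 1, 0, 0]
    combined = [ensemble_probability(p) for p in per_model]
    print(roc_auc(labels, combined))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a 600-protein test set but would be replaced by a rank-based formula for larger data.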

[0247] Using PLM-DBPPred, we performed DBP prediction on full-length proteins in the “target TR” dataset, resulting in a candidate list containing 8,865 TRPs distributed across 1,640 clusters, all with potential DNA binding capacity.

[0248] Example 4: Screening for DNA-Binding TRPs 4.1. Selecting TRP Candidates To further prioritize candidates with higher DBP potential, we investigated well-studied DBPs to gain a deeper understanding of the biological processes they are involved in and their domain characteristics. We first performed gene ontology (GO) analysis using a dataset containing 1,000 DBPs and 1,000 NDBPs from the PDB database. As expected, DBPs were highly enriched for DNA-related metabolism and function compared to NDBPs. We traced the enriched GO terms to higher-level terms, including GO:0090304 (nucleic acid metabolic process), GO:0003700 (DNA-binding transcription factor activity), GO:0003676 (nucleic acid binding), GO:0140640 (catalytic activity, acting on a nucleic acid), and GO:0005634 (nucleus). TRPs annotated with these terms or their child terms were grouped into a list of DNA-binding-related GOs (DBGOs).

[0249] Given that DBPs typically consist of a DNA-binding domain (DBD) and specific functional domains, we analyzed all previously identified domains that accompany DBDs and found that they are enriched in nucleic acid metabolic processes and occur in various forms, such as hydrolases, nucleases, and methyltransferases. TRPs annotated with these domains were included in the DRD list.

[0250] Given that proteins within a cluster often share similar functional annotations and features, we established several selection strategies at the cluster level. For the functional annotation-based strategy, we calculated an enrichment score for each functional annotation (see Materials and Methods). A cluster was designated as having a specific function when the corresponding log2 odds ratio (LR) score exceeded 0. Based on this, the overlaps among DBP, DBGO, and DRD clusters were determined (Figure 2D). We first selected 51 clusters with DBP functional enrichment. Next, we selected 15 clusters with DRD / DBGO functional enrichment. Furthermore, considering that all these functional annotations rely on existing knowledge, we selected 9 additional clusters without any functional annotation, based solely on basic protein characteristics such as small size. Because of the varying cluster sizes and the difficulty of synthesizing genes containing tandem repeats, a variable number (1–3) of proteins was selected from each cluster for gene synthesis. Overall, genes encoding 100 members distributed across 75 clusters were successfully synthesized. Notably, in some cases in which proteins contained DRDs, only gene fragments encoding the repeat regions were synthesized.
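The enrichment score itself is defined in the Materials and Methods, which are not part of this section. As a hedged illustration of what a log2 odds ratio with a threshold of 0 means, the Python sketch below compares how often an annotation occurs inside a cluster versus outside it. The 2 × 2 table layout, the pseudocount of 0.5, and the function names are assumptions made for this example.

```python
import math


def log2_odds_ratio(in_with, in_without, out_with, out_without, pseudocount=0.5):
    """log2 odds ratio of an annotation inside a cluster versus outside it.

    in_with / in_without: cluster members with / without the annotation.
    out_with / out_without: all other proteins with / without the annotation.
    A pseudocount is added to every cell so that empty cells do not cause
    a division by zero (an assumption of this sketch).
    """
    a = in_with + pseudocount
    b = in_without + pseudocount
    c = out_with + pseudocount
    d = out_without + pseudocount
    return math.log2((a * d) / (b * c))


def is_enriched(in_with, in_without, out_with, out_without):
    """A cluster counts as enriched when the log2 odds ratio exceeds 0."""
    return log2_odds_ratio(in_with, in_without, out_with, out_without) > 0
```

A score above 0 simply means that the annotation is more frequent among cluster members than in the background; it does not by itself indicate statistical significance.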

[0251] 4.2. Experimental Screening and Validation Considering the unique characteristics of different DBPs and their varying compatibility with specific functional assays, we employed both in vivo and in vitro screening strategies to identify novel DNA-binding TRPs (Figure 3A).

[0252] For the in vivo platform, the well-established B1H (bacterial one-hybrid) system was used, which translates DNA binding events into bacterial survival. We cloned all 100 candidate genes into the B1H vector and screened them as described previously (Meng et al., 2005, A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nature Biotechnology, 23, 988-994). Among the 100 candidates, we identified specific binding motifs for four proteins (Figure 3A, B).

[0253] For the in vitro platform, we first evaluated the expression level of each protein in *E. coli* and then purified the highly expressed proteins. From the 100 candidates, we obtained 28 proteins with high purity (Figure 3A). Subsequently, we developed a bio-layer interferometry (BLI)-based screening method, validated it by comparing the responses of well-known DBPs and NDBPs, and determined appropriate experimental conditions (see Materials and Methods). We then screened the 28 purified candidates using BLI and identified eight proteins exhibiting DNA-binding activity at a binding response threshold of 0.1 nm (nm shift in the BLI curve). Next, we performed SELEX (Systematic Evolution of Ligands by Exponential Enrichment) analysis on these eight candidates to determine their binding specificity (Riley et al., 2014, SELEX-seq: a method for characterizing the complete repertoire of binding site preferences for transcription factor complexes. Hox Genes: Methods and Protocols, 255-278) and identified enriched DNA sequence motifs for two of them (Figure 3A, C).

[0254] In total, we identified 11 proteins with DNA-binding activity, six of which showed sequence-specific DNA binding under specific screening conditions (SEQ ID NO: 1-6; Figure 3). These proteins are classified into the STAR (short TALE-like repeat protein), MOON (marine-derived DNA-binding protein), and pTERF (prokaryotic mTERF-like protein) families.

[0255] To further validate the specific binding activity of the six positive hits obtained from the screening, we performed four different validation assays (Figure 3D). To validate protein-DNA interactions in vitro, EMSA (electrophoretic mobility shift assay) was used (Hellman and Fried, 2007, Electrophoretic mobility shift assay (EMSA) for detecting protein–nucleic acid interactions. Nature Protocols, 2, 1849-1861). Two other methods were modified from published studies (Leenay et al., 2016, ibid.; and Meng et al., 2005, ibid.), in which protein binding to specific DNA sequences induced or inhibited the GFP signal in *E. coli* (see Materials and Methods). Furthermore, CUT&Tag (Cleavage Under Targets and Tagmentation) was used to identify specific DNA binding in mammalian cells (Kaya-Okur et al., 2019, CUT&Tag for efficient epigenomic profiling of small samples and single cells. Nature Communications, 10, 1930). All six positive hits were validated in two or more independent assays and were therefore considered true positives. Each protein was assigned a custom name based on its characteristics (Figure 3B, C), as described in later sections.

[0256] For each protein, we analyzed sequence signatures, predicted tertiary structures using AlphaFold2, and identified homologs. Subsequently, we characterized common features for each family, including validating repeat boundaries and overall protein architecture, visualizing repeat units and secondary structure patterns, and constructing phylogenetic trees. In some cases, additional homologs were synthesized to further elucidate the binding properties of the family (Figure 3D).

[0257] Example 5: Characterization of the STAR Family This example was carried out to characterize the STAR family (PqSTAR1 (SEQ ID NO: 1) and AspSTAR1 (SEQ ID NO: 2)).

[0258] 5.1. Characterization of STAR Proteins Through B1H screening, we identified enriched DNA motifs for PqSTAR1 and AspSTAR1 (Figure 3B). Subsequently, we confirmed their specific binding to the identified DNA sequences using EMSA and GFP activation assays (Figure 4A and B). Further quantification by BLI showed that the binding affinities (Kd) of PqSTAR1 and AspSTAR1 were approximately 3.8 nM and 224 nM, respectively (Figure 4C). Notably, negative staining electron microscopy (EM) analysis (see 1.26 of Example 1) showed that the proteins were structurally more homogeneous in the presence of DNA than in its absence (Figure 4D). The 2D classification results showed that the average particle sizes of the PqSTAR1-DNA and AspSTAR1-DNA complexes were approximately 120 Å and 240 Å, respectively (Figure 4D).

[0259] Comparative analysis of these sequences revealed a certain degree of conservation at specific amino acid positions (Figure 3B). Interestingly, we observed high variability at positions 12 and 13, similar to the repeat variable diresidues (RVDs) of TALE repeats (Figure 3B). Therefore, we compared these two newly identified proteins with a classic TALE (using AvrBs3 as the reference). PqSTAR1, AspSTAR1, and AvrBs3 share certain architectural features, such as the N-terminal type III secretion system (T3SS) signal domain and the central DNA-binding repeat domain (Figure 4E). However, the differences are also obvious: the full-length PqSTAR1 and AspSTAR1 proteins and their non-repetitive regions are shorter (note: the predicted coding sequence of PqSTAR1 lacks a start codon), they have fewer repeat units, and both lack a C-terminal transcriptional activation domain. Pairwise comparisons of the full-length proteins and repeat regions showed low sequence identity with the classic TALE (Figure 4F), which also explains why these two proteins were not initially identified as known TRs. Multiple sequence alignment showed that only 9 of the 34 positions were highly conserved across all three proteins, whereas the RVDs of PqSTAR1 and AspSTAR1 are largely the same as those of the classic TALE, and the DNA motif recognized by each protein correlates well with the order of its RVDs (Figure 4E), indicating that they share the nucleotide recognition code of TALE.

[0260] Next, we used DISOPRED3 and AlphaFold2 to predict the disorder and structure of these two proteins. The repeat regions exhibited an ordered structure, with an average pLDDT (predicted local distance difference test) score exceeding 80, indicating a plausible 3D conformation. The overall structure presented a helix-loop-helix architecture, with the RVD located within the loop region, similar to the classic TALE structure represented by AvrBs3 (Figure 4G). Notably, quantification by BLI indicated that PqSTAR1 exhibited high DNA-binding affinity (Kd = 3.8 nM) with only 9 repeats (Figure 4C), an affinity approximately 100-fold higher than that of a classic TALE with 10 repeats (Kd = 400 nM) (Rinaldi et al., 2017, The effect of increasing numbers of repeats on TAL effector DNA binding specificity. Nucleic Acids Research, 45, 6960-6970). These data collectively indicate that, despite some similarities in overall structure, these two proteins, which are distantly related to TALE, exhibit higher or comparable DNA-binding affinity, even with fewer repeat units. Therefore, we named these two proteins short TALE-like repeat proteins (STARs) and, based on their species of origin and order of characterization, named them *Pseudomonas quercus* STAR homolog 1 (PqSTAR1) and STAR homolog 1 of the genus Mycorrhiza (AspSTAR1).

[0261] Subsequently, we identified all homologous proteins within the genera from which PqSTAR1 and AspSTAR1 originate. Overall, we identified seven homologs with clearly defined repeat boundaries for PqSTAR1 and two for AspSTAR1. We constructed a phylogenetic tree using the full-length sequences of these proteins (Figure 4H). Consensus sequence alignment revealed high amino acid conservation within each group, especially in PqSTAR (Figure 4H). Interestingly, both types of STAR proteins have fewer repeats overall than classic TALEs (Figure 4I), suggesting different DNA-binding properties.

[0262] The clearly defined binding sequences prompted us to search for potential target genes of PqSTAR1 and AspSTAR1. T3SS prediction (see 1.15 of Example 1) indicated that PqSTAR1 and AspSTAR1 can be secreted through a type III secretion system, suggesting that they may act on the host genome. The species of origin of PqSTAR1, *Pseudomonas quercus*, was isolated from leaf spots of *Quercus mongolica*, indicating that *Quercus mongolica* is a potential host. We performed GO enrichment analysis on potential target genes in the genome of origin (*Pseudomonas quercus*) and in the potential host genome (*Quercus mongolica*). Interestingly, GO term enrichment was observed only in the host genome gene set and primarily involved stress response processes, suggesting a role for PqSTAR1 as an effector that regulates host stress-related genes. Following the RVD recognition pattern, we performed a similar analysis on another protein, PqSTAR4 (SEQ ID NO: 27). The results (see Figure 8) were consistent with those observed for PqSTAR1, showing that PqSTAR4, like PqSTAR1, specifically recognizes and binds DNA. This strongly suggests that PqSTAR proteins participate in the regulation of host stress-related genes during the infection of *Quercus mongolica* by *Pseudomonas quercus*.

[0263] To test the DNA-binding specificity of PqSTAR1 and AspSTAR1, we designed two variants of each protein by modifying the RVDs of 1-3 repeat units (Figure 4J). Both EMSA and GFP activation assays showed that PqSTAR1 RVD variants with only two modified RVDs exhibited high specificity for the new target sequences (Figure 4J). The AspSTAR1 RVD variants exhibited similar specific binding activity, albeit at a lower level. Because PqSTAR1 binds DNA with high specificity and strong affinity, we aimed to evaluate its potential as a programmable DNA-binding tool. To this end, we first constructed a set of plasmids encoding different repeat modules derived from the PqSTAR1 backbone and assembled custom repeat arrays using the Golden Gate method (Cermak et al., 2011, ibid.). Subsequently, three artificial STAR proteins (SEQ ID NO: 29-31) targeting different sequences were assembled, each containing nine repeat units. In a GFP activation assay, all artificial STARs activated GFP expression, and their binding activity did not depend on a 5' thymine (T0) preceding the binding site (Figure 4K), which TALEs prefer (Boch et al., 2009, Breaking the code of DNA binding specificity of TAL-type III effectors. Science, 326, 1509-1512). In summary, these findings indicate that PqSTAR1 can be easily programmed to target specific DNA sequences, regardless of T0.

[0264] Next, we expressed 3x FLAG-tagged PqSTAR1 and AspSTAR1 in 293T cells and performed CUT&Tag assays. Western blot analysis showed that PqSTAR1 and AspSTAR1 were expressed efficiently in all samples. Furthermore, the motifs enriched in the CUT&Tag assay were similar to those obtained from the B1H screening and to the predicted motifs (Figure 4L). These data collectively indicate that PqSTAR1 and AspSTAR1 bind to specific DNA sequences in the human genome, supporting their potential applications in human cells.

[0265] 5.2. Gene activation via STAR-based transcriptional regulators

Artificial transcription factors (ATFs) are DNA-binding regulators designed to control the expression of a specific gene or a group of genes in a predetermined manner (Miller et al., 2011, ibid.). Although TALE- and CRISPR-based ATFs have been developed, they typically regulate a single target gene due to the relatively long target sequences required for binding.

[0266] To compare the ability of STAR and TALE to bind short sequence motifs, we first constructed artificial STARs and TALEs targeting the binding sites of two well-known TFs (NF-κB and SMAD4). Gene activation assays and EMSA assays showed that the classic TALE-based ATF, containing only 9 repeats, lacked binding activity. In contrast, the STAR-based ATF effectively bound to a 9 bp target sequence, demonstrating the unique advantage of STAR in binding short DNA motifs. For NF-κB, the target sequence is GGGAATCCC, and the STAR-based ATF has the amino acid sequence of SEQ ID NO: 23; for SMAD4, the target sequence is GGCCAGACA, and the STAR-based ATF has the amino acid sequence of SEQ ID NO: 24.
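The relationship between a designed repeat array and its 9 bp target can be made concrete with a short sketch that reads the RVD out of each repeat unit and lists it next to the corresponding target base. The repeat units are copied from SEQ ID NO: 24 in the sequence listing. The RVD position (0-based residues 11-12 of each 33-residue unit) is an assumption inferred from where "XX" appears in the template of SEQ ID NO: 25, and the function names are hypothetical. The sketch only tabulates the pairing; it does not assert a recognition code.

```python
# Repeat units of STAR_SMAD4, copied from SEQ ID NO: 24 in the sequence listing.
STAR_SMAD4_REPEATS = [
    "GGREQVIKIAANNGGKQALQALLDKSPALRQAG",
    "FGNDNLVKVAANNGGAQALQALLDKGPALRQAG",
    "FGNDNLVKVAAHDGSQHALQALLDKGPALRQAG",
    "FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG",
    "FGNDNLVKVAANNGSQQALQALLDKGPALRQAG",
    "FGNDNLVKVAANNGGAQALQALLDRGPALRQAG",
    "FSNDNLVKVAANNGGAHALQALLDKGPALRQAG",
    "FSNDNLVKVAAHDGGQQALQTLLDKGPALRQAG",
    "FSNDNLVRIGGNNGAKKTLDTLLQVYPKLTQGG",
]

# Target sequence stated for the SMAD4 binding site.
SMAD4_TARGET = "GGCCAGACA"

# Assumption: the RVD sits at 0-based positions 11-12 of each repeat unit,
# inferred from the "XX" placeholders in the SEQ ID NO: 25 template.
RVD_START = 11


def extract_rvds(repeats, start=RVD_START):
    """Return the two-residue RVD of every repeat unit."""
    return [unit[start:start + 2] for unit in repeats]


def pair_with_target(repeats, target):
    """Pair each repeat's RVD with the target base at the same position."""
    if len(repeats) != len(target):
        raise ValueError("one repeat unit is expected per target base")
    return list(zip(extract_rvds(repeats), target))


if __name__ == "__main__":
    for position, (rvd, base) in enumerate(
        pair_with_target(STAR_SMAD4_REPEATS, SMAD4_TARGET), start=1
    ):
        print(position, rvd, base)
```

The same helper can be applied to the repeat units of SEQ ID NO: 23 and the NF-κB target GGGAATCCC.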

[0267] Next, as described in 1.31 of Example 1, we fused the above-mentioned STARs with a C-terminal VPR activation domain and expressed them in human 293T cells (Figure 5A and B). RNA-seq analysis was performed on four groups of samples: the wild-type (WT) group without plasmid transduction, the VPR-only group, the group with STAR targeting the NF-κB binding site (STAR_NF-κB), and the group with STAR targeting the SMAD4 binding site (STAR_SMAD4). Robust reproducibility was observed across biological replicates, and the VPR-only group was very similar to the WT group (Figure 5C). In contrast, the groups treated with STAR-based ATFs deviated significantly from the control groups, highlighting the distinct transcriptional alterations induced by STAR-based ATFs (Figure 5C). Compared with the VPR-only group, transfection with STAR_NF-κB and STAR_SMAD4 resulted in the upregulation of 1,338 and 2,489 differentially expressed genes (DEGs, padj < 0.05 and fold change > 2), respectively, accounting for the majority (> 70%) of the DEGs (Figure 5D).
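The DEG criteria stated above (padj < 0.05 and fold change > 2) can be sketched as a simple filter. This is a minimal illustration rather than the pipeline used in the example: the record type, the function name, and the gene entries are hypothetical, and the fold change is assumed to be stored as a log2 value relative to the VPR-only group.

```python
import math
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str
    log2_fold_change: float  # assumed: log2(STAR group / VPR-only group)
    padj: float              # multiple-testing-adjusted p-value


def upregulated_degs(results, padj_cutoff=0.05, fold_change_cutoff=2.0):
    """Return names of genes with padj below the cutoff and fold change above it."""
    min_log2fc = math.log2(fold_change_cutoff)
    return [
        r.gene
        for r in results
        if r.padj < padj_cutoff and r.log2_fold_change > min_log2fc
    ]


# Hypothetical records, for illustration only.
EXAMPLE_RESULTS = [
    GeneResult("GENE_A", 2.3, 0.001),   # passes both thresholds
    GeneResult("GENE_B", 0.4, 0.001),   # significant, but fold change too small
    GeneResult("GENE_C", 3.1, 0.20),    # large change, but not significant
    GeneResult("GENE_D", -2.0, 0.01),   # significant, but downregulated
]

if __name__ == "__main__":
    print(upregulated_degs(EXAMPLE_RESULTS))
```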

[0268] Next, we performed motif enrichment and gene set enrichment analysis (GSEA, see 1.32 of Example 1) to determine whether these upregulated DEGs were indeed the targets of the designed STAR-based ATFs. Notably, the promoter regions of these upregulated DEGs showed motif enrichment very similar to the NF-κB and SMAD4 binding motifs (Figure 5E). Furthermore, previously published gene sets of NF-κB / SMAD4 target genes were significantly enriched among the upregulated genes of the STAR-based ATF groups (Figure 5F, H). The expression levels of the reported target genes were significantly higher than in the VPR-only group (Figure 5G, I). In summary, these data demonstrate that STAR-based ATFs effectively enhance the expression of a large number of endogenous genes by targeting specific regulatory motifs shared by these genes, providing proof of concept for STAR as a platform for constructing ATFs that regulate transcriptional networks.

[0269] Example 6: Characterization of the MOON family

This example is used to characterize the MOON family.

[0270] Two proteins that bind to specific DNA sequences were identified through in vitro screening. One of them (XP_022797784.1, SEQ ID NO: 3) was identified from the calyx columnar coral (*Stylophora pistillata*) and showed a binding preference for AT-rich sequences (Figure 3C). Therefore, we named it a marine-derived DNA-binding protein (MOON), specifically MOON homolog 1 of the calyx columnar coral (SpMOON1). We first confirmed its binding activity to the enriched motifs using EMSA (Figure 6A), and quantified its binding capacity using GFP activation and BLI assays with GC-rich sequences as a reference. The results showed that the binding affinity of SpMOON1 for AT-rich sequences is approximately 100 times higher than that for GC-rich sequences (Figure 6B, C). Negative staining EM analysis showed that AT-rich DNA could stabilize and homogenize the protein particles, consistent with DNA-binding activity (Figure 6D). Further 2D classification showed that the average particle size was approximately 220 Å (Figure 6D, lower panel). Collectively, these data indicate that SpMOON1 is a DNA-binding protein that prefers AT-rich sequences.

[0271] The overall architecture of SpMOON1 includes a forkhead-associated (FHA) domain at the N-terminus, a protein phosphatase 1 (PP1) domain in the middle, and a repeat region at the C-terminus (Figure 6E). Since we synthesized only the TR region of SpMOON1, its DNA-binding ability does not depend on the FHA and PP1 domains. Notably, a similar architecture has been reported in vertebrates, exemplified by the widely used proliferation marker Ki67. Human Ki67 and SpMOON1 show considerable conservation in their N-terminal regions, both containing FHA and PP1 domains. However, they differ significantly in their C-terminal regions. For example, human Ki67 encodes 16 repeats of approximately 120 amino acids, each including a highly conserved 22-amino acid sequence (TPKEKAQALEDLAGFKELFQTP), termed the Ki67 motif. The Ki67 motif and the SpMOON1 repeat unit have similar lengths but low homology (Figure 6F), and there is currently no evidence that these Ki67 repeats possess DNA-binding activity. Interestingly, the C-terminus of Ki67 has a leucine / arginine-rich (LR) domain, which has been experimentally demonstrated to bind AT-rich DNA in vitro. Therefore, we compared the repeat region of SpMOON1 with the LR region of human Ki67 and found only 22% identity. Because the predicted structure of SpMOON1 has limited confidence, structural comparison was hindered (Figure 6G). In summary, these findings indicate that SpMOON1 and Ki67 share N-terminal features but differ significantly in the region conferring DNA-binding activity. The conservation of their overall protein architecture and binding preferences provides clues to the potential function of SpMOON1 in the calyx columnar coral. Further in-depth research is needed to elucidate the evolutionary link between SpMOON1 and the human Ki67 repeats.

[0272] Next, we searched for other proteins within the SpMOON1 cluster and expanded the search to include unannotated genomic data obtained from NCBI. A total of 36 homologs were identified, primarily from the class Anthozoa (31 / 36). The domain architecture of these homologs is characterized by an N-terminal FHA domain, a central PP1 domain, and various repeat units at the C-terminus (Figure 6H). Multiple sequence alignment of the MOON protein repeat units revealed high conservation, except for a few positions in the middle (Figure 6H, I). We selected several homologs with different evolutionary distances and different sets of VAAs at the highly variable positions. Specifically, two proteins are from species more distantly related to the calyx columnar coral, namely staghorn coral (*Acropora digitifera*) and rice coral (*Montipora capitata*), and one is from the calyx columnar coral itself. Accordingly, we named them AdMOON1 (SEQ ID NO: 7), McMOON1 (SEQ ID NO: 8) and SpMOON2 (SEQ ID NO: 4), respectively. After purifying these proteins, we performed a BLI-based binding assay. All three homologs showed DNA-binding activity, while only SpMOON2 showed enrichment of specific motifs in subsequent SELEX screening (Figure 6J). Notably, the enriched motifs showed an AT-rich pattern similar to that of SpMOON1. Further BLI experiments revealed that its binding affinity for AT-rich sequences was 10-fold higher than that for GC-rich sequences, indicating a preference for AT-rich sequences (Figure 6K). The similar binding patterns of SpMOON1 and SpMOON2 can be attributed to their shared set of VAAs, suggesting that binding preferences may be determined by these VAAs.

[0273] In summary, these findings indicate that the MOON protein family possesses broad DNA-binding activity, with some members exhibiting a preference for AT-rich sequences.
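The two quantities used to describe the MOON binding preference, the AT content of a probe and the fold difference in affinity between AT-rich and GC-rich probes, can be written down in a few lines. This is a minimal sketch: the function names are hypothetical, the example probes and Kd values are invented for illustration and are not measurements from this example, and the fold preference is assumed to be the ratio of dissociation constants (a lower Kd meaning a higher affinity).

```python
def at_fraction(sequence):
    """Fraction of A/T bases in a DNA sequence (case-insensitive)."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(base in "AT" for base in sequence) / len(sequence)


def fold_preference(kd_reference_nm, kd_target_nm):
    """Affinity ratio of target over reference, assuming affinity ~ 1 / Kd."""
    if kd_reference_nm <= 0 or kd_target_nm <= 0:
        raise ValueError("Kd values must be positive")
    return kd_reference_nm / kd_target_nm


if __name__ == "__main__":
    # Hypothetical probes and Kd values, for illustration only.
    print(at_fraction("AATTTATAAT"))   # AT-rich probe
    print(at_fraction("GGCCGCGGCC"))   # GC-rich probe
    print(fold_preference(kd_reference_nm=500.0, kd_target_nm=5.0))
```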

[0274] Example 7: Characterization of the pTERF family

This example is used to characterize the pTERF family.

[0275] Another DNA-binding TRP identified through in vitro screening originates from a ruminant gut metagenome and binds to the motif ACTNNNAGTC (Figure 3C). The source metagenome-assembled genome is taxonomically classified as an unclassified Clostridium bacterium within the phylum Bacillota. We first confirmed the binding activity of this TRP using EMSA and GFP activation assays (Figure 7A, B). Further BLI assays were performed to quantify binding affinity, showing a Kd value of 1.87 nM (Figure 7C). Furthermore, we observed a relatively uniform distribution of protein particles in the presence of DNA (Figure 7D). Further 2D classification showed that the average particle size was approximately 90 Å (Figure 7D, lower panel).
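A degenerate motif such as ACTNNNAGTC can be located in a DNA sequence by translating the IUPAC nucleotide codes into a regular expression. The following is a minimal sketch under stated assumptions: the function names and the example sequence are hypothetical, only the forward strand is scanned, and exact matching is used, which is a simplification compared with scoring against an enriched motif.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC_TO_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_pattern(motif):
    """Translate an IUPAC motif into a regular-expression string."""
    return "".join(IUPAC_TO_REGEX[code] for code in motif.upper())


def find_sites(motif, sequence):
    """Return (start, matched_site) pairs on the forward strand, overlaps included."""
    # A lookahead with a capture group reports overlapping occurrences as well.
    scanner = re.compile("(?=(" + motif_to_pattern(motif) + "))")
    return [(m.start(), m.group(1)) for m in scanner.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical sequence, for illustration only.
    print(find_sites("ACTNNNAGTC", "TTACTGGGAGTCAA"))
```

To cover both strands, the same scan could be repeated on the reverse complement of the sequence.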

[0276] Notably, several mTERF (mitochondrial transcription termination factor) motifs were annotated within the repeat region. We accordingly named these proteins prokaryotic mTERF-like proteins (pTERF), and this one specifically pTERF homolog 1 (pTERF1). mTERF motifs are non-tandem repeats and are well documented in functional nucleic acid-binding proteins in eukaryotes. However, there has so far been no evidence of prokaryotic homologs of the mTERF motif. Therefore, we compared pTERF1 with human mTERF1 (UniProt accession number: MTERF1_HUMAN) and Drosophila DmTTF (UniProt accession number: MTTF_DROME). The repeats in the MTERF1_HUMAN and MTTF_DROME proteins are scattered throughout the protein sequence, while the repeats in pTERF1 are arranged in tandem (Figure 7E). Significant sequence variability exists among these three proteins (Figure 7), which is consistent with previous studies. Although the MTERF1_HUMAN and MTTF_DROME sequences have low identity, they retain conserved mTERF motif features, including a conserved proline at position 8 and leucines or other hydrophobic amino acids at positions 11, 18, and 25, forming a three-leucine, leucine zipper (LZ)-like heptad repeat (X3LX3). Notably, these conserved features were not observed in pTERF1 (Figure 3C). Given the lack of obvious sequence similarity, we performed a comparative analysis of pTERF1 and MTERF1_HUMAN (PDB accession number: 3MVA) at the structural level. The repeat region of pTERF1 showed an ordered structure with an average pLDDT score exceeding 80, and the overall structure of a pTERF1 repeat unit was similar to that of a single mTERF motif, both consisting of three α-helices. However, differences were also evident, including differences in the number of helical turns and in the local conformation of individual units (Figure 7G).

[0277] To elucidate the evolutionary relationship between pTERF1 and the eukaryotic mTERF family, we performed a comparative analysis of all experimentally characterized mTERF homologs recorded in the UniProtKB / Swiss-Prot database. A total of 30 mTERF protein sequences and their corresponding tertiary structures were retrieved for this study. The tertiary structures of two mTERF proteins, MTERF1_HUMAN and MTERF3_HUMAN, were extracted from the PDB database, while the rest were obtained from the AlphaFold2 database. Sequence-level comparisons revealed the existence of several mTERF isotypes, including mTERF1, mTERF2, mTERF3, and mTERF4. Plant mTERF4 and mammalian mTERF4 clustered in different groups, indicating different evolutionary pathways. Notably, pTERF1 showed limited sequence identity with all eukaryotic mTERF proteins, ranging from 9% to 18%. This observation may suggest that pTERF1 has a unique evolutionary trajectory compared to eukaryotic mTERFs. In contrast to the sequence-level comparisons, structural analysis (as measured by root mean square deviation (RMSD)) revealed relatively conserved relationships between the different subtypes. These findings support a level of functional conservation among these proteins despite limited sequence similarity. Importantly, both mammalian and plant mTERFs, while functioning in mitochondria or chloroplasts, are nuclear-encoded. This emerging evidence supports the role of bacterial genes in mitochondrial origin and evolution. Therefore, the identification of pTERF1 raises the exciting possibility that it is the first prokaryotic mTERF-like protein.
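Sequence identity figures such as the 9% to 18% range above are computed from pairwise alignments. The sketch below shows one common way to do this, assuming the two sequences have already been aligned to equal length with "-" as the gap character. The function name and the toy alignment are hypothetical, and the choice of denominator (alignment columns without gaps) is an assumption, since identity can also be normalized by alignment length or by the shorter sequence.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip gapped columns
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment, for illustration only (not taken from the sequence listing).
    print(round(percent_identity("ACDE-FG", "ACNEQFG"), 1))
```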

[0278] To further characterize the pTERF family, we investigated pTERF1 homologs within the same cluster and recovered five proteins. A deep homology search in the NCBI database identified an additional 14 homologs. Most of these had fewer than 6 repeat units and were therefore not included in the initial target TR list (Figure 7H). All of these proteins were obtained from metagenome-assembled genomes (MAGs). MAG completeness ranged from approximately 70% to 100%, and contamination ranged from 0% to 10%, indicating moderate to high assembly quality. Furthermore, all MAGs were taxonomically classified into the kingdom Bacteria, with the majority belonging to the phylum Bacillota. To confirm the widespread presence of pTERF in metagenomic ecosystems, we further searched for homologous sequences in the MGnify database and identified an additional 27 homologs. To investigate whether other proteins within the pTERF family also possess DNA-binding capability, we selected several genes from different positions on the phylogenetic tree for analysis, one of which (pTERF2) was successfully synthesized (Figure 7I). After purification, the protein was subjected to SELEX screening. Interestingly, it was enriched on a motif different from that of pTERF1 (Figure 7J). We further validated the DNA-binding activity using EMSA and BLI assays (Figure 7). These data indicate that the different repeat arrangements of pTERF1 and pTERF2 lead to different binding specificities, suggesting potential reprogrammability.

[0279] To gain a preliminary understanding of the function of pTERF proteins, we predicted the potential functions of the genes targeted by pTERF1 and pTERF2 (see 1.17 of Example 1). The results showed that pTERF1 and pTERF2 are associated with similar GO terms involved in RNA metabolic processes. In summary, these data reveal the binding characteristics of the pTERF family and highlight the first class of distant mTERF homologs discovered in prokaryotes, revealing their evolutionary significance and functional diversity. Furthermore, the diverse repeat arrangements within this protein family enable the recognition of diverse DNA sequences, suggesting their potential for reprogramming.

[0280] Example 8: Characterization of other TRPs

This example is used to characterize other TRPs.

[0281] As described above, the TRP of SEQ ID NO: 10 was tested by BLI assay, GFP activation and EMSA assay.

[0282] As shown in Figure 9, this TRP specifically recognizes the motif AATAGCTTTTT.

[0283] Example 9: Randomized libraries of artificial transcription factors (TFs)

The one-step construction of randomized libraries follows the previously reported Golden Gate method with several improvements (Cermak et al., 2011, Efficient design and assembly of custom TALEN and other TAL effector-based constructs for DNA targeting. Nucleic Acids Research 39, e82). Specifically, the repeat modules on pHD-1 were replaced with those derived from PqSTAR1, each module having a unique overhang. Furthermore, we modified the pFUS_A vector by inserting the N-terminal and C-terminal regions of PqSTAR1 at the two ends of LacZ. Two internal BsaI sites were strategically placed at the 5' and 3' ends of LacZ to allow the vector to be linearized with the enzyme, thereby generating suitable overhangs for the incorporation of the repeat modules. The products of these ligation reactions were used to transform *E. coli* to obtain independent transformants. Plasmids were purified from these transformants to construct the random libraries.

[0284] Results: a) Tool development: We generated a set of different STAR repeat modules, each with unique DNA-binding specificity. These modules were then randomized and assembled into an artificial STAR library designed to target arbitrary N-mer DNA sequences. These proteins were then fused with various regulatory elements, such as transcription activators, transcription repressors, and epigenetic regulatory elements.

[0285] b) Phenotypic screening: The above tools can be used to screen for functional phenotypes in bacterial, mammalian, and plant cells.
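The combinatorial idea behind the randomized library in a) can be sketched in a few lines: with m distinct repeat modules and N repeat positions, the theoretical library contains m^N distinct arrays, and each clone corresponds to one random draw. This is only an illustration under stated assumptions. The module labels below are RVDs that occur in the sequence listing, but the actual module set of the library is not specified in this example, the function names are hypothetical, and uniform, independent sampling does not model ligation bias or transformation efficiency.

```python
import random

# Illustrative module labels (RVDs that occur in the sequence listing);
# the real module set used for the library is an assumption here.
MODULES = ["NN", "HD", "NG", "NI"]


def library_size(n_modules, n_repeats):
    """Number of distinct repeat arrays for n_repeats positions."""
    return n_modules ** n_repeats


def random_array(modules, n_repeats, rng):
    """Draw one repeat array, choosing each position independently and uniformly."""
    return [rng.choice(modules) for _ in range(n_repeats)]


def sample_library(modules, n_repeats, n_clones, seed=0):
    """Simulate picking n_clones independent transformants from the library."""
    rng = random.Random(seed)
    return [random_array(modules, n_repeats, rng) for _ in range(n_clones)]


if __name__ == "__main__":
    print(library_size(len(MODULES), 9))
    for clone in sample_library(MODULES, n_repeats=9, n_clones=3):
        print("-".join(clone))
```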

[0286] Sequences of the identified TR-containing proteins

Repeat sequences are indicated by underlining, and RVDs are indicated in bold.

[0287] SEQ ID NO: 1 Pq STAR1 (WP_178089108.1) YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAANNGGKQALQALLDKSPALRQAG FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQHALQALLDKGPALRQAG FGNDNLVKVAANGGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQQALQALLDKGPALRQAG FGNDNLVKVAAHDGGAQALQALLDRGPALRQAG FSNDNLVKVAANGGGAHALQALLDKGPALRQAG FSNDNLVKVAANNGGQQALQTLLDKGPALRQAG FSNDNLVRIGGNGGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 2 Aspen STAR1 (KAG0189736.1) MDIRSLLNPLPSPGPGERAPGKRASDATPRALPSSLPDFGLPQGKRRKTTVGSSPGGRPRQDLSTLSSAFFQRARVSEDAHPASATVEQSGPLGATNW ILSGQETNRIKKSGGAKALETLSEKAEALHRAG FSKQEAVAIASNHGGSQALNTVLATHATLTAAG FTHQQIVAIASKGGGSQALNTVLATHAALTAAG FTHQQIVAIASNHGGSQALDKVLATHAPLTAAG FTHRQIVGIASNNGGSQALDTVLVRYAPLRDAY FKHEQIVGIASNIGGSQALDKVLATHAQLTAVG FKHEQIVAIASKGGGSQALDKVLVKYAPLTAAG FTHQQIVAIASNKGGSQALDTVLATHAQLTTAG FAVEDVSAIAAHIGGAPALQAVVDHLELLMTRH SKEDIVKAGAKQRGAAHVKQMANACRIKQESAAQSPRPMPTVLVERPIDQARTAFIPELQHCDLTGGTPIWSLDEASRVVLRHPMDPIEGNNDLFPLRDLTRPLDRVYERYADKNGKCHPNVKLTNIDLASGYKKYFNELCRDSRVGLSPSETANVRGRLLTNARTEFERLIREEAAPERPCKVRQLDHGGLLEHERMLAGQYGLFLAPAHSPQDQCTLRNGRILGFYMGMFAANEQQINAIEAQHPDYESYAMDAMRPGGKLTVYSALGCANDLAFANTALCADTPEPAYDRERLNAEFIPFEVKLTDRHGKPARETVVAMVALDNAIGKEIRVDYGDAFLRQFTTPRDARSEEDAVVVKMEVDD SEQ ID NO: 3 Q MOON1 (XP_022797784.1) MNELIGKIVVIKRNGADGPQFPLRTTECLFGRKPACDIRIQLPNVSLEHALVKVEDKDKIRIVNLSTSNQTLVNGKPITEAEVQHLDVFTICDRSFRFEFASPVLKKAKEKSPKTDTRRSVPLANTPPESQTPKMPVALKGSPRTPATGGSLKENIHRLRRFSMPLQEEDENLTSPASNKPPLSTIFVNAEPQDKSYPQESTEQVSVVKSVEEVKEKTRRNSKWVSFGPVLSPEQFDKDLPPATPVRRGSTPRRGAAASSAKRRRSVLSVPKTDCIRENDDVDEEVDSNAKSSEEQQINNEQVNNNSESFGNETAAVQEKEPRFVGNTPPESGQPSSEHATPKSTKPKSRRSSSSLFEETKLELTGTSRLTASGEVDLVTEERKDTRRLSLRLAQKSPRQQSIVPEMVSSTTNDDEDIDETILDSDDYTSDEEDSMGEETRREGKETKQNTLLQKGVELQKAKKMA TPMRNEIANGVSLRQTKKKMK TPLRKEIEEGVQLNETKKSMA TPLKKDIENGTTLRQIKKKMA TPLKKEIKEGIKLQQTKKKIA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRQTKKKMA 
TPLRTDILEGVKLRETXYEK SEQ ID NO: 4 Q MOON2 (XP_022797764.1) MK TPLRKEIEEGVQLNGTKKSMA TPLKKDIENGTTLRQIKKKMA TPLKKEIKEGIKLQQTKKKIA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRETKKKMA TPLKKEIENGTSLRQTKKKMA TPLRTDILEGVKLRETKKKLQ TPVRKAIEDGVTLKKTKNKLS TPLQKAIQAKPELRQTKKTMN SEQ ID NO: 5 pTERF1 (MBQ3414883.1) MEELRKVFLNIGYTEEETNNILNSYIFKRNKPETLINKVNDNYN YLLSLGYSKEEIIKITKNSPSIYSLSIEKMQDKIE YMQKLGYSKEDIIKMTKNLPALYGLSIKNMQDKIE DMQELGYSRGDVIKITKSLPGIYSRNIKNIQEKIE DIQKLGYSREDVIKMTKSLPAIYSLSIEYMQEKIE DMQKLGYRKEDVIKMTKNLPAIYGLSIKNMQEKLE DMQKLGYSQKDVIKMTKVSPSIYNYSIEKMQEKIE DMQKLGYSREDVIKMTKSLPTIYNLSIKNMQEKIE DIQKLGYSREDVIKMTKSLPTIYSYSIENMQQKIK FYDSINLHFLAVEDSKNLMQSTELSYARYMFYKQKGIAIDEQNYRRLFINQKQFENLYGFTKARILEDYLEGYPYTFMGDEAQTKENENTILPIQAAKSLINQDGVLEEIRTVDKIENKEMSTDEIIKEGELADD SEQ ID NO: 6 pTERF2 (MBE6149963.1) MNLEELFIKLGYTKEEYEIIVNYYTFKNSKIETISNKILENY FFLSIGYSKEDVIKMTKSSPQIYGYKLENIKQKIE DIMSLGYSKEDVIKMTKSFPSIYGLSIENMKQRIE DMILLGYSKEDVIKITKSFPVIYGLKIENMKQRIE DMILLGYSKEDVIKITKSSPVIYSLSIENMKQKIE DIILLGYSKEDVIKITKSLSSIYSYSIEYIKQKME DIMSLGYSKEDVIKITKFSPTIYGLSIENMKQKIE DIMLLGYSREEVIKMTKSSPVIYGLSIENMKQKIE DVILLGYSREDVIKMTKVLPTIYTFSIENMKQKVD FYDSIDMHELAIVDPKQLMQSVDLSYARYMFYXIDIEININDYRKLFINNTSFEKKYKITKKELLEKYNKHMENKNDRTI SEQ ID NO: 7 Ad MOON1: MA TPLKKEIANGRELKQTKMKMA TPLKKEIEDGLELKQTKKRMA TPLKKEIANGLELRQTKMKMA TPLKKEIEDRLELRQTKKRMA TPLRKEISEGLKLRQTIKRMA TPLQKEIRGGVQLRQTKKRMA TPLKKEIAGGTRLRQTKKRLA TPLRKEIADGIQLKETKKKMA TPLREEIAKGIQLTKVHKKLK TPIRKEIENGFTLRSTKKKLP SEQ ID NO: 8 Mc MOON1: MPTPKGSPRTPPTGGTLKENIWALHRHSMPLPEEEEGLASPADVKVPFSSLDINVDPKEGSVDVQPGNEMKSEVAQKPKRRSKRVSFGPVLSPEQFDKDLPPATPVRRGATPKRHSTGGGKFVNTSSSSAKKRRSVAVLPCADPIVEEEFDTLKSTTPQTPELSQPNHEFQDNEERPETSSISRTFDDQDLRESIHEINTPTGSRRSSLRSARRSPSCPEHQMNIKDVLKDIRPHSLEFCQPNQEFQEIEQLPRNSELGELNESTLEVSTPSGNRRTSLRLAGTSPGFPVDHTVPECQSWLVSSDDIEEEEDETILDSDDYTSEEDENDGGEWCSDQNKDLANSEVQDVKNRMV TPLRKEIARGFQLRKTKKKMN TPLKKEIVNGIQLRQTKKKMN TPLRKEIADGIQLRETKKKMN TPLRKEIADGIQLRETKKKMN TPLRKEIADGIELRETKKKMN TPLRKEIADGIQLRETKKKMN 
TPLRKEIADGIQLRETKKKMN TPLRKEIADGIQLRETKKKMN TPLRKEIADGIQLRETKKKMN TPLRKEIADGIQLRETKKKMN TPLRKEIADGIQLRETKKKMN TPLRKEIADGIQLRRTKRKMK TPLRREIVNGIKLGQTRKKMN TPLRMEIAGEIQLRKTKKKIN TPLRKQIADGIQLRQTKKKLP SEQ ID NO: 9 A0A7G6T5J7 (cluster_2) MVERFGRDGFPTSLDEHHALAPADLPQISASAALEAYWKVPAPPAEQPATLNETVGARDRAGAKRRRPATEHKEDDRAAQRWRIGPQPAPAREESHSGTASSSKLKSRRRREFPDNLRKLATEAKGTRFQLSGEECQRVADHGGLLALKAFVDNAEALAKLEFGGHN FSKDDILKILSHKGAAQAVQALLKNAELAKLDFSGQK LSKDDILKFLGNDGGAAQAVQALLKNAELKKLDFGGHN FSKDDILKILSHKGAAQAVQALLKNAELKKLDFGGHN FSKDDILKILGNNGAAQAVQALLKNVEPLTKLEFDGQK FSKDDILKFLGNDGGAAQAVQALLKNAEALAKLDFSGQK FSKDDILKILGNNGAAQAVQALLKNVEPLTKLEFDGQK FSKDDIVQILRNYGAAQAVQALLKNVEPLTKLEFDGQK FSKDDIVKILGKSGAARAVQAMLQNADELKNMPKPQVL AAASNQRGAAAAIRKLNK SEQ ID NO: 10 A0A662FLR2 (cluster_4) MTSLEKIISKQEIEELKAKPLGGQKIEYLEQNDELINFFLLND FNKYHIQQIIHGKNWKEKLEWARKNFKTTVLEPMG FTGYHYSQIVVNAGWLEKLEWARENFEDLLKPMG FNGSHYSQIVRNADWEDKLDWARKNFKTTVLEPMG FKGYHYSQIVVNAGWLEKLEWARENFEDLLKPMG FNGYHYSQIVLGKGWEEKLEWVCNEYKDSLMIMG FSPSQTAKIIKGKEWEQKINWIKNNYWPKDEEEKPKYTPAQLTKIITKKDWQQKLKNQN SEQ ID NO: 11 EKD25150.1 (cluster_4) MVNQFSREYFWKNIIKNSLNEKRDDLSDKIKEIVWDKAKDIFVVFDKIYNQKLDITKKVFWEKYWVYALYLLDILNGLQQDPKMDGKNKKELTRMSKILQSGYDKLLEDNMTKTSVDDIQRLKGIWFLEEDIIRLLDRNDGVKTINLLKENYKKAYIILGDVKYIHEILNQFNGYEKLKTLISKWLNIQSIVP FNGHHLLQIVRNTGREEKLEYFEDKDKVEKLIKIW FSAYNLSSIVRNIGREEKLKYLEDEGNIEKFKEIW LNVSNLTQIIRNIDRKEKLNYFEDKEEMENLKNLW FEIWDLSTIVTNGKREEKLNYFKDKDKIENLKKIW FNGSHLAEIMSWWGREEKLEYFEDKDKIENLKKIG FDGYNLLQIIRNAERKEKLNYFQDKEKMKNLEKLW FKMPELSQIARNAERKEKLDYFEDKKNIKNIEKLKKFW FESLYLAQIVRNTGRKEKLKYLGSKKRIEKLEIIR FKPWHLAEIVRNSGRKEKLRYFENEHKMKALEEMW FRPWHLAEIVSWWGREEKLKYFENQNKVENLKKIW FNGSHLSQIVVWWGRKEKLEYFEDKEKVGKLMKMW FNGYNLAQIVAWWGRKEKLEYTLNHGNEILKKISH DYYAKICQKHDRKAWFQGLSKEGPITRPKFVKKIVEKKSSWEEISMENELI SEQ ID NO: 12 WP_274699397.1 (cluster_5) MTIATPRRVFGRDTDHQSLVNRGTEEHHISKMPGEELSSKPLNLTLEQQQKLKDWGIDDVSNLVNGARSGSKNLKAILGITPEQKEKLNNLG ITSDKLVQILEHHGGSKTLQVLLNITPEQEKTLNDLG VIGDKFIFIMRCDGSSRNIEALLALTQQEKETNLNNLG 
ITANKLIRVIGHHGGSKALQTVLELTPEQKGKLNNLD ITVDKLVRLLGHNGGSNTLKAVLDLSLKEEKKLNNLG ITVDDLVRIAGHAGGSGSLQAVLELTSEQREELNLG ITADDLVRIVGHTGGSKSLKSLLDLLPEQKEKLINLG ITAKNLVRVVGHGGGSSLQKVLDLTPEEQKTLNDLG VVSSKLITILNWDGGSKNIQALLTLTPEQREELTNLG LTTDDLVRIVGHAGGSKTLQAVLALTPEQKGTLNDLE ISRDRLVRLVGNSGGSKSLKTLQDLLPEQREKLINLG ITADNLVRVVGRGGSSNTLQIVLALTLEQVKKLNDFE ITKGLLVRVVRTTSGARNLRALLEITPEQKEELTKLE ITSDDLVSVVEHDGGAKNLKDILGQLPNLDKRYAHEAVRVLNQNDAHASFSDQYQAIIKQDRLQDGGDAQRASVIIARSDVHQTPPISQVRSVGQRGSVIVARDSVQQEARHLQSVIESPRVPVIVNNMTRLDMQSEPSTTNMEMKQEPVHENDDEVYFESRQPPKRPCVDNGAPILRKSTNPSQSAVTLTEDSIQSLRVTQYGAELKKAIKFLPNARKISEIKSAIAGQCKDFLKKSMEGQHAAWINSWADTAVPVDDGPLRGQSVFAKRDIKKFEVLGAYSGILHEDEGSVTSVMRKEGSAPVLTYLWNTQSKKRNIDASQYANSLATINTAHLPNNQPQEIFAKQNNLDCVRFGPNYVFYVALRNINKGEELLVDYGPDYDPFYIKKESMDDSIN SEQ ID NO: 13 CAH8242182.1 (cluster_5) MQVPFLLKKNTHKDVEGERAFPSFPVQKVSHSPCRDLPAFLRPPSTKLADQDRNTEREGIAHPSTWIQSYLAGMTTVTPRRVDQQSLVNIGIEERHISKIRGDNVSSETLKTLENLTLEQQQQLKNLGIVGDTLSNLVSGAGSGSKNLKAVLGITPEKMEKLSTFR ITPDKLVKVLEYYGGSKTLQELLNLTPEQEKTLNDLG VLGDKFVSIMKCDGASKNIQAVLALSPKQEEKLNDLR ITADELVRIAGLSGGSNSLQAVLALSPEQKEKLHDLG IAVDDLVRIAGRSGGSKNLQSLLDLLPEQKEKLIDLG ITAKNLVGIVGQSGGSKTLQVLLALTPEEGTLKNLR ITADDLVRVVGHGGGSKNLQFLLKLRSEQKEELDNLG ITEKNLVGIVGHSGGSKSLQSLLDLLPEQKEKLINLG ITGKNLVGVVGHNGGSKSLKSLLDLLPEQREKLINLG LTADNLVRIVRRGGHSSTLPIVLALTPEQEKKLNDFE ITKDLLVNVVRTTNRAKNLRTLLDITPEQKEELDKLEITSGELVSVVEHGADKNLKALLGHLPNLDKRYAHQAVRLLNQNDAHASFSNQYQAMIKQDRLQDGGDGQRASVIRWAVPQTPPILQVRTVDQRGSVIVARDSVIVKNLARLHVQSEPSTHMEVKQEPSCESHDEVCIDSHQPPKRPFIDNGAPILRKSTNPSQSAVALTEDSINSWSVIQYGAELKKALNSLSNAGKRSETKSAIAVQCKDFLKKSMEGQHAAWINSWADTDVPVDDGPLRGQSVFAKRDIKKFEVLGAYSGILHEDERSVTSVMRKEGSAPVLTYLWNTQSKKRSIDASQYANSLAAINTAHLPNSKPQEIFAKHNNLDCVRFGPNYVFYVALRNINKGEELLVDYGPDYDPFQIKKESMDDSIN SEQ ID NO: 14 WP_277432871.1 (cluster_5) MTTVTPRRVDQQSLVNIGIEERHISKIRGDNVSSETLKTLENLTLEQQQQLKNLGIVGDTLSNLVSGAGSGSKNLKAVLGITPEKMEKLSTFR ITPDKLVKVLEYYGGSKTLQELLNLTPEQEKTLNDLG VLGDKFVSIMNCEGASKNMQALLTLTQEQKETLHDLE ITAGKLIRVVGHTGGAKTLQALLDITSEQKVKLNNLD 
ITGDRLVRLLGHKGGFNTLQAVLALSPKQEEKLNDLR ITADELVRIAGLSGGSNSLQAVLALSPEQKEKLHDLG IAVDDLVRIAGRSGGSKNLQSLLDLLPEQKEKLIDLG ITAKNLVGIVGQSGGSKTLQVLLALTPEEGTLKNLR ITADDLVRVVGHGGGSKNLQFLLKLRSEQKEELDNLG ITEKNLVGIVGHSGGSKSLQSLLDLLPEQKEKLINLG ITGKNLVGVVGHNGGSKSLKSLLDLLPEQREKLINLG LTADNLVRIVRRGGHSSTLPIVLALTPEQEKKLNDFE ITKDLLVNVVRTTNRAKNLRTLLDITPEQKEELDKLE ITSGELVSVVEHGGDKNLKALLGHLPNLDKRYAHQAVRLLNQNDAHASFSNQYQAMIKQDRLQDGGDGQRASVIRWAVPQTPPILQVRTVDQRGSVIVARDSVIVKNLARLHVQSEPSTTHMEVKQEPSCESHDEVCIDSHQPPKRPFIDNGAPILRKSTNPSQSAVALTEDSINSWSVIQYGAELKKALNSLSNAGKRSETKSAIAVQCKDFLKKSMEGQHAAWINSWADTDVPVDDGPLRGQSVFAKRDIKKFEVLGAYSGILHEDERSVTSVMRKEGSAPVLTYLWNTQSKKRSIDASQYANSLAAINTAHLPNSKPQEIFAKHNNLDCVRFGPNYVFYVALRNINKSEELLVDYGPDYEPFQIKKESMDDSIN SEQ ID NO: 15 K1X4V8 (cluster_8) MVNQFSREYFWKNIIKNSLNEKRDDLSDKIKEIVWDKAKDIFVVFDKIYNQKLDITKKVFWEKYWVYALYLLDILNGLQQDPKMDGKNKKELTRMSKILQSGYDKLLEDNMTKTSVDDIQRLKGIWFLEEDIIRLLDRNDGVKTINLLKENYKKAYIILGDVKYIHEILNQFNGYEKLKTLISKWLNIQSIVPFNGHHLLQIVRNTGRE EKLEYFEDKDKVEKLIKIWFSAYNLSSIVRNIGRE EKLKYLEDEGNIEKFKEIWLNVSNLTQIIRNIDRK EKLNYFEDKEEMENLKNLWFEIWDLSTIVTNGKRE EKLNYFKDKDKIENLKKIWFNGSHLAEIMSWWGRE EKLEYFEDKDKIENLKKIGFDGYNLLQIIRNAERK EKLNYFQDKEKMKNLEKLWFKMPELSQIARNAERK EKLDYFEDKKNIKNIEKLKKFWFESLYLAQIVRNTGRK EKLKYLGSKKRIEKLEEIRFKPWHLAEIVRNSGRK EKLRYFENEHKMKALEEMWFRPWHLAEIVSWWGRE EKLKYFENQNKVENLKKIWFNGSHLSQIVVWWGRK EKLEYFEDKEKVGKLMKMWFNGYNLAQIVAWWGRK EKLEYTLNHGNEILKKISHDYYAKICQKHDRKAWFQGLSKEGPITRPKFVKKIVEKKSSWEEISMENELI SEQ ID NO: 16 A0A835Z2I7 (cluster_10) MPALTDECVAKRKADDAMWEKVADEMIIKEMSDIFKRTPAHRDPDRKRRCRRIEVSDAIIAKQMQCVKGKPEHVTVYIQQPMSGTPLIVQGLYPVANDSAETISATERDMRKTLDRNHVTAVSWNDFCVLSSTTTNKIIDQDVVATWATDPEIVADYYRRLAIQLDAVDADTAPCVFIAGNTCQAAHETAIELGLVKRITELSPLGVTVCEIDSKCFVALESRPHPSWHLMKANAPFARAIFLETMEMLNGMVRCCATGDISSDTMHQSIVTALAIDPEELQRRAEGRSFLTQLLYGNPSGRFPTKHVHLRNVKAHLPEVQAFLLKWQSRGMKQLWAILLKGGDLYLDLPSHDQV LDTWYKRLDDSFSAFICGSVASRLLDDAFMAR LETWYERLGGNFQTFICNSVASRLLDDAFMAR LETWYERLGDKFQTFMCNSVASRLLDDAFMAP LETWYERLGANFQAFICGSVASRLLDDAFMAR LDTWYERLGDKFQTFICGSVASRLLDDAFMAR 
LETWYERLGGKFQTFMCNGVASRLLDDAFMAR LETWYERLGAKFQTFICGSVASRLLDDAFMAR LETWYKRLGDKFQTFICGSVASRLLDDAFMAR LETWYERLGCKFQTFVCNGVASRLLDDAFMAR LETWYERLGKDDFVTFMSGSTAKAIEDDAVNQ RILEWHELLGEYLCTFMCNGVASRLTDPRFLAVAARWIDRLGREHFCKIFGRNSFVVRVVEQPAFEAKVLGHFIRLSSNAKALKSFLKKHEGRKLDSI SEQ ID NO: 17KAG5183916.1(cluster_10) MNGFKRPGLSVVDGGELMSASVLKKQRMCVRGDLDDVSLHLPAPVEGAPLIILGLYPGPRAATEVSRTEKEMKKLLDTGTDLAWVDLCVLSANRVNRIIDQSVTETWVDDEELIRDYFSRFVSQLEVSVDGVPCVYIAGRTCQMAFEVMINLGLLSRLAQLSSLGVYLCETGGRRFMALEGRPHPSWHLVRGGEKAARDLFVETVAMLNALSRCSRGGDVCSGSMTRHLVAAMQIDTEELLRRQEGRVFMTRLLYSNDSGRFIAEHAHLRNVKAYLPEVQEVLLKWIK RSLKTLMAILLSGAFYLNLVAFDPVLEAWHER LGEKFVTFICGGVAARLGDPAFDTALEAWHER LGEKFVTFMCNGVAARLGDFAFEAALETWQER LGAKFVTFICGGVAARLGDPAFDADLEAWLER LSAKFVTFICGGVAARLGDPTFDARLEAWHGR LGAKCVTFICDSVAARLGDPTFDAALEAWHGR LGAKFVTFICDSVVARLGEPLFDTALEVWHER LGAKFVTFFCGGVAARLGDPTFDEALEAWHER LGARLVTLMCNSVAARLGDPTFDAALEAWHER LGTRLVTFICGGIAARLGDPTFDAALEAWHDR LGAKFSTFVYGGVAARLGDPAFDTALEAWHER LGSKFITFLCDGVAARLGIPAFDAALEAWQER LGEKFATFVCDSIAARLGDPAFDAALDVWRHL LGDYFVTFAGNNSVASRLTDVTFQAVAQRWFPALGKRNFARIFALSGFATRICDTKFDRRINALLHTLVDRDLLYTHLYKYRGKKMDAL SEQ ID NO: 18 A0A1G0X562 (cluster_12) MPKTKITTVSHGYDLDLMSSLPNGDPNQAKQGKIYLSGNGVYVVRDVAGIVHRGQLEFAINLEQLEQKINEPAFKAVILEKTSRAVGYTISNECFNVELNALAKAGFNNLDIDKLIFRRSSRGTVQTVLNSYNILLEKPYN LDRQQILRIASHDGGSKNIAAVQKFLPCLMNFG FNADQVIKIVGHDGGSNNIDVVQQFFPELKAFG FSADQVVKIAGHSGGSNNIAVMLAVFPRLRDFG FKADDAVRIACRTGGSHNLKAVHKNYERLRARG YDNKKIISIAASNCGTETINTIMSTDEVEESDFLYFVTTVSTPVASQNLSSASNTNINYSNRFMTARKKTSDDNTDEVEEDQHRDKRRSNGR SEQ ID NO: 19 MCH8959980.1 (cluster_15) 
MDTLFELSLLPADMSPWMALMLTLFVKGALVLAATGLMAYVLRHSAAAVRYLVWCAGLLSLLALPVLSVVLPQWQVGILPQTTAFVPEAPAISAPPEATPAPIPAPAPPPALPAPTPAPAPPAVDEAPYVPVAPAPVPAPAPPAEMPALEAAPPRAFAGLDFHWTTWMFLVWFAGMMVVLIRLAIAHAGVHLLVSRATMVHDDDWHLMVEDIAKRLGIGRLVRLRRSAWTAVPLSVGVWRPTIVLPEKAETWDATRRRTVLIHELAHVKRRDCLTQLLTQITCALHWFNPLVWVAARQLYIERGRACDDVVLVAGMRASTYAETLLETARSLHSAEWSTVAALAMARRSQLEGRLLAILDPTLRHRGLNRAGSILAIVLVTSIVLPLAVLHPAQAQQAEPDSVAVSEQNPSDTLLVLFRKPIVVSGDIDPDIDLDIDLRLDIDLDIDPDIDIDPNFSISADINIDPNLNLNLNVNLNPDFDFDLEFDYDPDLTGAV SDTITVEQLIKLRRYGIDREFIQGVKALG FTDLTFNELVMLGKYGADPEYIQEMQEAG FASLSAREYASMSKYGVDPEFVEAIGEAG LTDLSVEDLISLSKYGADEDLIAAMNRLG YTGLSVDDLVSMSKYGVDEDLIESLNQAG YTGLSVDDLVSASKYGVDDDLIGSLNQAG YTGLSMDDLISASKYVDDDLIGSLNQAG YTGLSMDDLISASKYGVDDDLIASLSQHG YANLTMDDLVSASKYGVDEDLIASLSQHG YTGLSMDDLVSMSKYGVDEDMIGAMAELG YTDLSVDDLVSMSKYGVDEDLVESLGRNG YTGLSVEDLVSMSKYGVDEDLIASLARYG YTDLSVDDLVSMSKYGVDEDLIESLRRHS YANVPVDDLVSMSKYGLDGDFIEEMKGVG LDNLTLDQLIQLSRYGVDADYVKEIREAG IKDLTVDKLIEMRRHGVDGDFIRSMRDNNR SEQ ID NO: 20 MGYP003304872820 (cluster_16) MNGLLEKLKKYNISQDRYHEQSQKLIELGFTQEVAEKLIVKRSSEKSVNTLIEYFEVASKF LSHQQMASLVKHNGGGNNLLSTITHKDTLRQRG LTDQDIVKMAASNGGSKNIESVMAHYDALLKMG FADQDIVKMAANIGGSKNIESVMAHYDALQKMG FADQDIIKMAARHGGTQTIEYIIENQSSIAESG LSCSDIATKCNANSGHARLKAFLSRKVSMDA SEQ ID NO: 21 V4B0R4 (cluster_18) MVTYLNINMKLILMIAITLVNFQLMDAGKGTNGDSLMKRDSVTSPETQPTILERCLNTTECLDNKQINKELKENCEDCKEGYCRLVLTSEFNCTSNSCKNAGTCIDDGETQVCQCPPGFLGKI CEYGKFGGRAVNDDDAQRLERRGGPHGSHDGHTGN HGHDSFGRRAVNGDDVQRLERRGGPHGSHDGHTGY YGYDSFGRRAVNIDDVQRLERRGGPHGSHDGHTGN YGYDSFGRRAVNIDDAQRLERRGGPHGSHDGHTGN HGYDSFGRRAVNIDDAQRLERRGGPHGSHTGHTGN YGYDRFGGRAVNDDDAQRLERRGGPHGSHTGHTGN YGYDRFGGRAVNDDDAQRLERRGGPHGSHDGHTGN HGYDSFGRRAVNVDDAQRLERRGGPHGSHDGHTGN HGYDSFGRRAANGDNAQRLESRGGPKVGNGIDGQG FGCRSGGSLRFGGNSTDTPRDEDDIHFYPNMETDQEAILYCESICESQPLCLSYYLYRGMYDNDDNIRYCFTFSELSQDLPLGNGTAYAIGDKTMQCIMTLEDQVYQNTTEKLK SEQ ID NO: 22 A0A3S0VJV9 (cluster_19) MTNYKMINLSDAEIDAYCRKFNLTRTRFLAEKGNLEQKKNSKGNSAYSEEQINRLIFRKSSKNTIATLLDLHDDLIKNE 
LTPQQIYQLAAHDGGSKNLKTFIEQSQALQVNNQNWASLE LNVGSVLRLLAHPGGSRNLKAYSEQIQALQAQNLSWASLN LKVEDVLRFLVHGGNANKLKAYIKHTQTLQAQKKSWASLD LQVEEVLRILANDGGSQALDTLLRSFNRLHHLGFTTRQLV TLAANKWGAQALAAVLKHTPQLLTQAYSLDFICTLAARTGGAKKIQQPKHNEPLMCRHDQVITAGSTQLTLTTTSTTGLDALLCDIDEDELSEYLHSFVTSEDNHDEDALLFGDKTFEEHEPFFAIEPDTINQITYDTDEKHEALLFGDNNVEYQKFFSDMELEAAASASTHSASPLFFSHKRKHPRESNAQEEERKKNKRDEVLEDPLNSVSLTFKTC SEQ ID NO: 23 STAR_NF-κB YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAANNGGKQALQALLDKSPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQHALQALLDKGPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQQALQALLDKGPALRQAG FGNDNLVKVAANGGGAQALQALLDRGPALRQAG FSNDNLVKVAAHDGGAHALQALLDKGPALRQAG FSNDNLVKVAAHDGGQQALQTLLDKGPALRQAG FSNDNLVRIGGHDGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 24 STAR_SMAD4 YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAANNGGKQALQALLDKSPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAAHDGSQHALQALLDKGPALRQAG FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQQALQALLDKGPALRQAG FGNDNLVKVAANNGGAQALQALLDRGPALRQAG FSNDNLVKVAANNGGAHALQALLDKGPALRQAG FSNDNLVKVAAHDGGQQALQTLLDKGPALRQAG FSNDNLVRIGGNNGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 25 PqSTAR-based TALE-like peptide YPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAAXXGGKQALQALLDKSPALRQAG FGNDNLVKVAAXXGGAQALQALLDKGPALRQAG FGNDNLVKVAAXXGSQHALQALLDKGPALRQAG FGNDNLVKVAAXXGGAQALQALLDKGPALRQAG FGNDNLVKVAAXXGSQQALQALLDKGPALRQAG FGNDNLVKVAAXXGGAQALQALLDRGPALRQAG FSNDNLVKVAAXXGGAHALQALLDKGPALRQAG FSNDNLVKVAAXXGGQQALQTLLDKGPALRQAG FSNDNLVRIGGXXGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 26 based Asp TALE-like peptide of STAR1 MDIRSLLNPLPSPPGPGERAPGKRASDATPRALPSSLPDFGLPQGKRRKTTVGSSPGGRPRQDLSTLSAFFQRARVSEDAHPASATVEQSGPLGATNW ILSGQETNRIKXXGGAKALETLSEKAEALHRAG FSKQEAVAIASXXGGSQALNTVLATHATLTAAG FTHQQIVAIASXXGGSQALNTVLATHAALTAAG FTHQQIVAIASXXGGSQALDKVLATHAPLTAAG FTHRQIVGIASXXGGSQALDTVLVRYAPLRDAG 
FKHEQIVGIASXXGGSQALDKVLATHAQLTAVG FKHEQIVAIASXXGGSQALDKVLVKYAPLTAAG FTHQQIVAIASXXGGSQALDTVLATHAQLTTAG FAVEDVSAIAAXXGGAPALQAVVDHLELLMTRH SKEDIVKAGAKQRGAAAHVKQMANACRIKQESAAQSPRPMPTVLVERPIDQARTAFIPELQHCDLTGGTPIWSLDEASRVVLRHPMDPIEGNNDLFPLRDLTRPLDRVYERYADKNGKCHPNVKLTNIDLASGYKKYFNELCRDSRVGLSPSETANVRGRLLTNARTEFERLIREEAAPERPCKVRQLDHGGLLEHERMLAGQYGLFLAPAHSPQDQCTLRNGRILGFYMGMFAANEQQINAIEAQHPDYESYAMDAMRPGGKLTVYSALGCANDLAFANTALCADTPEPAYDRERLNAEFIPFEVKLTDRHGKPARETVVAMVALDNAIGKEIRVDYGDAFLRQFTTPRDRARSEEDAVVVKMEVDD SEQ ID NO: 27 PqSTAR4 WP_178089118.1 MNTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GSREQVIKIAANHGGQQALQALLDKGPALRNAG FSNDNLVKVAANGGGAQALQALLDKGPALRQAG FSNDNLVKVAANIGGAQALQALLDKGPALRQAG FSPDNLIKVAAYVGGAQALQALLDKSPALRQAG FGPDNLVKVAANNGGQQALQALLDKGPALRQAG FGPDNLVKVAAHDGGAQALQALLDKGPALRQAG FGPDNLVKVAANGGGAQALQALLDKGPTLRQAG FSPDNLVKVAAHDGGAQALQALLDKGPALRQAG FSNDNLVKVAANIGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGGQQALQALLDKGPALRNAG FSNDNLVKVAAHDGGAQALQALLDKGPALRNAG FSNDNLVKVAANIGGAQALQALLDKGPALRQAG FSNDNLVRIGGNGGAKKTLDTLLQVYPQLTQGG VSQDGILTLATKHRGASGALQSKLSELTAAGR SEQ ID NO: 28 TALE-like polypeptide based on PqSTAR4 MNTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GSREQVIKIAAXXGGQQALQALLDKGPALRNAG FSNDNLVKVAAXXGGAQALQALLDKGPALRQAG FSNDNLVKVAAXXGGAQALQALLDKGPALRQAG FSPDNLIKVAAXXGGAQALQALLDKSPALRQAG FGPDNLVKVAAXXGGQQALQALLDKGPALRQAG FGPDNLVKVAAXXGGAQALQALLDKGPALRQAG FGPDNLVKVAAXXGGAQALQALLDKGPTLRQAG FSPDNLVKVAAXXGGAQALQALLDKGPALRQAG FSNDNLVKVAAXXGGAQALQALLDKGPALRQAG FGNDNLVKVAAXXGGQQALQALLDKGPALRNAG FSNDNLVKVAAXXGGAQALQALLDKGPALRNAG FSNDNLVKVAAXXGGAQALQALLDKGPALRQAG FSNDNLVRIGGXXGAKKTLDTLLQVYPQLTQGG VSQDGILTLATKHRGASGALQSKLSELTAAGR SEQ ID NO: 29 Artificial STAR1 MYPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAAHDGGKQALQALLDKSPALRQAG FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG FGNDNLVKVAAHDGSQHALQALLDKGPALRQAG FGNDNLVKVAAHDGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQQALQALLDKGPALRQAG FGNDNLVKVAAHDGGAQALQALLDRGPALRQAG FSNDNLVKVAANNGGAHALQALLDKGPALRQAG FSNDNLVKVAANNGGQQALQTLLDKGPALRQAG 
FSNDNLVRIGGNGGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 30 Artificial STAR2 MYPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAAHDGGKQALQALLDKSPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAAHDGSQHALQALLDKGPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQQALQALLDKGPALRQAG FGNDNLVKVAANGGGAQALQALLDRGPALRQAG FSNDNLVKVAANNGGAHALQALLDKGPALRQAG FSNDNLVKVAANNGGQQALQTLLDKGPALRQAG FSNDNLVRIGGNNGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR SEQ ID NO: 31 Artificial STAR3 MYPSSSVRSALFAQSANTSSQALSAADRNKIQKAAGNATLNYVLQHLDRLQNAL GGREQVIKIAANNGGKQALQALLDKSPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAANGGSQHALQALLDKGPALRQAG FGNDNLVKVAANNGGAQALQALLDKGPALRQAG FGNDNLVKVAANNGSQQALQALLDKGPALRQAG FGNDNLVKVAANNGGAQALQALLDRGPALRQAG FSNDNLVKVAANNGGAHALQALLDKGPALRQAG FSNDNLVKVAAHDGGQQALQTLLDKGPALRQAG FSNDNLVRIGGHDGAKKTLDTLLQVYPKLTQGG VSQAEILTLATKYRGASVALQSKLKGLTAAGR
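As an illustration of how the repeat blocks in the listing above can be read, the following minimal Python sketch takes the space-separated blocks of SEQ ID NO: 31 (Artificial STAR3) and reads residues 12 and 13 of each full-length repeat, which is where Formulas I, III and IV of the claims place the RVD. The function name `extract_rvds` and the 33-residue length filter are illustrative assumptions and are not part of the application.

```python
from typing import List

# Repeat blocks of SEQ ID NO: 31 (Artificial STAR3), copied from the listing above.
STAR3_BLOCKS = """
GGREQVIKIAANNGGKQALQALLDKSPALRQAG
FGNDNLVKVAANNGGAQALQALLDKGPALRQAG
FGNDNLVKVAANGGSQHALQALLDKGPALRQAG
FGNDNLVKVAANNGGAQALQALLDKGPALRQAG
FGNDNLVKVAANNGSQQALQALLDKGPALRQAG
FGNDNLVKVAANNGGAQALQALLDRGPALRQAG
FSNDNLVKVAANNGGAHALQALLDKGPALRQAG
FSNDNLVKVAAHDGGQQALQTLLDKGPALRQAG
FSNDNLVRIGGHDGAKKTLDTLLQVYPKLTQGG
VSQAEILTLATKYRGASVALQSKLKGLTAAGR
"""

REPEAT_LEN = 33             # assumption: a full repeat in this listing has 33 residues
RVD_SLICE = slice(11, 13)   # residues 12-13 (1-based), the X1X2 position in the formulas


def extract_rvds(blocks: str) -> List[str]:
    """Return the two-residue RVD of every full-length repeat block."""
    return [b[RVD_SLICE] for b in blocks.split() if len(b) == REPEAT_LEN]


rvds = extract_rvds(STAR3_BLOCKS)
print(rvds)  # ['NN', 'NN', 'NG', 'NN', 'NN', 'NN', 'NN', 'HD', 'HD']
```

The final 32-residue block is skipped by the length filter; the sketch does not decide whether that block is a truncated repeat or part of the C-terminal region.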

Claims

1. A programmed / programmable TALE-like polypeptide comprising an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each of the tandem repeats comprises an amino acid sequence selected from Formulas I to VII:
FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I), wherein X1 is G or S; X2X3 is a repeat variable diresidue (RVD); X4 is G or S; X5 is A or Q; X6 is H or Q; X7 is A or T; and X8 is K or R;
FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II), wherein X1 is T or K; X2 is Q, E, or R; X3 is A or G; X4X5 is an RVD; X6 is N or D; X7 is T or K; X8 is A or V; X9 is T, R, or K; X10 is H or Y; X11 is A, P, or Q; X12 is T or R; X13 is A, D, or T; and X14 is A or V;
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III), wherein X1X2 is an RVD;
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV), wherein X1X2 is an RVD;
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V), wherein X1X2 is an RVD;
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI), wherein X1X2 is an RVD; and
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII), wherein X1X2 is an RVD.
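The variable positions of Formulas I and II in claim 1 can be read as character classes. The sketch below, offered only as one possible reading, encodes each formula as a Python regular expression and reports which formula a repeat fits together with its RVD. The names `FORMULA_I`, `FORMULA_II` and `classify_repeat` are hypothetical, and treating the RVD as any two uppercase letters is an assumption, since the claim does not restrict it.

```python
import re

# Formula I: bracketed classes list the residues allowed at X1 and X4..X8;
# the named group captures X2X3 (the RVD), assumed here to be any two letters.
FORMULA_I = re.compile(
    r"F[GS]NDNLVKVAA(?P<rvd>[A-Z]{2})G[GS][AQ][HQ]ALQ[AT]LLD[KR]GPALRQAG"
)

# Formula II: classes for X1..X3 and X6..X14; the named group captures X4X5.
FORMULA_II = re.compile(
    r"F[TK]H[QER]QIV[AG]IAS(?P<rvd>[A-Z]{2})GGSQAL[ND][TK]VL"
    r"[AV][TRK][HY]A[APQ]L[TR][ADT][AV]G"
)


def classify_repeat(repeat):
    """Return (formula_name, rvd) if the repeat fits Formula I or II, else None."""
    for name, pattern in (("I", FORMULA_I), ("II", FORMULA_II)):
        match = pattern.fullmatch(repeat)
        if match:
            return name, match.group("rvd")
    return None


# A repeat taken from SEQ ID NO: 29 (Artificial STAR1).
print(classify_repeat("FGNDNLVKVAAHDGSQHALQALLDKGPALRQAG"))  # ('I', 'HD')
# A repeat taken from SEQ ID NO: 26, where the listing leaves the RVD as XX.
print(classify_repeat("FTHRQIVGIASXXGGSQALDTVLVRYAPLRDAG"))  # ('II', 'XX')
# The first repeat of SEQ ID NO: 29 follows Formula III, so it fits neither.
print(classify_repeat("GGREQVIKIAAHDGGKQALQALLDKSPALRQAG"))  # None
```

Using `fullmatch` rather than `search` keeps the check strict: a block passes only if the whole 33-residue repeat agrees with the formula.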

2. The programmed / programmable TALE-like polypeptide of claim 1, wherein each of the tandem repeats comprises an amino acid sequence selected from Formulas III to XXI:
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III);
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV);
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V);
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI);
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII);
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (VIII);
FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG (IX);
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (X);
FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG (XI);
FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG (XII);
FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG (XIII);
FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG (XIV);
FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG (XV);
FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG (XVI);
FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG (XVII);
FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG (XVIII);
FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG (XIX);
FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG (XX); and
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXI),
wherein, in each formula, X1X2 is the RVD.
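Claim 2 lists fixed repeat sequences, while claim 1 defines Formulas I and II with variable positions. As a hedged consistency check, the sketch below tests which of the claim 2 formulas are instances of generic Formula I or II. The dictionary names and the helper `parent_formula` are hypothetical, and replacing the `X1X2` placeholder with two arbitrary residues is an assumption made only so that the templates have the length of a real repeat.

```python
import re

# Generic Formulas I and II from claim 1; "." stands for each RVD residue.
GENERIC = {
    "I": re.compile(r"F[GS]NDNLVKVAA..G[GS][AQ][HQ]ALQ[AT]LLD[KR]GPALRQAG"),
    "II": re.compile(
        r"F[TK]H[QER]QIV[AG]IAS..GGSQAL[ND][TK]VL[AV][TRK][HY]A[APQ]L[TR][ADT][AV]G"
    ),
}

# Formulas III to XXI as printed in claim 2.
SPECIFIC = {
    "III": "GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG",
    "IV": "FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG",
    "V": "ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG",
    "VI": "FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG",
    "VII": "FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH",
    "VIII": "FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
    "IX": "FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG",
    "X": "FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
    "XI": "FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG",
    "XII": "FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG",
    "XIII": "FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG",
    "XIV": "FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG",
    "XV": "FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG",
    "XVI": "FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG",
    "XVII": "FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG",
    "XVIII": "FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG",
    "XIX": "FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG",
    "XX": "FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG",
    "XXI": "FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG",
}


def parent_formula(template):
    """Return 'I' or 'II' if the template is an instance of that formula, else None."""
    repeat = template.replace("X1X2", "XX")  # two placeholder residues for the RVD
    for name, pattern in GENERIC.items():
        if pattern.fullmatch(repeat):
            return name
    return None


parents = {name: parent_formula(seq) for name, seq in SPECIFIC.items()}
for name, parent in parents.items():
    print(name, "->", parent or "standalone formula")
```

Under these assumptions, Formulas VIII to XIV and XXI fall under Formula I, Formulas XV to XX fall under Formula II, and Formulas III to VII stand on their own, which is in line with the groupings used in claims 3 to 6.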

3. The programmed / programmable TALE-like polypeptide of claim 1, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas I, III, and IV.

4. The programmed / programmable TALE-like polypeptide of claim 1, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas II and V through VII.

5. The programmed / programmable TALE-like polypeptide of claim 2, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas III, IV, and VIII through XIV.

6. The programmed / programmable TALE-like polypeptide of claim 2, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas V to VII and XV to XX.

7. A fusion polypeptide comprising the programmed / programmable TALE-like polypeptide of any one of claims 1-6 fused to a fusion partner.

8. The fusion polypeptide of claim 7, wherein the fusion partner is selected from: i) a polypeptide that provides nuclease activity; ii) a polypeptide that provides an activity that indirectly increases transcription by acting directly on the target DNA or on a polypeptide associated with the target DNA (e.g., a histone or other DNA-binding protein); and iii) a polypeptide that provides methyltransferase activity, demethylase activity, acetyltransferase activity, deacetylase activity, kinase activity, phosphatase activity, ubiquitin ligase activity, deubiquitination activity, adenylation activity, deadenylation activity, SUMOylation activity, deSUMOylation activity, ribosylation activity, deribosylation activity, myristoylation activity, or demyristoylation activity.

9. A fusion polypeptide comprising a DNA-binding polypeptide and a fusion partner, wherein the DNA-binding polypeptide comprises an amino acid sequence selected from SEQ ID NO: 1-6 and 9-28.

10. The fusion polypeptide of claim 9, wherein the fusion partner is selected from: i) a polypeptide that provides nuclease activity; ii) a polypeptide that provides an activity that indirectly increases transcription by acting directly on the target DNA or on a polypeptide associated with the target DNA (e.g., a histone or other DNA-binding protein); and iii) a polypeptide that provides methyltransferase activity, demethylase activity, acetyltransferase activity, deacetylase activity, kinase activity, phosphatase activity, ubiquitin ligase activity, deubiquitination activity, adenylation activity, deadenylation activity, SUMOylation activity, deSUMOylation activity, ribosylation activity, deribosylation activity, myristoylation activity, or demyristoylation activity.

11. The fusion polypeptide of any one of claims 7-10, wherein the fusion partner comprises a nuclear localization sequence (NLS).

12. A recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding said fusion polypeptide, said fusion polypeptide comprising a programmed / programmable TALE-like polypeptide of any one of claims 1-6 fused to a fusion partner, wherein said fusion partner is a polypeptide providing nuclease activity.

13. The recombinant gene editing system of claim 12, comprising a polynucleotide containing a nucleotide sequence encoding the fusion polypeptide.

14. A composition comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a programmed / programmable TALE-like polypeptide of any one of claims 1-6 fused to a fusion partner, wherein the fusion partner is a polypeptide providing nuclease activity.

15. A method for introducing double-strand breaks in a target polynucleotide, comprising contacting the polynucleotide with a recombinant gene editing system, the recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a programmed / programmable TALE-like polypeptide of any one of claims 1-6 fused to a fusion partner, wherein the fusion partner is a polypeptide providing nuclease activity.

16. A method for modifying a genomic sequence in a cell, comprising the step of introducing a recombinant gene editing system into the cell, the recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a programmed / programmable TALE-like polypeptide of any one of claims 1-6 fused to a fusion partner, wherein the fusion partner is a polypeptide providing nuclease activity.

17. The method of claim 16, wherein the recombinant gene editing system comprises a polynucleotide containing a nucleotide sequence encoding the fusion polypeptide.

18. A recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a DNA-binding polypeptide fused to a fusion partner, wherein the DNA-binding polypeptide comprises an amino acid sequence selected from SEQ ID NO: 1-6 and 9-28, wherein the fusion partner is a polypeptide providing nuclease activity.

19. The recombinant gene editing system of claim 18, wherein the recombinant gene editing system comprises a polynucleotide containing a nucleotide sequence encoding the fusion polypeptide.

20. A composition comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a DNA-binding polypeptide fused to a fusion partner, wherein the DNA-binding polypeptide comprises an amino acid sequence selected from SEQ ID NO: 1-6 and 9-28, wherein the fusion partner is a polypeptide providing nuclease activity.

21. A method for introducing a double-strand break in a target polynucleotide, comprising contacting the polynucleotide with a recombinant gene editing system, the recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a DNA-binding polypeptide fused to a fusion partner, wherein the DNA-binding polypeptide comprises an amino acid sequence selected from SEQ ID NO: 1-6 and 9-28, wherein the fusion partner is a polypeptide providing nuclease activity.

22. A method for modifying a genome sequence in a cell, comprising the step of introducing a recombinant gene editing system into the cell, the recombinant gene editing system comprising a fusion polypeptide or a polynucleotide comprising a nucleotide sequence encoding the fusion polypeptide, the fusion polypeptide comprising a DNA-binding polypeptide fused to a fusion partner, wherein the DNA-binding polypeptide comprises an amino acid sequence selected from SEQ ID NO:1-6 and 9-28, wherein the fusion partner is a polypeptide providing nuclease activity.

23. The method of claim 22, wherein the recombinant gene editing system comprises a polynucleotide containing a nucleotide sequence encoding the fusion polypeptide.

24. The method of claim 22 or 23, wherein the cell is a eukaryotic cell.

25. A randomized library of artificial transcription factors (TFs) comprising a plurality of cells, each cell carrying a vector containing a nucleotide sequence encoding a polypeptide, said polypeptide comprising an N-terminal region, two or more tandem repeats, and a C-terminal region, wherein each said tandem repeat comprises an amino acid sequence selected from Formulas I to VII:
FX1NDNLVKVAAX2X3GX4X5X6ALQX7LLDX8GPALRQAG (I), wherein X1 is G or S; X2X3 is a repeat variable diresidue (RVD); X4 is G or S; X5 is A or Q; X6 is H or Q; X7 is A or T; and X8 is K or R;
FX1HX2QIVX3IASX4X5GGSQALX6X7VLX8X9X10AX11LX12X13X14G (II), wherein X1 is T or K; X2 is Q, E, or R; X3 is A or G; X4X5 is an RVD; X6 is N or D; X7 is T or K; X8 is A or V; X9 is T, R, or K; X10 is H or Y; X11 is A, P, or Q; X12 is T or R; X13 is A, D, or T; and X14 is A or V;
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III), wherein X1X2 is an RVD;
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV), wherein X1X2 is an RVD;
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V), wherein X1X2 is an RVD;
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI), wherein X1X2 is an RVD; and
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII), wherein X1X2 is an RVD;
and wherein the tandem repeats are randomized among the cells.

26. The randomized library of claim 25, wherein each of the tandem repeats comprises an amino acid sequence selected from Formulas III to XXI:
GGREQVIKIAAX1X2GGKQALQALLDKSPALRQAG (III);
FSNDNLVRIGGX1X2GAKKTLDTLLQVYPKLTQGG (IV);
ILSGQETNRIKX1X2GGAKALETLSEKAEALHRAG (V);
FSKQEAVAIASX1X2GGSQALNTVLATHATLTAAG (VI);
FAVEDVSAIAAX1X2GGAPALQAVVDHLELLMTRH (VII);
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (VIII);
FGNDNLVKVAAX1X2GSQHALQALLDKGPALRQAG (IX);
FGNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (X);
FGNDNLVKVAAX1X2GSQQALQALLDKGPALRQAG (XI);
FGNDNLVKVAAX1X2GGAQALQALLDRGPALRQAG (XII);
FSNDNLVKVAAX1X2GGAHALQALLDKGPALRQAG (XIII);
FSNDNLVKVAAX1X2GGQQALQTLLDKGPALRQAG (XIV);
FTHQQIVAIASX1X2GGSQALNTVLATHAALTAAG (XV);
FTHQQIVAIASX1X2GGSQALDKVLATHAPLTAAG (XVI);
FTHRQIVGIASX1X2GGSQALDTVLVRYAPLRDAG (XVII);
FKHEQIVGIASX1X2GGSQALDKVLATHAQLTAVG (XVIII);
FKHEQIVAIASX1X2GGSQALDKVLVKYAPLTAAG (XIX);
FTHQQIVAIASX1X2GGSQALDTVLATHAQLTTAG (XX); and
FSNDNLVKVAAX1X2GGAQALQALLDKGPALRQAG (XXI),
wherein, in each formula, X1X2 is the RVD.

27. The randomized library of claim 25, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas I, III, and IV.

28. The randomized library of claim 25, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas II and V through VII.

29. The randomized library of claim 26, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas III, IV, and VIII through XIV.

30. The randomized library of claim 26, wherein each of the tandem repeats comprises an amino acid sequence selected from formulas V to VII and XV to XX.

31. The randomized library of any one of claims 25-30, wherein the polypeptide comprises 6-8 tandem repeats (TRs).
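To make the idea of a randomized library in claims 25-31 more concrete, the following Python sketch generates repeat arrays of 6 to 8 tandem repeats in silico. It rests on several assumptions that the claims do not state: that "randomized" refers to the RVD carried by each repeat, that the array starts with a Formula III repeat and ends with a Formula IV repeat (mirroring the layout of SEQ ID NO: 25), and that the RVD pool is limited to four RVDs that occur in the listed sequences. The N- and C-terminal regions are left out, and all names are hypothetical.

```python
import random

# Repeat scaffolds copied from the claims; "{rvd}" marks the X1X2 position.
FIRST = "GGREQVIKIAA{rvd}GGKQALQALLDKSPALRQAG"    # Formula III
MIDDLE = "FGNDNLVKVAA{rvd}GGAQALQALLDKGPALRQAG"   # Formula VIII
LAST = "FSNDNLVRIGG{rvd}GAKKTLDTLLQVYPKLTQGG"     # Formula IV

# Assumption: only these four RVDs, all of which appear in the sequence listing.
RVD_POOL = ("HD", "NG", "NI", "NN")


def random_repeat_array(rng, min_trs=6, max_trs=8):
    """Build one array of tandem repeats with a randomly chosen RVD per repeat."""
    n_repeats = rng.randint(min_trs, max_trs)  # inclusive bounds, as in claim 31
    scaffolds = [FIRST] + [MIDDLE] * (n_repeats - 2) + [LAST]
    return [s.format(rvd=rng.choice(RVD_POOL)) for s in scaffolds]


def build_library(size, seed=0):
    """Return `size` repeat arrays; the seed only makes the sketch reproducible."""
    rng = random.Random(seed)
    return [random_repeat_array(rng) for _ in range(size)]


library = build_library(5)
for member in library:
    # Print the repeat count and the RVD string read from residues 12-13.
    print(len(member), "-".join(repeat[11:13] for repeat in member))
```

In a wet-lab library each array would be encoded on a vector and carried by a cell; the sketch only enumerates candidate repeat arrays and says nothing about cloning or delivery.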