Method for classifying CHO cells
By measuring the methylation of CHO cell genomic DNA and establishing biomarkers with machine learning models, the method addresses the problems of CHO cell population screening and stability, enabling efficient and economical classification of CHO cells and improving the stability and quality of heterologous protein production.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- EVONIK OPERATIONS GMBH
- Filing Date
- 2024-11-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to efficiently and economically screen and maintain stable CHO cell populations for recombinant protein production, leading to decreased productivity and economic losses. Furthermore, existing methods cannot accurately predict long-term phenotypic adaptation of cells.
By measuring the methylation level of CHO cell genomic DNA and using machine learning models to associate DNA methylation with target phenotypes expressed by cells, biomarkers were established to classify CHO cells.
This enables accurate and convenient classification of CHO cells, ensuring the stability and high efficiency of heterologous protein production, reducing economic losses, and improving production speed and quality.
Abstract
Description
Technical Field
[0001] This invention relates to an epigenetic method for quantitatively and qualitatively classifying Chinese hamster ovary (CHO) cells for bioprocessing based on target phenotypes. Specifically, the method involves determining the methylation level of specific CpG sites, using a specific statistical algorithm to identify the target phenotype expressed by CHO cells, and classifying CHO cells based on the target phenotype expressed by the cells.
Background Technology
[0002] Chinese hamster ovary (CHO) cells have been a core host cell line for the industrial production of recombinant therapeutic proteins since 1987 and are widely used in biopharmaceutical manufacturing. Currently, approximately 70% of all recombinant biopharmaceutical proteins globally, as well as all monoclonal antibodies approved since 2016, are produced using CHO cells. The advantages of using CHO cells for biopharmaceutical production include amenability to genetic manipulation, ease of scale-up, rapid growth, and the ability to perform human-compatible post-translational modifications. However, CHO cell-based biopharmaceutical manufacturing systems currently face a bottleneck: protein productivity decreases with prolonged culture time.
[0003] Cell lines typically have high initial protein expression/production levels, but yields decrease during long-term culture. This reduces process yields, delays project timelines, and increases costs. Changes in the cell culture environment can alter cell behavior and protein yield in production cell lines.
[0004] Current methods for determining whether a CHO clone is suitable for target protein production are not only time-consuming but also lack sufficient specificity to read out the interactions between the cell and its microenvironment, which is crucial for screening optimal cell clones or cells for protein production. For example, US2017/0081732 discloses a method for screening cells capable of long-term expression or secretion of peptides based on histone acylation assays (measured by PCR), but this method is complex and lacks accuracy. Furthermore, PCR is not a high-throughput method and cannot achieve genome-wide CpG detection, so the resulting picture is incomplete.
[0005] Furthermore, CHO clones with the same genetic background may exhibit heterogeneous phenotypes, leading to instability and reduced efficiency in established production processes and causing economic losses in industrial-scale heterologous protein production. Methods relying solely on phenotypic analysis to compare and screen CHO clones cannot guarantee long-term stability. Genotypic comparisons of CHO clones also fail to explain how genes are differentially expressed to adapt to environmental conditions. As shown by Wippermann A et al. in Appl Microbiol Biotechnol. 2014 Jan;98(2):579-89, butyrate supplementation, which is known to increase specific yields in CHO cells, can also lead to alterations in epigenetic silencing events.
[0006] Therefore, there is a need in the field for an efficient and economical tool for the global assessment and regulation of CHO cell metabolism and protein production. Methods for screening and maintaining homogeneous CHO cell populations are also needed to improve production speed, quality, efficiency, and stability. Specifically, there is an urgent need for novel descriptive and predictive markers to identify CHO cells with positive and valuable phenotypes, enabling early detection and classification of CHO cells suitable for genetic modification and/or heterologous protein production.
Summary of the Invention
[0007] This invention solves the aforementioned problems by providing an accurate, simple, and reproducible method for classifying CHO cells suitable for growth and protein production early in the bioprocessing stage. Specifically, this invention provides a method that ensures the stability and high efficiency of heterologous protein production (especially industrial-scale production) while reducing economic losses.
[0008] The inventors have identified a method for classifying CHO cells using epigenetics, in which DNA methylation of the CHO cell genome is measured and a machine learning-based model is used to correlate DNA methylation levels with the target phenotype expressed by the cells. In other words, measuring DNA methylation at specific locations (CpG sites) can accurately predict the expression of the target phenotype in any tested CHO cell, thereby enabling cell classification. Since genotypic comparisons of CHO clones cannot determine how genes are differentially expressed to adapt to environmental conditions, and phenotypic analysis alone cannot guarantee long-term stability, epigenetic methods (especially DNA methylation analysis) can determine whether CHO test cells exhibit a target phenotype, and which one, at any stage of CHO cell bioprocessing (especially the early stages of bioproduct production in CHO cells), and thereby determine their suitability for genetic modification and other scientific research. The method described in any aspect of this invention allows DNA methylation to be used as a tool for the quantitative and qualitative classification of CHO cells based on protein production.
[0009] According to one aspect of the present invention, a method for constructing a biomarker for a first target phenotype of CHO cells is provided, the method comprising the following steps:
(a) determining the methylation values of all CpG sites in the genomic DNA of a CHO cell population, wherein the cell population represents the first target phenotype and forms part of the training samples;
(b) identifying a set of specific CpG sites from the training samples of step (a), wherein the CpG sites have consistent and reproducible methylation values; and
(c) processing the methylation values of step (b) and the target phenotype of the training samples with a machine learning-based model;
thereby obtaining specific CpG sites with corresponding weighting coefficients, which define the biomarker for the first target phenotype.
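Steps (a) to (c) can be illustrated with a short sketch. It is only an illustration under stated assumptions: methylation is given as β-values from two replicate measurements per sample, reproducibility is judged with a hypothetical tolerance of 0.05, and plain logistic regression fitted by gradient descent stands in for whichever machine learning model is actually chosen. The site names, data, and function names are invented for the example.

```python
import math

# Hypothetical training data (all values invented). Each sample holds two
# replicate measurements of the same CpG sites (beta values in [0, 1]) and a
# label: 1 = shows the target phenotype, 0 = does not.
SITES = ["cg_A", "cg_B", "cg_C"]
TRAINING = [
    ([[0.86, 0.20, 0.50], [0.84, 0.55, 0.52]], 1),
    ([[0.90, 0.70, 0.48], [0.88, 0.35, 0.50]], 1),
    ([[0.82, 0.40, 0.51], [0.80, 0.75, 0.49]], 1),
    ([[0.14, 0.60, 0.50], [0.16, 0.25, 0.52]], 0),
    ([[0.10, 0.30, 0.48], [0.12, 0.65, 0.50]], 0),
    ([[0.18, 0.80, 0.51], [0.20, 0.45, 0.49]], 0),
]


def reproducible_sites(training, tolerance=0.05):
    """Step (b): keep site indices whose replicates agree within tolerance."""
    n_sites = len(training[0][0][0])
    keep = []
    for j in range(n_sites):
        spreads = [max(r[j] for r in reps) - min(r[j] for r in reps)
                   for reps, _ in training]
        if max(spreads) <= tolerance:
            keep.append(j)
    return keep


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Step (c): batch gradient descent on the logistic log-loss."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, target in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            for j in range(p):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def classify(row, w, b):
    """Return 1 if the weighted methylation score favours the phenotype."""
    return 1 if b + sum(wj * xj for wj, xj in zip(w, row)) >= 0 else 0


keep = reproducible_sites(TRAINING)
selected = [SITES[j] for j in keep]
# Average the replicates of the retained sites to get one value per site.
X = [[sum(rep[j] for rep in reps) / len(reps) for j in keep]
     for reps, _ in TRAINING]
y = [label for _, label in TRAINING]
weights, intercept = fit_logistic(X, y)
biomarker = dict(zip(selected, weights))  # CpG site -> weighting coefficient
predictions = [classify(row, weights, intercept) for row in X]
print(biomarker, predictions)
```

In this toy data set the irreproducible site cg_B is discarded in step (b), and the fitted coefficients, together with the intercept, play the role of the weighted CpG set that defines the biomarker.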
[0010] Specifically, the machine learning-based model is a classifier algorithm (Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Stanford, CA: Stanford University.), which may be selected from the group consisting of: Random Forest, Decision Tree (A. Floares and A. Birlutiu, The 2012 International Joint Conference on Neural Networks (IJCNN), Brisbane, Queensland, Australia, 2012, pp. 1-7), Support Vector Machine (SVM) (Wang, C., et al. Scientific Reports, (2020) 10:5880 and Yousef, M. et al., BMC Bioinformatics 2009, 10:337), K-Nearest Neighbors (KNN) (Wang, C., 2020), Neural Network (Daoud, M. et al., 2019, 97: 204-214), multilayer perceptron (Wang, C., 2020 and Daoud, M. et al., 2019) and Gaussian mixture model (Prabakaran, I. et al. Cancer Res; 79(13), 2019: 3492-3502).
[0011] In one instance, the specific CpG site set in step (b) can ultimately be selected using machine learning techniques such as random forests (Breiman L (2001). "Random Forests". Machine Learning. 45 (1): 5-32. doi:10.1023/A:1010933404324). Random forest predictors are powerful machine learning algorithms capable of handling large datasets with high-dimensional features, such as DNA methylation data. They work by constructing multiple decision trees based on random subsets of data and features, and then combining their predictions to obtain more accurate and robust results. In the context of screening for specific CpG sites for CHO cell line engineering, random forest predictors can be trained on a set of labeled samples with known phenotypes (e.g., high or low yield, growth rate, or viability). The input features are the methylation levels of CpG sites throughout the genome, and the output is the predicted phenotype for each sample. By analyzing the importance score of each CpG site in the random forest model, the sites that are most informative and discriminative for phenotypic prediction can be identified.
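The idea of ranking CpG sites by importance can be sketched as follows. This is not a full random forest: it is a deliberately simplified stand-in that bags single-split trees (decision stumps), each grown on a bootstrap sample and a random subset of features, and sums the Gini impurity decrease credited to each feature. The data, the number of trees, and all names are assumptions made for the illustration.

```python
import random

# Invented data: feature 0 is an informative CpG site, features 1-3 carry the
# same values in both phenotype classes and are therefore uninformative.
NOISE = [[0.40, 0.55, 0.30], [0.50, 0.45, 0.35], [0.60, 0.50, 0.25],
         [0.45, 0.52, 0.32], [0.55, 0.48, 0.28], [0.42, 0.58, 0.38]]
HIGH = [0.85, 0.90, 0.80, 0.88, 0.83, 0.92]   # e.g. high-yield clones
LOW = [0.15, 0.10, 0.20, 0.12, 0.17, 0.08]    # e.g. low-yield clones
ROWS = ([[s] + n for s, n in zip(HIGH, NOISE)]
        + [[s] + n for s, n in zip(LOW, NOISE)])
LABELS = [1] * 6 + [0] * 6


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_stump(rows, labels, features):
    """Return (feature, impurity decrease) of the best single split."""
    parent = gini(labels)
    best_feature, best_gain = None, 0.0
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            children = (len(left) * gini(left)
                        + len(right) * gini(right)) / len(labels)
            if parent - children > best_gain:
                best_feature, best_gain = f, parent - children
    return best_feature, best_gain


def stump_forest_importance(rows, labels, n_trees=200, n_sub=2, seed=0):
    """Normalised importance per feature from bagged decision stumps."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    totals = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), n_sub)          # random feature subset
        f, gain = best_stump([rows[i] for i in idx],
                             [labels[i] for i in idx], feats)
        if f is not None:
            totals[f] += gain
    total = sum(totals)
    return [t / total if total else 0.0 for t in totals]


importance = stump_forest_importance(ROWS, LABELS)
ranking = sorted(range(len(importance)), key=lambda f: -importance[f])
print(ranking, [round(v, 3) for v in importance])
```

A production analysis would more likely rely on an established random forest implementation; the sketch only shows why an informative site accumulates a higher importance score than sites whose methylation does not differ between phenotypes.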
[0012] The biomarkers described in any aspect of this invention may also be referred to as "CHO methylation clocks," which can establish a correlation between the target phenotype expressed by CHO cells and the DNA methylation pattern. Using the methods described in any aspect of this invention, CHO cells can be classified based on their methylation patterns and the target phenotype expressed or expected to be expressed.
[0013] As used herein, the term "biomarker" refers to a natural feature capable of identifying specific pathological or physiological processes in CHO cells. Specifically, a biomarker refers to a feature, particularly DNA methylation patterns, that can be used to identify target phenotypes in CHO cells. Therefore, a biomarker for identifying target phenotypes in CHO cells refers to a specific CpG site with corresponding weighted coefficients as parameters. The biomarker for the first target phenotype refers to a first group of specific CpG sites with corresponding weighted coefficients. The biomarker for the second target phenotype refers to a second group of specific CpG sites with corresponding weighted coefficients. The biomarkers for the third, fourth, and subsequent target phenotypes refer to the specific CpG sites of the third, fourth, and subsequent groups, respectively, with corresponding weighted coefficients. Multiple biomarkers can be integrated to obtain a biomarker set to determine a series of target phenotypes in CHO test cells. Specifically, the specific CpG site set is determined from all CpG sites in CHO cells based on the methylation values measured in step (a) using any DNA methylation detection method known in the art. More specifically, the specific CpG sites are selected based on reliable methylation values (i.e., the methylation values of the specific CpG sites are consistent and reproducible using any methylation detection method known in the art). When DNA methylation is measured multiple times using the same method, the group of specific CpG sites yields the same or highly similar methylation values. Therefore, the CpG sites produce consistent and reproducible methylation value readings.
[0014] Steps (b) and (c) of any aspect of the present invention can be performed by a person skilled in the art via a computer.
[0015] As used herein, the term "CHO cell genome" refers to the genomic DNA of CHO cells, excluding viral DNA (especially CMV and SV40) used to introduce exogenous DNA into the cell. Specifically, the CHO cell genome can be the genome as it occurs naturally in the cell. The term may also include genes added to the CHO genome through genetic modification (e.g., to improve protein production), and may or may not include the viral genes and promoters used to introduce genes into the CHO genome. Therefore, the term "CHO cell genome" may exclude viral genes and promoters, and/or may include endogenous or homologous genes of CHO cells and/or genetically modified endogenous or homologous genes of CHO cells, and/or intergenic regions (i.e., DNA located between CHO cell genes).
[0016] The CHO cell line refers to an immortalized Chinese hamster ovary (CHO) cell line derived from the Chinese hamster (Cricetulus griseus). Specifically, the CHO cell line can be selected from the group consisting of: CHO-K1 (ATCC), CHO-DG44 (Thermo Fisher Scientific), CHO-DXB11 (ATCC), ExpiCHO-S™ cells (Thermo Fisher Scientific), FreeStyle™ CHO-S™ cells (Thermo Fisher Scientific), CHO 1-15 500 (ATCC), and Agarabi CHO (ATCC).
[0017] As used herein, the term "CHO cell population" refers to a plurality of CHO cells used as reference CHO cells, which exhibit at least one target phenotype. Different CHO cells expressing the target phenotype are then used to determine different CpG sites that are typically methylated, thereby determining the methylation pattern or reference methylation map of the first target phenotype. In one instance, different CpG sites are collected from a plurality of CHO cells, each exhibiting the first target phenotype, to obtain a reference methylation map or biomarker of the first target phenotype. Therefore, the reference methylation map described in any aspect of the invention is not a natural methylation map derived from a single CHO cell, but rather an artificial map obtained by combining relevant CpG sites from different reference CHO cell lines (each exhibiting at least one target phenotype). The CHO cell population is considered representative of the first target phenotype, and the methylation pattern of each CHO cell constituting the CHO cell population of the first target phenotype constitutes part of a training sample used to generate a biomarker for the first target phenotype. In one instance, different CpG sites are collected from a single reference CHO cell exhibiting at least one target phenotype. In another example, different CpG sites are collected from multiple CHO cells, each exhibiting at least one target phenotype. The reference atlas can be determined from multiple training samples using multivariate statistical methods, such as principal component analysis or multidimensional scaling.
[0018] As used herein, the term "target phenotype" in relation to CHO cells refers to cells exhibiting at least one characteristic selected from the group consisting of: optimal heterologous protein production, phenotypic homogeneity, protein quality, optimal carbohydrate metabolism, optimal amino acid metabolism, optimal lipid metabolism, optimal cell viability, and combinations thereof. Specifically, a target phenotype refers to the characteristics exhibited by CHO cells according to any aspect of the invention that are conducive to cell viability, suitable for protein production, and enhance overall cellular protein production levels. All these phenotypes can be used collectively to classify CHO cells and determine their suitability for protein production. The more target phenotypes a CHO cell is predicted to express or is classified as, the more suitable it is for heterologous protein production and larger-scale protein production.
[0019] As used herein, the term "fitness" refers to the suitability of a CHO cell line for achieving optimal heterologous protein production. In one example, the suitability of CHO cells for optimal heterologous protein production can be determined before introducing the transgene. In this case, the CHO cells may possess at least one target phenotype or characteristic that enables them to grow well, readily take up the target transgene, and achieve optimal heterologous protein production after taking up the transgene, wherein the protein is a product of the target transgene. These characteristics or target phenotypes include at least: optimal glucose consumption, growth rate, lactate production, ammonia accumulation, etc. When CHO cells are confirmed to exhibit at least one of the above target phenotypes, it can be determined that the cells are suitable for optimal heterologous protein production upon introduction of the target transgene.
[0020] In another example, it can be determined whether CHO cells are suitable for optimal heterologous protein production after the transgene has been introduced into the cells. In this case, the CHO cells are genetically modified using methods known in the art to introduce the transgene, and the genetically modified cells are capable of optimal heterologous protein production, wherein the protein is the translation product of the transgene. The CHO cells in this example may have at least one target phenotype that enables the genetically modified cell line to have good viability and optimal target protein productivity. These target phenotypes may include: cell viability (survival rate), protein productivity (in terms of protein yield and quality), phenotypic uniformity, cell exhaustion, etc. Therefore, the method described in any aspect of the present invention can be used for genetically modified (i.e., transgenes have been introduced into the cells) CHO cells or unmodified CHO cells. In both cases, the CHO cells can be used for heterologous protein production.
[0021] As used herein, "transgene" refers to a gene that is taken from the genome of one organism and inserted into the genome of another organism through genetic modification techniques. For example, a human gene is artificially introduced into the genome of a CHO cell to produce at least one target protein, especially a therapeutic protein.
[0022] The term "therapeutic protein" as used in this article refers to genetically engineered versions of natural human proteins. Examples of therapeutic proteins include: antibody-based drugs, anticoagulants, clotting factors, bone morphogenetic proteins, engineered protein scaffolds, enzymes, growth factors, hormones, interferons, interleukins, etc.
[0023] The term "cell viability" as used herein refers to the ability of cells to remain viable and proliferate. Cell viability is a measure of the proportion of live cells in a cell population. Cell proliferation refers to the increase in the number of cells due to cell division. Commonly used methods for detecting cell viability include: BrdU cell proliferation assay, MTT cell proliferation assay, trypan blue cell counting, and ATP cell viability assay.
[0024] The term "cell exhaustion" as used in this article refers to a state in which cells lose their ability to perform metabolic activities (including the production of heterologous proteins). Cell exhaustion can be measured using metabolite assays.
[0025] The term “phenotypic homogeneity” used in this article refers to the state in which all cells in a cell population exhibit the same phenotype under specific conditions.
[0026] The term "heterologous protein production" as used herein refers to the production of proteins that are not endogenous to the cell, i.e., the expression of genes that are not naturally expressed in host CHO cells, especially proteins encoded by transgenes. Commonly used quantitative assays for heterologous protein production include enzyme-linked immunosorbent assay (ELISA), chromatography, and bioprocess analyzers.
[0027] The term "host cell" as used in this article refers to a cellular system used for the expression of heterologous proteins. For example, CHO cells are a primary host for the production of various therapeutic proteins.
[0028] As used herein, the term "optimal heterologous protein production" refers to the ability of CHO cells to produce proteins at high levels, particularly during the industrial or large-scale production of recombinant proteins, where the proteins are typically functional proteins not naturally present in wild-type CHO cells. Specifically, to achieve optimal heterologous protein production, CHO cells exhibit minimized metabolic burden and cytotoxicity. More specifically, "optimal heterologous protein production" refers to high-level protein production where CHO cells not only produce high yields of the target protein but also maintain protein production throughout the entire production period (i.e., the extended culture period), thereby ensuring consistent and stable protein quality. Specifically, for CHO cells to achieve "optimal heterologous protein production," they must exhibit at least one or more of the following target phenotypes: phenotypic homogeneity, protein productivity, and protein quality. More specifically, to achieve "optimal heterologous protein production," CHO cells may possess phenotypic homogeneity and protein productivity, or phenotypic homogeneity and protein quality, or protein productivity and protein quality, or phenotypic homogeneity, protein productivity, and protein quality.
[0029] The term "protein productivity" as used herein refers to the amount of protein produced by each viable cell at a single titer point. It is calculated by dividing the titer (mg) by the viable cell density (VCD, in cells/mL), and the final measurement is expressed as the amount of protein per cell (mg/cell).
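The calculation can be written out as a small worked example. The sketch assumes that the titer is expressed per millilitre of culture (mg/mL), so that dividing by a viable cell density in cells/mL gives mg per cell; the function name and the numbers are invented.

```python
def protein_productivity(titer_mg_per_ml, vcd_cells_per_ml):
    """Return protein per viable cell (mg/cell) at one sampling point."""
    if vcd_cells_per_ml <= 0:
        raise ValueError("viable cell density must be positive")
    return titer_mg_per_ml / vcd_cells_per_ml


# Invented example values: 2.0 mg/mL titer at 1e7 viable cells/mL.
per_cell_mg = protein_productivity(2.0, 1.0e7)
per_cell_pg = per_cell_mg * 1.0e9  # 1 mg = 1e9 pg
print(f"{per_cell_mg:.2e} mg/cell = {per_cell_pg:.0f} pg/cell")
```

With these illustrative inputs the result is 2e-7 mg, or 200 pg, of protein per viable cell.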
[0030] The term "protein quality" refers to post-translational modifications that determine protein efficacy and function. These modifications typically include phosphorylation, glycosylation, ubiquitination, methylation, acetylation, and protein folding. For example, protein glycosylation is a key quality attribute regulating the efficacy, stability, and half-life of therapeutic proteins. Protein quality can be determined using techniques such as immunoprecipitation, biochemical assays, and mass spectrometry (MS).
[0031] The term "carbohydrate metabolism" as used in this article refers to almost all or all of the biochemical processes within cells responsible for the generation, breakdown, and interconversion of carbohydrates. It involves multiple pathways, such as glycolysis, gluconeogenesis, glycogenolysis, and glycogen synthesis. For example, glycolysis is one of the key metabolic pathways in CHO cells. CHO cells consume glucose as their primary carbon source to generate energy through glycolysis, producing lactate as the most common metabolic byproduct. Specifically, "optimal carbohydrate metabolism" refers to the ideal or optimal carbohydrate metabolism that CHO cells can achieve.
[0032] Similarly, the term "amino acid metabolism" as used in this article refers to all biochemical processes within CHO cells responsible for the generation, breakdown, and interconversion of amino acids. Amino acids are the basic building blocks of proteins, constituting all proteinaceous substances in cells, including the cytoskeleton, the protein components of enzymes, receptors, and signaling molecules. Furthermore, amino acids are used for cell growth and maintenance. For example, glutamine breakdown is a key metabolic pathway in CHO cells. Glutamine breakdown is the primary pathway by which CHO cells absorb organic nitrogen for biomass synthesis while releasing ammonium as a major byproduct. Specifically, the term "optimal amino acid metabolism" refers to the ideal or optimal state of amino acid metabolism that CHO cells can achieve.
[0033] The term "lipid metabolism" as used in this article refers to the synthesis and degradation of lipids within CHO cells, including lipolysis for energy or storage, and the synthesis of structural and functional lipids. Lipids are major components of cell membranes, acting as second messengers in cell communication and participating in signal transduction, transport, and secretion. Lipids also serve as an important energy source through β-oxidation and the tricarboxylic acid (TCA) cycle. Lipid metabolism can significantly affect cell growth. For example, the synthesis and degradation of triglycerides in CHO cells can greatly influence overall cell metabolism and survival. Specifically, the term "optimal lipid metabolism" refers to the ideal or optimal lipid metabolism state achievable in CHO cells.
[0034] Carbohydrate, amino acid, and lipid metabolism can be determined by metabolite detection, HPLC, and bioprocess analyzers. These methods are disclosed in at least Coulet, M. et al. (Cells, 2022, 11, 1929), Fan Y et al. (Biotechnol Bioeng, 2015, 112 (3):521-535), and Ali AS et al. (Biotechnol J, 2018, 13 (10):e1700745).
[0035] Specifically, the target phenotype described in any aspect of the present invention is selected from the group consisting of: phenotypic uniformity, protein quality, optimal carbohydrate metabolism, optimal amino acid metabolism, optimal lipid metabolism, optimal heterologous protein production, optimal cell viability, and combinations thereof.
[0036] In the context of this invention, the term "methylation value" refers to the average methylation value of at least one cytosine (C) residue within a CHO cell genomic DNA sequence. Both β and M values can be used as indicators of (average) methylation levels. M values have higher statistical validity in analyses of differences in methylation levels. β values, however, are more biologically interpretable, and should be reported simultaneously when using the M value method for differential methylation analysis. Methylation values at CpG sites or DNA regions with multiple CpG sites can be determined using any method known in the art. In one example, DNA methylation values were determined using a bead-based DNA methylation array.
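The relationship between the two scales can be made concrete. The sketch below uses the commonly applied logit transform, M = log2(β / (1 - β)), and its inverse; the clipping constant `eps` is an assumption added to avoid infinite values at β = 0 or β = 1.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a beta value (0..1) to an M value via the log2 logit."""
    clipped = min(max(beta, eps), 1.0 - eps)  # avoid log of 0 or division by 0
    return math.log2(clipped / (1.0 - clipped))


def m_to_beta(m_value):
    """Convert an M value back to a beta value."""
    return 2.0 ** m_value / (2.0 ** m_value + 1.0)


for beta in (0.1, 0.5, 0.8):
    m_value = beta_to_m(beta)
    print(f"beta={beta:.2f}  M={m_value:+.3f}  back={m_to_beta(m_value):.2f}")
```

A β value of 0.5 corresponds to M = 0, and a β value of 0.8 to M of about 2, which illustrates why M values spread out the extremes that β values compress.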
[0037] The term "methylation rate" refers to the number of methylated cytosines at a specific site divided by the total number of cytosines covered. The methylation rate of the CpG sites in step (b) can be advantageously determined using bisulfite sequencing.
[0038] The term "CpG site read coverage" should be understood as the number of sequencing reads aligned to known CpG sites in a reference sequence. The methylation rate and read coverage of genomic CpG sites in step (a) can be determined using bisulfite sequencing. In one example, bisulfite sequencing is used to determine the methylation rate and read coverage of CpG sites in step (a), and bisulfite sequencing is used in step (b) to determine the specific CpG site set by cutoff values.
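As a minimal sketch of these two definitions, the following computes the methylation rate per CpG site from read counts and applies a read-coverage cutoff. The cutoff of 10 reads, the site keys, and the counts are assumptions chosen for illustration, not values taken from the method.

```python
def methylation_rates(counts, min_coverage=10):
    """Return {site: methylation rate} for sites passing the coverage cutoff.

    counts maps a site key to (methylated_reads, unmethylated_reads).
    """
    rates = {}
    for site, (methylated, unmethylated) in counts.items():
        coverage = methylated + unmethylated   # reads covering the cytosine
        if coverage >= min_coverage:
            rates[site] = methylated / coverage
    return rates


# Invented read counts keyed by (scaffold, position).
example_counts = {
    ("scaffold_1", 1050): (18, 2),   # coverage 20, rate 0.9
    ("scaffold_1", 1100): (3, 1),    # coverage 4, dropped by the cutoff
    ("scaffold_2", 500): (0, 25),    # coverage 25, rate 0.0
}
rates = methylation_rates(example_counts)
print(rates)
```

Sites with too few reads are excluded because their rates are dominated by sampling noise, which is one way a coverage cutoff supports consistent and reproducible values.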
[0039] Step (b) of the first aspect of the present invention may include a bisulfite conversion process.
[0040] Bisulfite treatment of genomic DNA (used interchangeably with the term "bisulfite modification") refers to the treatment of genomic DNA with a deaminating agent (e.g., bisulfite), which can be applied to all DNA regardless of whether it is methylated. Specifically, the term "bisulfite" as used herein includes any suitable type of bisulfite (e.g., sodium bisulfite), or other chemical agents capable of chemically converting cytosine (C) to uracil (U) without chemically modifying methylated cytosine, thus enabling differential modification of DNA sequences based on their methylation status, as described in US2010/0112595. Reagents used herein capable of "differentiating" methylated or unmethylated DNA include any reagent capable of modifying methylated and/or unmethylated DNA through processes that produce distinguishable products, thereby enabling identification of DNA methylation status. Such processes may include, but are not limited to, chemical reactions (e.g., bisulfite-mediated C to U conversion) and enzymatic treatments (e.g., cleavage by methylation-dependent endonucleases). Therefore, enzymes that preferentially cleave or digest methylated DNA can cleave or digest DNA molecules more efficiently when DNA is methylated, while enzymes that preferentially cleave or digest unmethylated DNA exhibit higher efficiency when DNA is unmethylated.
[0041] Therefore, before performing step (a) as described in any aspect of the invention, the genomic DNA contained in or obtained from the cell is first subjected to bisulfite treatment.
[0042] Alternative methods known in the art can be used instead of bisulfite treatment. Those skilled in the art will know which other methods can be used. In one example, TET-assisted pyridine borane sequencing (TAPS) can be used to detect 5mC and 5hmC (Yibin Liu et al., Nature Biotechnology, 37:424-429, 2019).
[0043] Whole-genome bisulfite sequencing is a method for analyzing genome-wide DNA methylation based on sodium bisulfite conversion of genomic DNA, followed by sequencing on a next-generation sequencing platform. The reads are then aligned to a reference genome, and the methylation status of CpG dinucleotides is determined based on mismatches resulting from the conversion of unmethylated cytosine to uracil. Specifically, in step (a), bisulfite sequencing is used to determine the methylation rate and read coverage of CpG sites; in step (b), bisulfite sequencing is used to determine the specific CpG site set based on cutoff values.
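The mismatch-based read-out can be illustrated with a toy simulation. The sketch assumes a single read from the same strand as the reference, already aligned without gaps or sequencing errors, and it writes converted cytosines as T because uracil is read as thymine after amplification; the sequence and function names are invented.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate conversion: unmethylated C becomes T, methylated C is kept."""
    converted = []
    for index, base in enumerate(sequence):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_cpg_methylation(reference, read):
    """Compare an aligned read with the reference at every CpG cytosine."""
    calls = {}
    for index in range(len(reference) - 1):
        if reference[index:index + 2] == "CG":
            if read[index] == "C":
                calls[index] = True      # C retained: methylated
            elif read[index] == "T":
                calls[index] = False     # C>T mismatch: unmethylated
    return calls


reference = "ACGTTCGACCGA"           # CpG cytosines at positions 1, 5 and 9
read = bisulfite_convert(reference, methylated_positions={1, 9})
calls = call_cpg_methylation(reference, read)
print(read, calls)
```

Counting such calls over all reads covering a site gives the methylated and total cytosine counts from which the methylation rate and read coverage are derived.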
[0044] In this document, the term "test" used in conjunction with the term "cell" refers to cells tested using the methods described in any aspect of this invention, forming the basis for the analytical applications of this invention. Therefore, "test cell" refers to CHO cells or a group of CHO cells tested using the methods described in any aspect of this invention, or a pattern obtained or generated in this context. Conversely, the terms "reference" or "control" refer to objects primarily determined in advance for comparison with the test subject. Specifically, "test cell" refers to cells whose methylation status needs to be determined to determine whether they are suitable for optimal heterologous protein production; "control" or "reference" refers to cells known to exhibit optimal heterologous protein production or their methylation patterns.
[0045] The term "CpG site" or "methylation site" as used in this article refers to nucleotides in nucleic acids (DNA or RNA) that are prone to methylation, whether through natural in vivo events or artificial chemical methylation events in vitro. Some of these sites in cells may be hypermethylated, while others may be hypomethylated. In some cases, CpG sites may not be completely hypermethylated or hypomethylated, but rather exhibit a quantifiable level of methylation. Therefore, methylation can be quantified, rather than being an absolute state of hypermethylation or hypomethylation.
[0046] The term "methylated nucleic acid molecule" as used in this article refers to a nucleic acid molecule containing one or more methylated nucleotides.
[0047] The training samples associated with the target phenotype of CHO cells in step (a) can be collected at any stage of CHO cell growth. The samples are always CHO genomic DNA.
[0048] Any machine learning-based model can be used to screen for the most predictive CpG sites, differentially methylated regions (DMRs), low-methylated regions (LMRs), and CpG sites within subsets of these regions for each target phenotype. These specific CpG sites screened by any aspect of this invention can then be used to develop methods for classifying CHO cells into different categories based on target phenotype and DNA methylation characteristics.
[0049] Specifically, the specific CpG sites described in any aspect of the present invention are distributed in low methylation regions (LMR), CpG islands, variable methylation sites and / or differential methylation regions (DMR) in the CHO cell genome.
[0050] A low-methylation region (LMR) is a genomic region in which less than 60% of CpG sites are methylated. More specifically, less than 50%, 40%, 30%, 20%, or 10% of CpG sites are methylated in an LMR. LMRs in genomic DNA can be identified or detected using any method known in the art, including programs such as MethylSeekR. Specifically, an LMR in genomic DNA has at least three consecutive CpG sites, and no single nucleotide polymorphisms (SNPs) are present at any CpG site. More specifically, an LMR in genomic DNA is identified at least according to the methods disclosed in Burger L (Nucleic Acids Research, 2013, 41 (16):e155) and/or Stadler M (Nature, 2011, 480, 490-495). Specifically, an LMR:
- has an average methylation level between 10% and 50%;
- is a region with low CG density that does not overlap with CpG islands;
- is enriched in histone H3 monomethylated at lysine 4 (H3K4me1), DNase I hypersensitive sites (DHS), and the transcriptional coactivators CREB-binding protein (CBP) and p300;
- is located mainly distal to promoters, in intergenic or intronic regions; and/or
- has no single nucleotide polymorphisms (SNPs) at any CpG position.
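Two of these criteria, the minimum of three consecutive CpG sites and the 10% to 50% average methylation, can be shown with a simple threshold scan. This is a sketch only and is not the procedure used by MethylSeekR; it ignores CG density, SNPs, and chromatin marks, and the per-site cutoff, data, and names are assumptions.

```python
def find_lmr_candidates(sites, max_site_meth=0.5, min_mean=0.1,
                        max_mean=0.5, min_cpgs=3):
    """Return (start, end, n_cpgs, mean_methylation) for candidate regions.

    sites is a list of (position, methylation) sorted by position on one
    scaffold. Consecutive CpGs at or below max_site_meth form a run.
    """
    runs, current = [], []
    for position, methylation in sites:
        if methylation <= max_site_meth:
            current.append((position, methylation))
        else:
            if current:
                runs.append(current)
            current = []
    if current:
        runs.append(current)

    regions = []
    for run in runs:
        mean = sum(m for _, m in run) / len(run)
        if len(run) >= min_cpgs and min_mean <= mean <= max_mean:
            regions.append((run[0][0], run[-1][0], len(run), mean))
    return regions


# Invented methylation profile along one scaffold.
profile = [(100, 0.92), (150, 0.30), (180, 0.25), (210, 0.40), (400, 0.88),
           (450, 0.20), (470, 0.95), (500, 0.02), (520, 0.03), (540, 0.01)]
lmr_candidates = find_lmr_candidates(profile)
print(lmr_candidates)
```

In the toy profile only the run at positions 150 to 210 qualifies: the single low site at 450 is too short, and the run at 500 to 540 is almost unmethylated, so its mean falls below the 10% lower bound.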
[0051] Hypomethylated regions (LMRs) represent a key feature of the dynamic methylome. LMRs are locally downgraded regions in the DNA methylation map, representing CpG-poor distal regulatory regions that typically reflect the binding of transcription factors and other DNA-binding proteins. LMRs were first described in mice (Stadler et al., 2011, Nature, 480, 490-95). The evolutionary conservation of LMRs outside of mammals has not been investigated.
[0052] As used herein, a “CpG island” refers to a DNA sequence with a functionally or structurally deviated CpG density. For example, Yamada et al. described a set of criteria for identifying CpG islands: a length of at least 400 nucleotides, a GC content greater than 50%, and an OCF / ECF ratio greater than 0.6 (Yamada et al., 2004, Genome Research, 14, 247-266). Other researchers use a more lenient definition: a length of at least 200 nucleotides, a GC content greater than 50%, and an OCF / ECF ratio greater than 0.6 (Takai et al., 2002, Proc. Natl. Acad. Sci. USA, 99, 3740-3745). In the context of this invention, the terms “methylation map,” “methylation pattern,” and “methylation state” are used to describe the methylation status, condition, or state of a genomic sequence, referring to the methylation characteristics of a DNA fragment at a specific genomic locus. Such characteristics include, but are not limited to: the presence of methylated cytosine (C) residues in the DNA sequence, the location of methylated C residues, the proportion of methylated C residues in a specific residue fragment, and differences in allele methylation due to, for example, different allele sources.
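The two sets of CpG island criteria quoted above can be checked with a few lines of code. The following is a minimal sketch, assuming that "OCF / ECF" denotes the ratio of observed to expected CpG frequency and that the expected CpG count is (number of C × number of G) / sequence length; the function names and default thresholds are illustrative and not part of the described method.

```python
def cpg_island_stats(seq: str) -> dict:
    """Return length, GC fraction and observed/expected CpG ratio of a sequence."""
    s = seq.upper()
    n = len(s)
    c_count = s.count("C")
    g_count = s.count("G")
    cpg_count = s.count("CG")  # observed CpG dinucleotides
    gc_fraction = (c_count + g_count) / n if n else 0.0
    # Assumed definition of the expected CpG count: (#C * #G) / length
    expected = (c_count * g_count) / n if n else 0.0
    obs_exp = cpg_count / expected if expected else 0.0
    return {"length": n, "gc_fraction": gc_fraction, "obs_exp": obs_exp}


def is_cpg_island(seq: str, min_length: int = 200,
                  min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply length, GC content and observed/expected thresholds."""
    stats = cpg_island_stats(seq)
    return (stats["length"] >= min_length
            and stats["gc_fraction"] > min_gc
            and stats["obs_exp"] > min_obs_exp)


toy_sequence = "CG" * 100                             # 200 nt, artificial example
lenient = is_cpg_island(toy_sequence)                 # 200-nt criterion
strict = is_cpg_island(toy_sequence, min_length=400)  # 400-nt criterion
```

With the defaults the function applies the more lenient 200-nucleotide definition; passing `min_length=400` applies the stricter one.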
[0053] Differentially methylated regions (DMRs) are genomic regions with different methylation states among multiple biological samples (such as tissues, cells, individuals, etc.). These are genomic regions with phenotypic differences. Statistical power may be higher when adjacent differentially methylated sites (DMPs) are combined for analysis (Gu H et al., 2010, Nat Methods, 7:133-6). The length of DMRs can range from hundreds to thousands of bases (Rakyan et al., 2011, Nat Rev Genet, 12:529-41; Bock C, 2012, Nat Rev Genet, 13:705-19).
[0054] Differentially methylated regions (DMRs) can be distributed throughout the genome, but are identified particularly near gene promoter regions, intragenic regions, and intergenic regulatory regions. Regions are of two types: predefined and user-defined. Regions with special biological significance (e.g., CpG islands, CpG shores, UTRs, etc.) are predefined regions. Many traditional statistical tests (including t-tests and Wilcoxon rank-sum tests) can be performed at the region level. User-defined regions can be determined based on criteria such as a fixed region length, a fixed number of adjacent significant CpG sites, or smoothed estimated effect sizes.
[0055] Partially methylated domains (PMDs) are extended regions in the genome that exhibit a reduced average level of DNA methylation. They cover gene-poor and transcriptionally inactive regions and tend to form heterochromatin.
[0056] Differentially methylated sites (DMPs) are CpG sites with different DNA methylation states among different biological samples, and are considered to be potential functional regions involved in gene transcription regulation.
[0057] Specifically, steps (a)-(c) of any aspect of the present invention are repeated for the second target phenotype and subsequent target phenotypes, so that each target phenotype corresponds to at least one biomarker, thereby generating a set of biomarkers for different target phenotypes.
[0058] Specifically, the biomarkers described in any aspect of the present invention are a set of specific CpG sites, corresponding weighting coefficients, and the intercept of a linear model equation associated with a first target phenotype. Therefore, a biomarker set refers to a variety of biomarkers, each of which is associated with at least one target phenotype exhibited by CHO cells.
[0059] The minimum coverage cutoff value determined in step (b) can be 10, 9, 8, 7, 6, 5, 4, or 3. Specifically, the minimum coverage cutoff value determined in step (b) can be 3.
[0060] According to another aspect of the present invention, a method for classifying CHO test cells is provided, the method comprising the following steps: (a) determining the DNA methylation level at specific CpG sites in genomic DNA extracted from the CHO test cells; (b) comparing the methylation levels of these specific CpG sites from step (a) with the methylation levels of the same specific CpG sites in reference CHO cells known to exhibit at least one target phenotype; and (c) deducing from this comparison the target phenotype expressed by the CHO test cells; wherein the specific CpG sites and their respective weighting coefficients are parameters of the biomarker for the target phenotype, and the classification of the CHO test cells is determined by the target phenotype expressed by the CHO test cells.
[0061] Specifically, the biomarkers according to any aspect of the present invention are determined using the method of the first aspect of the present invention.
[0062] The DNA methylation level described in any aspect of this invention can be determined using any method known in the art. In one example, the methylation level can be determined using a commercial Illumina™ platform. In another example, a bead-based DNA methylation array can be used to determine the methylation level.
[0063] Arrays enable high-throughput and robust methods for determining semi-quantitative / quantitative DNA methylation information from small amounts of extracted target DNA samples. These custom-designed arrays can utilize Illumina iScan and Infinium platform technologies or equivalents; for example, each chip can hold 100,000 different types of microbeads that covalently bind DNA methylation probes. Each probe represents a CpG methylation site at the end of its sequence. Before hybridization on the array chip, the DNA sample undergoes bisulfite conversion, amplification, fragmentation, precipitation, and resuspension. The DNA hybridizes with the microbeads at each CpG site on the chip, allowing for specific detection of methylation changes at each site via single nucleotide extension. This approach is particularly advantageous because array-based methods are simple to operate and provide accurate and reproducible results.
[0064] Furthermore, compared to traditional sequencing (which can take weeks to generate data), array technology offers a shorter turnaround time. The volume and complexity of the generated data are lower than sequencing, resulting in a less computationally intensive process. This allows for faster computation to yield interpretable results from experimental groups. Overall, microarray technology is approximately 10 times faster and 10 times cheaper than traditional sequencing, while still providing quantitative detection of methylation levels at specific CpG sites.
[0065] As used herein, the term "array" refers to an intentionally constructed collection of probe molecules that can be prepared by synthetic or biosynthetic methods. The probe molecules in an array may be identical or different from each other. Arrays can take various forms, such as soluble molecular libraries, or libraries of compounds attached to resin beads, silica chips, or other solid supports.
[0066] Specifically, the array provides a convenient platform for the simultaneous analysis of a large number of CpG sites, such as at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100, 500, 1000, 5000, 10,000, 100,000 or more sites or loci. Specifically, the array contains multiple different probe molecules that may be attached to a substrate or otherwise spatially distinguished within the array. Examples of arrays that can be used in any aspect of the invention include slide arrays, silicon wafer arrays, liquid phase arrays, bead-based arrays, etc. In one example, the array technology used according to any aspect of the invention combines a miniaturized array platform, high-level multiplexing, and scalable automation for sample handling and data processing.
[0067] In particular, the array according to any aspect of the invention can be an array of arrays (also referred to as a composite array) having multiple individual arrays configured to allow simultaneous processing of multiple samples. Examples of composite arrays and the underlying technology are disclosed in at least US 6,429,027 and US 2002 / 0102578. The substrate of the composite array may include multiple individual array sites, each array site having multiple probes, and physically separated from other measurement sites on the same substrate, such that fluid contacting one array site is prevented from contacting another array site. Each array site may have multiple different probe molecules that are directly attached to the substrate or attached to the substrate via rigid particles within pores (also referred to herein as beads within pores).
[0068] In one example, the array substrate may be a bundle of optical fibers or an array of optical fibers as described in US 6,023,540, US 6,200,737, and / or US 6,327,410. The bundle of optical fibers or the array of optical fibers may have probes attached directly to the fibers or attached by beads. Those skilled in the art will be able to readily determine which substrate is most suitable for an array in any aspect of the invention. WO2004110246 further discloses other substrates that may be used in arrays in any aspect of the invention, as well as methods for attaching beads to the substrate.
[0069] In one instance, the substrate surface may be physically modified to enable probe attachment or the creation of array sites. For example, the substrate surface may be modified to include chemically modified sites that can be used to covalently or non-covalently attach probe molecules or particles with attached probe molecules. Probes can be attached using any of a variety of methods known in the art, including inkjet printing methods, spotting techniques, photolithographic synthesis methods, or printing methods using masks. These techniques are disclosed in more detail in WO2004110246.
[0070] In one example, the array according to any aspect of the invention can be a bead-based array, wherein the beads are associated with a solid support, such as those commercially available from Illumina Corporation (San Diego, California). Bead arrays that can be used in any aspect of the invention can also be in fluid form, such as the fluid flow of a flow cytometer or similar device. Commercially available fluid forms used to distinguish the beads include, for example, those used in Luminex Corporation's XMAP™ technology or Lynx Therapeutics' MPSS™ method.
[0071] As used herein, the terms “solid support,” “support,” and “substrate” are used interchangeably and refer to a material or group of materials having a rigid or semi-rigid surface. In many instances, at least one surface of the solid support is substantially flat, although in some instances, it may be necessary to physically separate the synthesis regions of different compounds by means of, for example, holes, raised areas, pins, etched trenches, etc.
[0072] The arrays or microarrays according to any aspect of the invention can be ultra-high density arrays, such as those having from approximately 10,000,000 probes/cm² to approximately 2,000,000,000 probes/cm², or from approximately 100,000,000 probes/cm² to approximately 1,000,000,000 probes/cm². High-density arrays are particularly useful according to any aspect of the invention for containing a large number of CpG sites on the array.
[0073] Arrays according to any aspect of the invention can be used to simultaneously or sequentially analyze or evaluate multiple loci as needed. In one instance, multiple different probe molecules can be attached to a substrate or otherwise spatially distinguished in the array. Each probe is typically specific to a particular locus and can be used to distinguish the methylation state of that locus.
[0074] As used herein, the terms “probe molecule” and “probe” are used interchangeably to refer to a surface-immobilized molecule that can be recognized by a specific target. Probes used in the array may be specific to methylated alleles at CpG sites, unmethylated alleles at CpG sites, or both, or to methylated alleles at non-CpG sites, unmethylated alleles at non-CpG sites, or both.
[0075] As used herein, the term "target" refers to a molecule that has an affinity for a given probe molecule. Targets can be naturally occurring or artificial molecules. Furthermore, they can be used in an unchanged state or as aggregates. Targets can be attached covalently or non-covalently to binding members, directly or through specific binding substances. Examples of targets that can be used according to any aspect of the invention are methylated and unmethylated CpG sites. Targets are sometimes referred to in the art as anti-probes. The term "target" as used herein is not intended to create a difference in meaning.
[0076] As used herein, the term "complementary" refers to hybridization or base pairing between nucleotides or nucleic acids, such as between the two strands of a double-stranded DNA molecule, or between an oligonucleotide primer and a primer binding site on a single-stranded nucleic acid to be sequenced or amplified. Complementary nucleotides are typically A and T (or A and U), or C and G. Two single-stranded RNA or DNA molecules are considered complementary when, after optimal alignment and comparison and appropriate insertion or deletion of nucleotides, the nucleotides of one strand pair with at least about 80%, typically at least about 90% to 95%, more preferably about 98% to 100% of the nucleotides of the other strand. Perfect complementarity means 100% complementarity in sequence length. For example, a 25-base probe is perfectly complementary to a target when all 25 bases of the probe are complementary to a consecutive 25-base sequence of the target, and there is no mismatch between the probe and the target in probe length.
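The percentage thresholds in this definition can be made concrete with a small calculation. The sketch below is a simplification: it assumes DNA sequences only, both written 5'→3', of equal length, and it does not perform the optimal alignment with insertions or deletions mentioned above. The function name is hypothetical.

```python
# Watson-Crick pairs for DNA (the A-U pairing of RNA is omitted in this sketch)
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def percent_complementarity(probe: str, target_window: str) -> float:
    """Percentage of probe bases that pair with an equally long target window.

    Both sequences are written 5'->3'; because the strands pair antiparallel,
    the probe is compared against the reversed target window.
    """
    if not probe or len(probe) != len(target_window):
        raise ValueError("probe and target window must be non-empty and equal in length")
    matches = sum(
        PAIRS.get(p) == t
        for p, t in zip(probe.upper(), reversed(target_window.upper()))
    )
    return 100.0 * matches / len(probe)


perfect = percent_complementarity("AAAAC", "GTTTT")       # every base pairs
one_mismatch = percent_complementarity("AAAAC", "GTTTA")  # one of five bases fails
```

A 25-base probe would be perfectly complementary in the sense described above when this function returns 100.0 for a consecutive 25-base window of the target.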
[0077] According to one aspect of the present invention, an in vitro method for predicting CHO test cell classification is provided, the method comprising the following steps: a) determining the DNA methylation level at specific CpG sites in genomic DNA extracted from CHO test cells; and b) analyzing the methylation levels obtained in step (a) using a machine learning-based model to predict the expression level of the target phenotype in the CHO test cells; wherein the specific CpG sites are parameters of the biomarker for the target phenotype, and the classification of the CHO test cells is determined by using any aspect of the present invention (i.e., the first aspect).
[0078] According to another aspect of the present invention, an in vitro method for predicting CHO test cell classification is provided, the method comprising the following steps: a) determining the DNA methylation level of specific CpG sites in genomic DNA extracted from CHO test cells, and multiplying each level by its respective weighting coefficient to obtain the weighted methylation rate of those CpG sites; and b) analyzing the weighted methylation rates obtained in step (a) using a machine learning-based model to predict the expression level of the target phenotype in the CHO test cells; wherein the specific CpG sites and their respective weighting coefficients are parameters of the biomarker for the target phenotype, and the classification of the CHO test cells is determined by using any aspect of the present invention (i.e., the first aspect).
[0079] According to another aspect of the present invention, a computer-implemented method for establishing a biomarker for a first target phenotype of a CHO cell population is provided, the method comprising the following steps: (a) inputting the methylation values of all CpG sites in the genomic DNA of a CHO cell population, wherein the CHO cell population expresses a specific target phenotype related to cell fitness and is part of the training samples; (b) identifying specific CpG sites exhibiting consistent and reproducible methylation values from all CpG sites in step (a); and (c) using a machine learning-based model to correlate the CpG methylation level of the CpG sites obtained in step (a) with the target phenotype; thereby obtaining specific CpG sites with corresponding weighting coefficients, which are used as parameters to define the biomarker for the first target phenotype.
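The weighting step in [0078] can be illustrated as follows. This is a minimal sketch, assuming the linear-model form mentioned in [0058] (specific CpG sites, weighting coefficients, and an intercept); the site identifiers, coefficient values, and function names are hypothetical.

```python
# Hypothetical biomarker: per-site weighting coefficients plus an intercept
biomarker = {
    "weights": {"cpg_site_1": 2.0, "cpg_site_2": -1.0},
    "intercept": 0.5,
}


def weighted_methylation(methylation: dict, weights: dict) -> dict:
    """Multiply each site's methylation level by its weighting coefficient."""
    # A KeyError is raised if a biomarker site was not measured in the test cells
    return {site: weights[site] * methylation[site] for site in weights}


def linear_score(methylation: dict, marker: dict) -> float:
    """Intercept plus the sum of the weighted methylation rates (assumed linear form)."""
    weighted = weighted_methylation(methylation, marker["weights"])
    return marker["intercept"] + sum(weighted.values())


test_cell_methylation = {"cpg_site_1": 0.25, "cpg_site_2": 0.5}
weighted = weighted_methylation(test_cell_methylation, biomarker["weights"])
score = linear_score(test_cell_methylation, biomarker)
```

How such a score would be turned into a class is left open here; the text only states that a machine learning-based model analyzes the weighted rates.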
[0080] Specifically, the machine learning-based model is a classifier algorithm that can be selected from the following group: random forest, decision tree, support vector machine (SVM), K nearest neighbors (KNN), neural network, multilayer perceptron, and Gaussian mixture model.
[0081] According to another aspect of the present invention, a computer program loaded into a computer memory is provided, which implements the method according to any aspect of the present invention.
[0082] According to another aspect of the invention, a tangible computer-readable medium is provided, which contains computer-readable code that, when executed by a computer, causes the computer to perform a method according to any aspect of the invention.
[0083] According to another aspect of the invention, there is provided the use of a medium according to any aspect of the invention for predicting the classification of CHO test cells, wherein the classification of said cells is based on the expression of at least one target phenotype of said cells.
[0084] Unless otherwise stated, all percentages (%) given are mass percentages.
[0085] The embodiments described below are examples of the invention, but are not intended to limit the scope of the invention (which is obvious from the entire specification and claims) to the embodiments specified in the examples.
[0086] Brief description of the figures
Figure 1 is a graph showing IgG productivity classified into three groups by productivity level (low, medium, and high), with an equal number of samples in each group.
[0087] Figure 2 is a receiver operating characteristic (ROC) curve plotted using support vector machine (SVM) for the three productivity groups.
[0088] Examples
Example 1: Predicting productivity from CHO cells (pg/cell/day)
Wet-lab methods: In this experiment, 60 transgenic CHO clones were cultured in basal medium at 37°C, 8% CO2, and a shaking speed of 225 rpm. The 60 transgenic CHO cell lines included low producers (specific production rate <10 pg/cell/day), medium producers (specific production rate 10-20 pg/cell/day), and high producers (specific production rate >20 pg/cell/day). Culture flasks were seeded at 3E5 viable cells/mL on day 0. Appropriate feed was added on days 3, 5, 7, 9, and 11. When glucose dropped below 2 g/L, it was replenished to 6 g/L using 45% glucose. The 60 clones were cultured in fed-batch mode for 14 days. Cell counts, cell viability, and specific production rate were measured on day 14, and cell pellets were collected on day 14.
[0089] DNA extraction: DNA was extracted using the PureLink Genomic DNA Isolation Mini Kit (Invitrogen), including RNase treatment, according to the manufacturer's instructions. DNA quantity was measured using the PicoGreen assay, and DNA quality was assessed using NanoDrop (ThermoScientific) to ensure an A260/280 ratio ≤ 1.8. A small aliquot of each sample was then analyzed by automated electrophoresis on a TapeStation (Agilent) to ensure that each sample contained high molecular weight DNA.
[0090] Sequencing analysis: Genomic DNA (500 ng) from the samples was used to prepare whole-genome bisulfite sequencing (WGBS) libraries. Sequencing of the libraries was performed by a third party on the NovaSeq platform, generating 125 GB of data per sample with 20X coverage.
[0091] Bioinformatics methods
Preprocessing of whole-genome bisulfite sequencing (WGBS) data: The raw WGBS data underwent quality control, and adapter trimming was then performed using Trim Galore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Paired reads were then mapped to the reference genome [CriGri-PICRH-1.0](https://www.ncbi.nlm.nih.gov/assembly/GCF_003668045.3/) using Bismark (https://www.bioinformatics.babraham.ac.uk/projects/bismark/). Bismark was also used to perform deduplication and to extract, for each CpG position, the number of reads with a methylated CpG and the total number of reads. The methylation ratio was determined by dividing the number of methylated reads by the total number of reads.
[0092] Preprocessing of methylation array data: Data from the custom array were processed in R version 4.1.2 using sesame version 1.14.2. The DNA methylation level at each site was calculated as a methylation β value, defined as methylated signal / (methylated signal + unmethylated signal), which can be obtained with the getBetas function. The SeSAMe pipeline (Zhou et al., 2018) was used to generate normalized β values and perform quality control. Detection calls for low-intensity probes (based on p-values) were made using pOOBAH. Background subtraction based on out-of-band probes (noob; Triche et al., 2013) using normal-exponential deconvolution was also applied, with optional additional crosstalk subtraction.
[0093] Model building: A classifier is constructed to predict phenotypes based on the methylation ratio of each CpG. The classifier is based on a random forest. Hyperparameters of the random forest model, such as the number of trees and the maximum depth, are tuned using cross-validation. Model performance is evaluated using prediction score metrics, such as accuracy or the area under the receiver operating characteristic curve (ROC AUC), to determine the optimal hyperparameters. The classifier is then fitted to the full dataset using the tuned hyperparameters. Feature coefficients or feature importances are analyzed to understand their impact on phenotypes.
[0094] Example 2: Predicting growth status from CHO cells in terms of viable cell density (cells/mL)
Wet-lab methods: In this experiment, 60 transgenic CHO clones were cultured in basal medium at 37°C, 8% CO2, and a shaking speed of 225 rpm. The 60 transgenic CHO cell lines included slow-growing (viable cell density <1E7 cells/mL), medium-growing (viable cell density 1E7-3E7 cells/mL), and fast-growing (viable cell density >3E7 cells/mL) lines. Culture flasks were seeded at 3E5 viable cells/mL on day 0. Appropriate feed was added on days 3, 5, 7, 9, and 11. When glucose dropped below 2 g/L, it was replenished to 6 g/L using 45% glucose. The 60 clones were cultured in fed-batch mode for 14 days. Cell counts and viability were measured on day 7, and cell pellets were collected on day 7.
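The methylation ratio calculation in [0091] can be sketched as follows. The example assumes the tab-separated per-CpG coverage report that Bismark's methylation extraction can produce (chromosome, start, end, methylation percentage, count methylated, count unmethylated); the chromosome name and counts are made up for illustration, and the function name is hypothetical.

```python
def parse_coverage_line(line: str):
    """Parse one line of a Bismark-style coverage report.

    Assumed columns (tab-separated): chromosome, start, end,
    methylation percentage, count methylated, count unmethylated.
    Returns ((chromosome, position), methylation ratio, total reads).
    """
    chrom, start, _end, _percent, meth, unmeth = line.rstrip("\n").split("\t")
    methylated = int(meth)
    total = methylated + int(unmeth)
    # Methylation ratio = methylated reads / total reads at this position
    ratio = methylated / total if total else None
    return (chrom, int(start)), ratio, total


example_line = "chr1\t1000\t1000\t75.0\t15\t5\n"  # illustrative values only
site, ratio, total_reads = parse_coverage_line(example_line)
```

Returning the total read count alongside the ratio keeps the coverage available for any later coverage-based filtering of CpG sites.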
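The β value definition given above can be written out directly. The sketch below is in Python for illustration only (the processing described in the text was done in R with sesame); the masking step, its p-value cutoff of 0.05, and the probe identifiers are assumptions made for this example and are not stated in the text.

```python
def beta_value(methylated_signal: float, unmethylated_signal: float):
    """β = methylated signal / (methylated signal + unmethylated signal)."""
    total = methylated_signal + unmethylated_signal
    if total <= 0:
        return None  # no usable signal at this probe
    return methylated_signal / total


def masked_betas(probes: dict, max_p_value: float = 0.05) -> dict:
    """Compute β per probe and mask probes whose detection p-value is too high.

    `probes` maps a probe ID to (methylated signal, unmethylated signal,
    detection p-value). The cutoff is a hypothetical choice for this sketch.
    """
    return {
        probe_id: beta_value(m, u) if p <= max_p_value else None
        for probe_id, (m, u, p) in probes.items()
    }


toy_probes = {
    "probe_a": (3000.0, 1000.0, 0.001),  # strong signal, kept
    "probe_b": (50.0, 60.0, 0.4),        # low intensity, masked
}
betas = masked_betas(toy_probes)
```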
[0095] The growth status of the tested CHO cells (in terms of viable cell density (cells / mL)) was predicted using the same DNA extraction, sequencing analysis, and bioinformatics methodological steps as in Example 1.
[0096] Example 3
CHO cell culture: The cells used in this study belonged to the Humira431 CHO cell line (A*Star BTI), which is derived from CHO DG44 cells modified to express a therapeutic antibody (adalimumab biosimilar). Cells were seeded at 3 × 10⁵ cells/mL in 30 mL of culture medium in 125 mL shake flasks. Culture conditions were maintained at 37°C and 8% CO2, with shaking at 150 rpm. For fed-batch cultures, 10% v/v glucose-free EX-CELL Advanced CHO Feed 1 (24368C, Merck) was added every other day from day 3 to day 14, and D-(+)-glucose solution (G8769, Merck) was added when the glucose concentration dropped to 2 g/L. From day 3 until the end of culture, samples were taken daily from each culture flask for cell counting and metabolite analysis. Viable cell density (VCD) was determined using a Vi-CELL BLU analyzer (Beckman Coulter), and metabolite concentrations were measured using a Cedex Bio analyzer (Roche).
[0097] Whole-genome bisulfite sequencing (WGBS): Humira CHO samples were cultured using fed-batch or batch culture methods, IgG titers were measured, and then whole-genome bisulfite sequencing (WGBS) was performed.
[0098] Bisulfite conversion was performed using the EZ DNA Methylation-Gold™ Kit, and library preparation was performed using the VAHTS UniversalPro DNA Library Preparation Kit. Whole-genome bisulfite sequencing (WGBS) was performed using an Illumina NovaSeq 6000.
[0099] Data processing: Raw WGBS data underwent quality control using FastQC and adapter trimming using Trim Galore. Trimmed reads were then mapped to the CriGri-PICRH-1.0 reference genome using Bismark, which was also used for deduplication and β-value extraction for each CpG site. Only samples with more than 10 million CpG sites and a coverage greater than 10 were retained for subsequent analysis.
[0100] To ensure the reliability of the methylation data and its suitability for model building, all samples needed to share the same CpG sites and have sufficient coverage. Therefore, various coverage thresholds were explored to determine the optimal level. A threshold of 10 was chosen based on a balance between its stringency and the retention of a large number of CpG sites. Culture-type-related CpG sites were removed to avoid confounding effects, refining the dataset to 543,613 CpG sites from 114 samples, suitable for modeling.
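The trade-off described in [0100], between a stricter coverage threshold and the number of CpG sites shared by all samples, can be explored with a simple sweep. This is a minimal sketch with toy data; the data layout, the sample names, and the use of "greater than or equal to" for the threshold are assumptions made for illustration.

```python
def shared_sites(coverage_by_sample: dict, min_coverage: int) -> set:
    """CpG sites that reach the coverage threshold in every sample."""
    site_sets = [
        {site for site, depth in sites.items() if depth >= min_coverage}
        for sites in coverage_by_sample.values()
    ]
    return set.intersection(*site_sets) if site_sets else set()


def sweep_thresholds(coverage_by_sample: dict, thresholds) -> dict:
    """Number of shared CpG sites retained at each candidate threshold."""
    return {t: len(shared_sites(coverage_by_sample, t)) for t in thresholds}


# Toy read depths per (chromosome, position); not real data
toy_coverage = {
    "clone_A": {("chr1", 100): 12, ("chr1", 200): 4, ("chr2", 50): 30},
    "clone_B": {("chr1", 100): 15, ("chr1", 200): 11, ("chr2", 50): 9},
    "clone_C": {("chr1", 100): 10, ("chr2", 50): 22},
}
retained = sweep_thresholds(toy_coverage, [3, 5, 10, 20])
```

In the toy data, raising the threshold from 5 to 10 drops one of the two shared sites, which is the kind of loss that has to be weighed against stringency when choosing a cutoff.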
[0101] IgG productivity was categorized by productivity level into three groups of equal sample size: low, medium, and high productivity (as shown in Figure 1).
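One way to obtain three groups of equal size is to rank the samples by productivity and cut the ranking into thirds. The sketch below assumes this rank-based split; the text does not state how the group boundaries were actually drawn, and the productivity values are made up.

```python
def tertile_labels(values, labels=("low", "medium", "high")):
    """Assign each sample to one of three equally sized groups by rank."""
    n = len(values)
    # Indices of the samples, ordered from lowest to highest value
    order = sorted(range(n), key=lambda i: values[i])
    assigned = [None] * n
    for rank, index in enumerate(order):
        # rank * 3 // n maps the ranks onto group numbers 0, 1, 2
        assigned[index] = labels[rank * len(labels) // n]
    return assigned


toy_productivity = [5, 30, 12, 18, 25, 8]  # illustrative values only
groups = tertile_labels(toy_productivity)
```

With 114 samples this rule would place 38 samples in each group; ties between samples are broken by their original order.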
[0102] Machine learning model development: Three machine learning classifiers, Random Forest (RF), Logistic Regression, and Support Vector Machine (SVM), were evaluated for predicting the productivity group from the methylation dataset. The models were implemented and tuned using the scikit-learn library, with systematic hyperparameter optimization using GridSearchCV.
[0103] Results: The performance of the three models was evaluated using 5-fold cross-validation, and key metrics such as mean accuracy and mean error rate were calculated. The results are summarized in Table 1.
[0104] The support vector machine (SVM) showed the highest mean accuracy (0.792) and the lowest mean error rate (0.208) among the tested models. To further evaluate the discriminative ability of the SVM model, receiver operating characteristic (ROC) curves were plotted for each class, and the corresponding AUC values were calculated (Figure 2). The results in Figure 2 show that the SVM model provides robust classification performance, with an AUC of 0.98 for the low class, 0.86 for the medium class, and 0.94 for the high class. These results highlight the model's strong ability to distinguish between the classes.
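A comparison of this kind could be set up roughly as follows. This is a sketch only: the data are synthetic, and the parameter grids, random seeds, and variable names are assumptions chosen for illustration rather than the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for a methylation matrix: 60 samples x 30 CpG sites
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 20)                  # low / medium / high groups
X = rng.uniform(0.0, 1.0, size=(y.size, 30))  # methylation values in [0, 1]
X[:, :5] = X[:, :5] * 0.3 + y[:, None] * 0.3  # make five sites informative

# Hypothetical hyperparameter grids for the three classifiers
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [50], "max_depth": [3, None]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
    "svm": (SVC(), {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # 5-fold CV
results = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    results[name] = {
        "mean_accuracy": search.best_score_,          # mean over the 5 folds
        "mean_error_rate": 1.0 - search.best_score_,
        "best_params": search.best_params_,
    }
```

Because the same folds are used both to pick hyperparameters and to report accuracy, the reported score tends to be optimistic; nested cross-validation or a held-out test set is a common way to reduce that bias.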
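Per-class AUC values for a three-class problem are commonly obtained one-vs-rest: each class in turn is treated as positive and the other two as negative. The sketch below assumes that scheme and computes the AUC as the probability that a positive sample scores higher than a negative one; the scores and labels are toy values, and the function names are hypothetical.

```python
def binary_auc(scores, is_positive) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    if not positives or not negatives:
        raise ValueError("both positive and negative samples are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(class_scores: dict, true_labels) -> dict:
    """AUC for each class, treating that class as positive and the rest as negative."""
    return {
        cls: binary_auc(scores, [label == cls for label in true_labels])
        for cls, scores in class_scores.items()
    }


true_labels = ["low", "low", "medium", "medium", "high", "high"]
toy_scores = {  # classifier scores per class for the six samples
    "low": [0.9, 0.8, 0.3, 0.2, 0.1, 0.1],
    "medium": [0.1, 0.6, 0.5, 0.7, 0.2, 0.3],
    "high": [0.0, 0.1, 0.2, 0.1, 0.7, 0.6],
}
auc_per_class = one_vs_rest_auc(toy_scores, true_labels)
```

In the toy data the medium class gets the lowest AUC because one low sample scores higher than one medium sample, which mirrors the common pattern that the middle class is the hardest to separate.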
[0105] Table 1: Model Evaluation
Claims
1. A method for constructing a biomarker for a first target phenotype of CHO cells, the method comprising the following steps: (a) determining the methylation values of all CpG sites in the genomic DNA of a CHO cell population, wherein the CHO cell population represents the first target phenotype and is part of the training samples; (b) identifying a set of specific CpG sites from the training samples of step (a), wherein the CpG sites have consistent and reproducible methylation values; and (c) using a machine learning-based model to process the methylation values of step (b) and the target phenotype of the training samples; thereby obtaining specific CpG sites with corresponding weighting coefficients, which define the biomarker for the first target phenotype.
2. The method according to claim 1, wherein the machine learning-based model is a classifier algorithm selected from the group consisting of: random forest, decision tree, support vector machine (SVM), K nearest neighbors (KNN), neural network, multilayer perceptron, and Gaussian mixture model.
3. The method according to claim 1 or 2, wherein steps (b) and (c) are performed on a computer.
4. The method according to any one of the preceding claims, wherein steps (a)-(c) are repeated for a second target phenotype and subsequent target phenotypes, such that each target phenotype corresponds to at least one biomarker, to generate a set of biomarkers for different target phenotypes.
5. The method according to any one of the preceding claims, wherein the DNA methylation value is determined using a bead-based DNA methylation array.
6. The method according to any one of claims 1-4, wherein in step (a) bisulfite sequencing is used to determine the methylation rate and read coverage of CpG sites; and in step (b) the set of specific CpG sites is determined using a cutoff value.
7. The method of claim 6, wherein the minimum coverage cutoff value determined in step (b) is 3.
8. An in vitro method for predicting CHO test cell classification, the method comprising the following steps: a) determining the DNA methylation level at specific CpG sites in genomic DNA extracted from CHO test cells; and b) analyzing the methylation levels obtained in step (a) using a machine learning-based model to predict the expression level of the target phenotype in the CHO test cells; wherein the specific CpG sites are parameters of the biomarker for the target phenotype, and the classification of the CHO test cells is determined by the target phenotype expressed by the CHO test cells, wherein the biomarker is determined using the method of any one of claims 1 to 7.
9. The method according to any one of the preceding claims, wherein the target phenotype is selected from the group consisting of: phenotypic homogeneity, protein mass, optimal carbohydrate metabolism, optimal amino acid metabolism, optimal lipid metabolism, optimal heterologous protein production, optimal cell viability, and combinations thereof.
10. The method according to any one of the preceding claims, wherein the specific CpG sites are distributed in low methylation regions (LMR), CpG islands, variable methylation sites and / or differential methylation regions in the CHO cell genome.
11. The method according to any one of claims 8 to 10, wherein the DNA methylation level in step (a) is determined using a bead-based DNA methylation array.
12. A computer-implemented method for constructing a biomarker for a first target phenotype of a CHO cell population, the method comprising the following steps: (a) inputting the methylation values of all CpG sites in the genomic DNA of a CHO cell population, wherein the CHO cell population expresses a specific target phenotype related to cell fitness and is part of the training samples; (b) identifying specific CpG sites that exhibit consistent and reproducible methylation values from all CpG sites in step (a); and (c) using a machine learning-based model to correlate the CpG methylation level of the CpG sites obtained in step (a) with the target phenotype; thereby obtaining specific CpG sites with corresponding weighting coefficients, which are used as parameters for defining the biomarker for the first target phenotype.
13. The method of claim 12, wherein the machine learning-based model is a classifier algorithm selected from the group consisting of: random forest, decision tree, support vector machine (SVM), K nearest neighbors (KNN), neural network, multilayer perceptron, and Gaussian mixture model.
Citation Information
Patent Citations
safe fire lighting device
CH15500A
Alternative substrates and formats for bead-based array of arrays TM
US20020102578A1
Bisulfite Conversion Reagent
US20100112595A1
Method for the Selection of a Long-Term Producing Cell using Histone Acylation as Markers
US20170081732A1
Fiber optic sensor with encoded microspheres
US6023540A