Identifying disease subtypes
Machine learning algorithms are applied to high-throughput gene expression data to generate gene expression characteristics of disease endotypes. This addresses the heterogeneity of complex diseases, enables accurate identification of disease subtypes and prediction of targeted therapies, and improves the efficiency of drug discovery and clinical trials.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SANOFI SA(FR)
- Filing Date
- 2024-10-24
- Publication Date
- 2026-06-19
Description
Priority Statement
[0001] This application claims priority to European application No. 23315401.2, filed on 27 October 2023. The entire contents of the foregoing document are incorporated herein by reference.
Technical Field
[0002] This disclosure relates to the identification of disease subtypes based on gene expression data.
Background
[0003] Complex diseases can exhibit significant heterogeneity, manifested in patient subgroups that differ in pathophysiology, disease progression, and response to therapeutic treatments. This heterogeneity represents a major obstacle to translational efforts aimed at discovering novel therapies and designing and conducting clinical trials. A deeper understanding of disease at the endotype (underlying cellular and molecular mechanisms) rather than the phenotype (observable clinical / pathological characteristics) has two important implications for drug discovery. First, it accelerates the implementation of more effective and targeted therapies against specific disease endotypes that share molecular mechanisms. Second, it enables the pre-selection of patients most likely to respond to a given therapy in clinical trials based on their disease endotype, thereby increasing the probability of trial success. Disease endotypes can be explored, for example, through high-throughput gene expression data, including RNA-seq data. Therefore, systems and methods are needed for the efficient identification of disease endotypes based on high-throughput gene expression data.
Summary of the Invention
[0004] This disclosure provides systems and methods for identifying disease endotypes based on high-throughput gene expression data (e.g., transcriptome data). The systems and methods can be applied to any disease for which transcriptome profiling data from relevant tissue are available. The approach can be used to identify reproducible disease endotypes, and the resulting endotypes (or patient stratification) can be included in precision medicine strategies to assess drug treatment response by disease endotype. The systems and methods can include a computational framework for transcriptome-driven, unbiased disease subtype identification, reproducibility assessment, and phenotypic characterization. Including multiple cohorts and systematically matching subtypes across them to identify reproducible subtypes makes the approach rigorous and its results biologically interpretable. The systems and methods can also include an unsupervised machine learning approach for processing gene expression data to identify disease endotypes, supporting the development of precision medicine strategies and targeted therapies. These systems and methods provide an experimental and computational analysis framework designed to leverage drug signature genes to: i) identify reproducible disease endotypes; ii) predict responses to specific drugs for each patient endotype; and iii) discover genomics-based classifiers that assign patients to endotypes, thereby translating into applications for stratified clinical trials.
[0005] In a first aspect, this disclosure provides a method for determining a disease endotype in a subject suffering from a disease, the method comprising: defining one or more gene expression features corresponding to one or more endotypes of the disease based on gene expression data of a plurality of target genes; measuring the expression level of each of a plurality of genes in a biological sample from the subject; generating a gene expression feature of the subject based on the expression level of each of the plurality of genes in the biological sample from the subject; comparing the subject's gene expression feature with the one or more gene expression features corresponding to the one or more endotypes of the disease; and assigning a disease endotype to the subject suffering from the disease based on the comparison of the gene expression features. In some embodiments, the method further comprises predicting a response to exogenous therapy based on the disease endotype assigned to the subject suffering from the disease. In some embodiments, the gene expression data is RNA-seq data. In some embodiments, the gene expression data of the plurality of target genes is from two or more previously generated datasets. In some embodiments, the defining step further comprises using a machine learning algorithm. In some embodiments, the machine learning algorithm further comprises K-means clustering. In some embodiments, the gene expression feature is generated using K-means clustering. In some embodiments, the disease is cancer. In some embodiments, the disease is an autoimmune disease. In some embodiments, the biological sample is a blood sample. In some embodiments, the biological sample is a urine sample.
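The comparing and assigning steps of this aspect can be illustrated with a minimal sketch. It assumes, purely for illustration, that each endotype's gene expression feature is a vector of reference expression values over the same ordered genes as the subject's profile, and that the comparison is a Pearson correlation; the text does not specify the comparison metric, and the endotype names and numbers below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def assign_endotype(subject_profile, endotype_features):
    """Return the best-matching endotype label and the score for every endotype."""
    scores = {
        name: pearson(subject_profile, feature)
        for name, feature in endotype_features.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical reference features over the same five genes (illustrative values).
endotype_features = {
    "endotype_1": [2.0, 1.5, -1.0, -2.0, 0.0],
    "endotype_2": [-1.5, -1.0, 2.0, 1.0, 0.5],
}
subject_profile = [1.8, 1.0, -0.5, -1.7, 0.2]

label, scores = assign_endotype(subject_profile, endotype_features)
print(label, scores)
```

Other similarity measures (for example, Euclidean distance to a cluster centroid) could be substituted without changing the overall structure.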
[0006] In another aspect, this disclosure provides a method for treating a disease in a subject, the method comprising: obtaining the expression level of each of a plurality of genes in a biological sample obtained from the subject; assigning the subject's disease to a disease endotype based on the expression level of each of the plurality of genes in the biological sample, the disease endotype being selected from a set of endotypes previously determined based on gene expression data derived from a plurality of datasets; and administering a therapeutic agent to the subject, wherein assigning the subject's disease to the disease endotype indicates that the therapeutic agent is predicted to be effective in treating the subject's disease. In some embodiments, the biological sample is a blood sample. In some embodiments, the biological sample is a urine sample. In some embodiments, the expression level of one or more of the plurality of genes of the disease endotype is elevated relative to genes not assigned to the disease endotype. In some embodiments, the expression level of one or more of the plurality of genes of the disease endotype is decreased relative to genes not assigned to the disease endotype.
[0007] In another aspect, this disclosure provides a computer-implemented method for determining the endotype of a subject suffering from a disease, the method comprising: receiving, by a computing device including a processor programmed to execute software instructions in memory, a set of gene expression data, the set of gene expression data including the expression level of each of a plurality of genes in a biological sample obtained from the subject; applying a machine learning algorithm to generate a classification model for a classification scheme, wherein the machine learning algorithm has been trained using a stratified training-test partitioning of the set of gene expression data; applying, by the computing device, the classification scheme to rank the set of gene expression data; generating, by the computing device, a reduced set of genes capable of distinguishing endotypes; and determining the endotype of the subject suffering from the disease based on the expression levels of the reduced set of genes capable of distinguishing endotypes. In some embodiments, the method further includes predicting a response to exogenous treatment based on the endotype determined for the subject suffering from the disease. In some embodiments, the machine learning algorithm includes feature selection. In some embodiments, the feature selection is forward feature group selection (FFGS).
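As a rough illustration of this aspect, the sketch below partitions samples into training and test sets class by class and then greedily adds genes to a nearest-centroid classifier. It is a generic forward selection over single genes, not the FFGS procedure named above (whose details are not given here), and it assumes the partitioning is stratified by endotype label; the classifier, the stopping rule, and the toy data are likewise assumptions made only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices so every class contributes to both train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def centroid_accuracy(X, y, genes, fit_idx, eval_idx):
    """Accuracy of a nearest-centroid classifier restricted to the given gene indices."""
    sums = defaultdict(lambda: [0.0] * len(genes))
    counts = defaultdict(int)
    for i in fit_idx:
        counts[y[i]] += 1
        for k, g in enumerate(genes):
            sums[y[i]][k] += X[i][g]
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def sq_dist(i, c):
        return sum((X[i][g] - centroids[c][k]) ** 2 for k, g in enumerate(genes))

    correct = sum(min(centroids, key=lambda c: sq_dist(i, c)) == y[i] for i in eval_idx)
    return correct / len(eval_idx)


def forward_selection(X, y, train_idx, max_genes=2):
    """Greedily add the gene that most improves training accuracy; stop when none does."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_genes:
        scored = [(centroid_accuracy(X, y, selected + [g], train_idx, train_idx), g)
                  for g in remaining]
        score, gene = max(scored, key=lambda t: (t[0], -t[1]))
        if selected and score <= best:
            break
        selected.append(gene)
        remaining.remove(gene)
        best = score
    return selected, best


# Toy data: 8 samples x 3 genes; only gene index 1 separates the two endotypes.
X = [[5.0, 1.0, 3.0], [5.2, 1.2, 2.9], [4.9, 0.8, 3.1], [5.1, 1.1, 3.0],
     [5.0, 4.0, 3.0], [5.1, 4.2, 3.1], [4.9, 3.9, 2.9], [5.2, 4.1, 3.0]]
y = ["A"] * 4 + ["B"] * 4

train_idx, test_idx = stratified_split(y)
selected, train_acc = forward_selection(X, y, train_idx)
test_acc = centroid_accuracy(X, y, selected, train_idx, test_idx)
print(selected, train_acc, test_acc)
```

Genes are scored on the training partition only, so the test partition remains untouched until the final evaluation.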
[0008] In another aspect, this disclosure provides a computational system for determining a disease endotype of a subject suffering from a disease. The computational system includes: a data server for receiving gene expression data of a plurality of target genes and measurements of the expression levels of each of the plurality of genes from a biological sample of the subject; a computing device communicatively connected to the data server, the computing device including an application server configured to: define one or more gene expression characteristics corresponding to one or more endotypes of the disease based on the gene expression data; generate gene expression characteristics of the subject based on the expression levels of each of the plurality of genes from the biological sample of the subject; compare the subject's gene expression characteristics with one or more gene expression characteristics corresponding to one or more endotypes of the disease; and assign a disease endotype to the subject suffering from the disease based on the comparison of the gene expression characteristics; and a display communicatively connected to the computing device and configured to display a report describing the assignment of the disease endotype to the subject suffering from the disease.
[0009] In some embodiments, the application server is further configured to predict the response to exogenous treatment based on the disease endotype assigned to the subject suffering from the disease. In some embodiments, the gene expression data is RNA-seq data. In some embodiments, the gene expression data for the plurality of target genes comes from two or more previously generated datasets. In some embodiments, defining the one or more gene expression characteristics further includes using a machine learning algorithm. In some embodiments, the machine learning algorithm further includes K-means clustering. In some embodiments, the gene expression characteristics are generated using K-means clustering. In some embodiments, the disease is cancer. In some embodiments, the disease is an autoimmune disease. In some embodiments, the biological sample is a blood sample. In some embodiments, the biological sample is a urine sample.
[0010] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. This document describes the methods and materials used in this invention; other suitable methods and materials known in the art may also be used. These materials, methods, and examples are illustrative only and are not intended to be limiting. All publications, patent applications, patents, sequences, database entries, and other references mentioned herein are incorporated herein by reference in their entirety. In the event of any conflict, this specification (including definitions) shall prevail.
[0011] Other features and advantages of the invention will become apparent from the following detailed description, accompanying drawings, and claims.
Brief Description of the Drawings
[0012] This patent or application document contains at least one drawing in color. Upon request and payment of the necessary fees, the Patent Office will provide a copy of this patent or patent application publication with the color drawing.
[0013] Figure 1A is a schematic diagram illustrating steps of the method disclosed herein, wherein a biological sample is obtained from a subject and treated with an activator (e.g., a potential therapeutic agent). The biological sample may include various cell types, and gene expression profiles may be generated based on gene expression data generated from the biological sample before or after treatment with the activator.
[0014] Figure 1B is a schematic diagram illustrating steps of the method disclosed herein, in which gene expression characteristics of patient subgroups are derived from multiple datasets of gene expression data. Specifically, cluster analysis is performed and the clusters are systematically matched and validated across datasets to generate various disease endotypes based on gene expression characteristics.
[0015] Figure 2A is a schematic diagram of steps of the method disclosed herein, in which patients can be assigned to disease endotypes based on individual patient gene expression data, using gene expression characteristics and their associated disease endotypes.
[0016] Figure 2B is a schematic diagram of steps of the method disclosed herein, wherein biological samples are obtained from subjects, gene expression data are generated from the biological samples, and, based on gene expression characteristics and associated disease endotypes, a therapeutic agent predicted to be effective for a specific disease endotype can be administered to patients with that disease endotype.
[0017] Figure 3A is a schematic diagram of steps of the method disclosed herein, in which disease endotypes are determined based on gene expression data derived from each dataset.
[0018] Figure 3B is a schematic diagram of steps of the method disclosed herein, in which systematic disease endotype matching is performed across multiple datasets.
[0019] Figure 3C is a schematic diagram of steps of the method disclosed herein, wherein each disease endotype is characterized by its gene expression characteristics.
[0020] Figure 4A is a schematic diagram of steps of the method disclosed herein, in which multiple disease endotypes are determined based on gene expression data derived from multiple datasets, and subsets of disease endotypes are validated across multiple datasets to confirm the validity of the disease endotypes.
[0021] Figure 4B is a schematic diagram of steps of the method disclosed herein, in which gene expression characteristics of each cell type are compared across multiple datasets in order to characterize and validate the identified disease endotype.
[0022] Figure 5 is a diagram of computer system components that can be used to implement methods for identifying disease subtypes based on gene expression data.
[0023] Figure 6 is a diagram illustrating the use of study cohorts as training and test datasets to build classifiers, apply the classifiers, and evaluate classifier performance.
Detailed Description
[0024] Determining the gene expression characteristics of disease endotypes
[0025] This disclosure provides systems and methods for identifying disease endotypes based on high-throughput gene expression data, such as transcriptome data including expression-based analyses of multiple targets. Typically, the method includes: (a) establishing activation features based on responses to exogenous therapy; (b) determining reproducible disease endotypes based on multiple gene expression datasets and expression levels of multiple targets; (c) evaluating the activation features across multiple disease endotypes; and (d) assigning exogenous therapy to these endotypes based on activation scores of features from one or more disease endotypes. Measuring the expression levels of multiple targets in a sample may include applying the sample to a microarray. Measuring the expression levels of multiple targets in a sample may include generating RNA-seq data. In some instances, measuring expression levels may include using an algorithm. The algorithm may be used to generate a classifier. Alternatively, the classifier may include probe selection regions. In some instances, measuring the expression levels of multiple targets includes detecting and / or quantifying the multiple targets. In some embodiments, measuring the expression levels of multiple targets includes sequencing and quantifying the multiple targets. In some embodiments, determining the expression levels of multiple targets includes amplifying the multiple targets. In some embodiments, determining the expression levels of multiple targets includes quantifying the multiple targets. In some embodiments, determining the expression levels of multiple targets includes performing multiplexed reactions on the multiple targets.
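Steps (c) and (d) can be sketched as follows. The text does not define how an activation score is calculated, so this example assumes one simple possibility: the mean expression of the feature's up-regulated genes minus the mean expression of its down-regulated genes, averaged over the samples in each endotype. The gene names, sample identifiers, and values are hypothetical, and whether a high or a low score indicates a suitable therapy depends on the mechanism of the therapy.

```python
from statistics import mean


def activation_score(profile, up_genes, down_genes):
    """Mean expression of up-regulated genes minus mean of down-regulated genes."""
    up = mean(profile[g] for g in up_genes)
    down = mean(profile[g] for g in down_genes) if down_genes else 0.0
    return up - down


def score_endotypes(profiles, endotype_of, up_genes, down_genes):
    """Average the per-sample activation score within each endotype."""
    per_endotype = {}
    for sample, profile in profiles.items():
        score = activation_score(profile, up_genes, down_genes)
        per_endotype.setdefault(endotype_of[sample], []).append(score)
    return {endotype: mean(values) for endotype, values in per_endotype.items()}


# Hypothetical standardized expression values for four samples in two endotypes.
profiles = {
    "s1": {"GENE_A": 3.0, "GENE_B": 2.0, "GENE_C": -1.0},
    "s2": {"GENE_A": 2.0, "GENE_B": 3.0, "GENE_C": -2.0},
    "s3": {"GENE_A": -1.0, "GENE_B": 0.0, "GENE_C": 1.0},
    "s4": {"GENE_A": 0.0, "GENE_B": -1.0, "GENE_C": 2.0},
}
endotype_of = {"s1": "E1", "s2": "E1", "s3": "E2", "s4": "E2"}

scores = score_endotypes(profiles, endotype_of, ["GENE_A", "GENE_B"], ["GENE_C"])
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```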
[0026] In some embodiments, as illustrated in Figure 1A, biological samples can be obtained from a subject and treated with an activator (e.g., a potential therapeutic agent). Biological samples can include various cell types (e.g., cell types A, B, C, D, and E in Figure 1A), and gene expression profiles can be generated based on gene expression data produced from the biological sample before or after treatment with the activator. For example, as shown in Figure 1B, gene expression datasets can be generated for each cell type, and after cluster matching and validation, reproducible disease subtypes can be generated for each cell type based on the response to the activator.
[0027] As shown in Figure 2A, patients can be assigned to disease endotypes based on their individual gene expression data, using gene expression characteristics and their associated disease endotypes. As shown in Figure 2B, biological samples can be obtained from subjects, gene expression data can be generated from the biological samples, and, based on gene expression characteristics and associated disease endotypes, therapeutic agents predicted to be effective for a specific disease endotype can be administered to patients with that endotype.
[0028] In some embodiments, as illustrated in Figure 3A, disease endotypes are determined in an unbiased manner based on gene expression data derived from each dataset. In some embodiments, disease endotypes are determined based on 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more datasets (see Figure 3A). In some embodiments, systematic disease endotype matching is performed across multiple datasets (see Figure 3B). In some embodiments, each disease endotype is characterized by its gene expression signature (see Figure 3C).
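One way to picture the cross-dataset matching of Figure 3B is a reciprocal best-match rule between cluster centroids. The sketch below is an assumption made for illustration, not the matching procedure of this disclosure: it treats each cluster as a centroid vector over genes shared by both datasets, scores pairs by Pearson correlation, and keeps only pairs that are each other's best match. Cluster names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length centroid vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def best_match(vector, candidates):
    """Name of the candidate centroid most correlated with the given vector."""
    return max(candidates, key=lambda name: pearson(vector, candidates[name]))


def match_clusters(centroids_a, centroids_b):
    """Keep only pairs of clusters that are each other's best match."""
    best_ab = {a: best_match(v, centroids_b) for a, v in centroids_a.items()}
    best_ba = {b: best_match(v, centroids_a) for b, v in centroids_b.items()}
    return {a: b for a, b in best_ab.items() if best_ba[b] == a}


# Hypothetical centroids over four shared genes in two independent cohorts.
cohort_a = {"A1": [1.0, 2.0, 3.0, 4.0], "A2": [4.0, 3.0, 2.0, 1.0], "A3": [1.0, 3.0, 1.0, 3.0]}
cohort_b = {"B1": [8.0, 6.0, 5.0, 2.0], "B2": [0.0, 1.0, 3.0, 3.0]}

matches = match_clusters(cohort_a, cohort_b)
print(matches)
```

Under this rule a cluster with no reciprocal partner (here `A3`) is left unmatched, which corresponds to a subtype that is not reproduced in the second cohort.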
[0029] In some embodiments, high-throughput gene expression data (e.g., transcriptome data including expression-based analyses of multiple targets) are generated from biological samples. In some embodiments, the biological samples contain nucleic acids (e.g., RNA or DNA). In some instances, the multiple targets include at least about 2, at least about 3, at least about 4, at least about 5, at least about 6, at least about 7, at least about 8, at least about 9, or at least about 10 targets. In some instances, the multiple targets include coding targets, non-coding targets, or any combination thereof. In some instances, coding targets contain exon sequences. In other instances, non-coding targets contain non-exon sequences or exon sequences. Alternatively, non-coding targets contain UTR sequences, intron sequences, antisense, or non-coding RNA transcripts. In some instances, non-coding targets contain sequences that partially overlap with UTR sequences or intron sequences. Non-coding targets also contain non-exon transcripts and / or exon transcripts. Exon sequences can contain regions of protein-coding genes, such as exons, UTRs, or portions thereof. Non-exon sequences can contain regions of protein-coding genes, non-protein-coding genes, or portions thereof. For example, non-exon sequences can contain intron regions, promoter regions, intergenic regions, non-coding transcripts, exon antisense regions, intron antisense regions, UTR antisense regions, non-coding transcript antisense regions, or portions thereof. In other instances, multiple targets may contain non-coding RNA transcripts.
[0030] Multiple targets may include one or more targets selected from the classifiers disclosed herein. The classifier may be generated by one or more models or algorithms. The one or more models or algorithms may be Naive Bayes (NB), AdaBoost (Adaptive Boosting), Recursive Partitioning (Rpart), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), High-Dimensional Discriminant Analysis (HDDA), or combinations thereof. The classifier may have an area under the receiver operating characteristic (ROC) curve (AUC) equal to or greater than 0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.70, 0.75, 0.77, 0.78, 0.79, or 0.80. This AUC can be clinically significant based on its 95% confidence interval (CI). The accuracy of the classifier can be at least about 70%, 73%, 75%, 77%, 80%, 83%, 84%, 86%, 88%, or 90%. The p-value of the classifier can be less than or equal to 0.05, 0.04, 0.03, 0.02, 0.01, 0.008, 0.006, 0.004, 0.002, or 0.001.
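For orientation, the AUC referred to above can be computed directly from classifier scores as the fraction of positive/negative sample pairs that the classifier ranks correctly, with ties counted as one half. The sketch below uses hypothetical labels and scores and is independent of any particular classifier.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs: 1 = sample belongs to the endotype, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3), auc >= 0.60)
```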
[0031] Multiple targets can include one or more targets selected from a random forest (RF) classifier. Multiple targets can include two or more targets selected from a random forest (RF) classifier. Multiple targets can include three or more targets selected from a random forest (RF) classifier. Multiple targets can include 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 20, 25, 30, 35, 40, 45, 50 or more targets selected from a random forest (RF) classifier. The RF classifier can be an RF2, RF3, or RF4 classifier. The RF classifier can be an RF22 classifier (e.g., a random forest classifier with 22 targets).
[0032] Multiple targets can include one or more targets selected from the SVM classifier. Multiple targets can include 2, 3, 4, 5, 6, 7, 8, 9, 10 or more targets selected from the SVM classifier. Multiple targets can include 12, 13, 14, 15, 17, 20, 22, 25, 27, 30 or more targets selected from the SVM classifier. Multiple targets can include 32, 35, 37, 40, 43, 45, 47, 50, 53, 55, 57, 60 or more targets selected from the SVM classifier. The SVM classifier can be an SVM2 classifier.
[0033] Multiple targets can include one or more targets selected from the KNN classifier. Multiple targets can include 2, 3, 4, 5, 6, 7, 8, 9, 10 or more targets selected from the KNN classifier. Multiple targets can include 12, 13, 14, 15, 17, 20, 22, 25, 27, 30 or more targets selected from the KNN classifier. Multiple targets can include 32, 35, 37, 40, 43, 45, 47, 50, 53, 55, 57, 60 or more targets selected from the KNN classifier. Multiple targets can include 65, 70, 75, 80, 85, 90, 95, 100 or more targets selected from the KNN classifier.
[0034] In some embodiments, one or more pattern recognition methods may be used to analyze the expression levels of target sequences. Pattern recognition methods may include linear combinations or nonlinear combinations of expression levels. In some embodiments, measurements of RNA transcript expression or combinations of RNA transcript levels are constructed into linear or nonlinear models or algorithms (e.g., 'expression features') and converted into likelihood scores. These likelihood scores may indicate the probability that a biological sample originates from a patient who will benefit from a disease-specific therapy.
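A minimal sketch of converting a linear combination of expression levels into a likelihood score is shown below. It assumes a logistic transform, which is one common choice and is not stated in the text; the gene names, weights, and intercept are hypothetical placeholders rather than fitted values.

```python
from math import exp


def likelihood_score(expression, weights, intercept=0.0):
    """Map a weighted sum of expression levels to a score between 0 and 1."""
    z = intercept + sum(weights[g] * expression[g] for g in weights)
    # Numerically stable logistic function for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


# Hypothetical weights for a three-gene expression feature.
weights = {"GENE_A": 0.8, "GENE_B": -0.5, "GENE_C": 0.3}
low = likelihood_score({"GENE_A": 0.0, "GENE_B": 0.0, "GENE_C": 0.0}, weights)
high = likelihood_score({"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 1.0}, weights)
print(low, high)
```

A nonlinear model would replace the weighted sum with another function of the expression levels while keeping the same score interpretation.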
[0035] Additionally, likelihood scores can indicate the probability that a biological sample comes from a patient who is likely to exhibit endotype-specific prognosis or a response to treatment. Likelihood scores can be used to differentiate between these disease states. Models and / or algorithms can be provided in a machine-readable format and can be used to correlate expression levels or expression profiles with disease states and / or to specify treatment modalities for an individual or a group of patients.
[0036] Determining the expression levels of multiple targets can involve using algorithms or classifiers. High-throughput gene expression data, such as array data or RNA-seq data, can be managed, classified, and analyzed using techniques known in the art. Determining the expression levels of multiple targets can include probe set modeling and data preprocessing. Probe set modeling and data preprocessing can be performed using algorithms such as the robust multi-array average (RMA) algorithm or its variants GC-RMA and fRMA, the probe logarithmic intensity error (PLIER) algorithm or its variant iterPLIER, or the single-channel array normalization (SCAN) algorithm.
[0037] Variance or intensity filters can be applied when preprocessing data using the RMA algorithm, for example, by removing target sequences with a standard deviation of <10 or a mean intensity of <100 intensity units of the normalized data range, respectively.
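The filter described above can be sketched as follows, using the thresholds given in the text (standard deviation below 10 or mean intensity below 100). The sketch assumes the sample standard deviation, which the text does not specify, and the probe names and intensities are hypothetical.

```python
from statistics import mean, stdev


def filter_targets(intensities, min_sd=10.0, min_mean=100.0):
    """Keep targets whose intensities across samples pass both filters."""
    kept = {}
    for target, values in intensities.items():
        # Assumption: sample standard deviation (n - 1 denominator).
        if stdev(values) >= min_sd and mean(values) >= min_mean:
            kept[target] = values
    return kept


# Hypothetical normalized intensities for three targets across four samples.
intensities = {
    "probe_1": [500.0, 520.0, 480.0, 510.0],  # variable and well expressed
    "probe_2": [300.0, 302.0, 301.0, 299.0],  # nearly constant across samples
    "probe_3": [40.0, 90.0, 60.0, 120.0],     # variable but low mean intensity
}

kept = filter_targets(intensities)
print(sorted(kept))
```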
[0038] Alternatively, determining the expression levels of multiple targets can include using machine learning algorithms. Machine learning algorithms can include supervised learning algorithms. Examples of supervised learning algorithms can include averaged one-dependence estimators (AODE), artificial neural networks (e.g., backpropagation), Bayesian statistics (e.g., Naive Bayes classifiers, Bayesian networks, Bayesian knowledge bases), case-based reasoning, decision trees, inductive logic programming, Gaussian process regression, the group method of data handling (GMDH), learning automata, learning vector quantization, minimum message length (decision trees, decision graphs, etc.), lazy learning, instance-based learning, nearest neighbor algorithms, analogical modeling, probably approximately correct (PAC) learning, ripple-down rules (a knowledge acquisition method), symbolic machine learning algorithms, sub-symbolic machine learning algorithms, support vector machines, random forests, classifier ensembles, bagging, and boosting methods. Supervised learning can include ordinal classification, such as regression analysis and information fuzzy networks (IFN). Alternatively, supervised learning methods can include statistical classification, such as AODE, linear classifiers (e.g., Fisher linear discriminant, logistic regression, Naive Bayes classifier, perceptron, and support vector machine), quadratic classifiers, k-nearest neighbors, boosting methods, decision trees (e.g., C4.5, random forest), Bayesian networks, and hidden Markov models.
[0039] Machine learning algorithms can also include unsupervised learning algorithms. Examples of unsupervised learning algorithms can include artificial neural networks, data clustering, expectation-maximization algorithms, self-organizing maps, radial basis function networks, vector quantization, generative topographic maps, information bottleneck methods, and IBSEAD. Unsupervised learning can also include association rule learning algorithms, such as the Apriori algorithm, the Eclat algorithm, and the FP-growth algorithm. Hierarchical clustering, such as single-linkage clustering and conceptual clustering, can also be used. Alternatively, unsupervised learning can include partitional clustering, such as the K-means algorithm and fuzzy clustering. In some instances, machine learning algorithms include reinforcement learning algorithms. Examples of reinforcement learning algorithms include, but are not limited to, temporal difference learning, Q-learning, and learning automata. Alternatively, machine learning algorithms can include data preprocessing.
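Of the unsupervised algorithms listed, K-means is the one named in the summary for defining endotypes. A minimal sketch of the standard Lloyd iteration is shown below; the choice of two clusters, the seeded random initialization, and the two-gene toy profiles are assumptions for illustration and do not reflect parameters stated in this disclosure.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(n_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy expression profiles (two genes per sample) forming two separated groups.
samples = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]

assignment, centroids = kmeans(samples, k=2)
print(assignment, centroids)
```

In practice the number of clusters would itself need to be chosen and the resulting clusters validated across datasets, as described elsewhere in this document.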
[0040] In some embodiments, the machine learning algorithm may include, but is not limited to, averaged one-dependence estimators (AODE), Fisher linear discriminant analysis, logistic regression, perceptron, multilayer perceptron, artificial neural network, support vector machine, quadratic classifier, boosting method, decision tree, C4.5, Bayesian network, hidden Markov model, high-dimensional discriminant analysis, and Gaussian mixture model. The machine learning algorithm may include support vector machine, Naive Bayes classifier, k-nearest neighbor, high-dimensional discriminant analysis, or Gaussian mixture model. In some instances, the machine learning algorithm includes random forest.
[0041] Molecular typing is a method for classifying diseases into one of several genetically distinct categories or endotypes. Each endotype can respond differently to different types of treatment, and the presence of a particular endotype can predict, for example, the effectiveness of a particular therapeutic agent, a higher risk of relapse, or a good or bad prognosis for an individual's disease. As described herein, each endotype has unique molecular and clinical characteristics, as well as characteristic gene expression features. In some instances, the molecular typing methods in the systems and methods presented herein are used in combination with other biomarkers for analyzing disease endotypes.
[0042] Disease endotyping can be used to predict whether a patient will benefit from a particular therapy. For example, a patient whose cells respond to a specific activating agent in vitro may respond well to that agent clinically, and the gene expression profile detected after the in vitro experiments can be generalized to other patients whose gene expression profiles are assigned to the same endotype. Therefore, a patient whose gene expression profile corresponds to a response to an activating agent in vitro can be assigned an endotype indicating a specific clinical therapy, while a patient whose gene expression profile corresponds to a lack of response to the activating agent in vitro may be a better candidate for other clinical treatment options.
[0043] As used herein, “characteristic” or “genetic characteristic” (also referred to herein as a “feature,” “signature,” or “gene signature”) may encompass any one or more genes, one or more proteins, or epigenetic factors whose expression profile or presence is associated with a specific cell type, subtype, or cellular state within a cell population. For ease of discussion, when discussing gene expression, any one or more genes, one or more proteins, or epigenetic factors may be used interchangeably. As used herein, the terms “characteristic,” “expression profile,” and “expression program” are used interchangeably. It should be understood that, similarly, a protein (e.g., a differentially expressed protein) may fall within the definition of a “genetic” characteristic. Expression levels, activity levels, or abundance levels of a characteristic can be compared between different cells to characterize or identify, for example, a characteristic specific to a cell subpopulation. Increases or decreases in the expression, activity, or abundance of a characteristic gene can be compared between different cells to characterize or identify, for example, a specific cell subpopulation. Detection of a characteristic in a single cell can be used to identify and quantify, for example, a specific cell subpopulation. A characteristic may include one or more genes, one or more proteins, or epigenetic factors whose expression or presence is specific to a particular cell subpopulation, such that the expression or presence is unique to that cell subpopulation. Therefore, as used herein, a gene signature can refer to any set of upregulated and downregulated genes representing a cell type or subtype. A gene signature, as used herein, can also refer to any set of genes, derived from gene expression profiles, that are upregulated or downregulated between different cells or cell subpopulations. For example, a gene signature can include a list of genes that are differentially expressed between the conditions of interest being distinguished.
[0044] Characteristic detection can be performed in ex vivo patient cells after treatment with a potential therapeutic agent. Characteristic detection can be used to identify and quantify, for example, cellular responses to treatment with a potential therapeutic agent. A characteristic may include one or more genes, one or more proteins, or epigenetic factors whose expression or presence is specific to cells that respond to the potential therapeutic agent, such that the expression or presence is unique to cells that respond to the therapeutic agent. Therefore, as used herein, a genetic characteristic can refer to any set of upregulated and downregulated genes representing cells that respond to a potential therapeutic agent. For example, a genetic characteristic may include a list of genes that are differentially expressed in cells that respond to a potential therapeutic agent under ex vivo conditions, relative to cells that do not respond to the potential therapeutic agent under ex vivo conditions.
[0045] Features as defined herein (whether gene, protein, or other genetic or epigenetic features) can be used to indicate the presence of a cell type, a subtype of a cell type, the state of the microenvironment of a cell population, a specific cell type population or subpopulation, and / or the overall state of an entire cell (sub)population. Furthermore, features can indicate cells within a cell population in vivo. Features can also be used to suggest, for example, specific therapies, to track treatment, or to suggest ways of regulating the immune system. The features of the present invention can be discovered by analyzing the expression profiles of single cells within a cell population from an isolated sample (e.g., cells isolated from a patient), thereby allowing the discovery of novel cell subtypes or cell states that were previously invisible or unrecognized. The presence of a subtype or cell state can be determined by subtype-specific or cell state-specific features. The presence of these specific cell subtypes or cell states can be determined by applying the feature genes to bulk sequencing data or array data from the sample. Without being bound by any particular theory, the features of the systems and methods disclosed herein can be microenvironment-specific, such as being expressed in a specific spatiotemporal context. Without being bound by any particular theory, the features discussed herein are specific to a particular pathological context. Without being bound by any particular theory, combinations of cell subtypes with specific features can indicate a certain outcome. Without being bound by any particular theory, features can be used to deconvolve the cellular networks present under a specific pathological condition. Without being bound by any particular theory, the presence of specific cells and cell subtypes indicates a specific response to treatment, such as increased or decreased susceptibility to treatment. Features can indicate the presence of a specific cell type. In one embodiment, novel features are used to detect multiple cellular states or levels present in a cell subpopulation that are associated with a specific pathological condition, a specific outcome or progression of the disease, or a specific response to disease treatment.
[0046] Features according to certain embodiments of the present invention may comprise or consist of one or more genes, proteins, and / or epigenetic factors, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more. In some embodiments, the feature may comprise two or more genes, proteins, and / or epigenetic factors, such as 2, 3, 4, 5, 6, 7, 8, 9, 10, or more. In some embodiments, the feature may comprise three or more genes, proteins, and / or epigenetic factors, such as 3, 4, 5, 6, 7, 8, 9, 10, or more. In some embodiments, the feature may comprise four or more genes, proteins, and / or epigenetic factors, such as 4, 5, 6, 7, 8, 9, 10, or more. In some embodiments, the feature may comprise five or more genes, proteins, and / or epigenetic factors, such as 5, 6, 7, 8, 9, 10, or more. In some embodiments, the feature may comprise or consist of six or more genes, proteins, and / or epigenetic factors, such as 6, 7, 8, 9, 10, or more. In some embodiments, the feature may comprise seven or more genes, proteins, and / or epigenetic factors, such as 7, 8, 9, 10, or more. In some embodiments, the feature may comprise eight or more genes, proteins, and / or epigenetic factors, such as 8, 9, 10, or more. In some embodiments, the feature may comprise nine or more genes, proteins, and / or epigenetic factors, such as 9, 10, or more. In some embodiments, the feature may comprise ten or more genes, proteins, and / or epigenetic factors, such as 10, 11, 12, 13, 14, 15, or more. It should be understood that the feature according to the invention may also include, for example, combinations of genes or proteins and epigenetic factors.
[0047] In some embodiments, a feature is characterized as specific to cell populations responding to a potential therapeutic agent under in vitro conditions if it is upregulated or present, detected, or detectable in cell populations that respond to a potential therapeutic agent under in vitro conditions, or alternatively, if it is downregulated, absent, or undetectable in cell populations that respond to a potential therapeutic agent under in vitro conditions. In this case, the feature consists of one or more differentially expressed genes / proteins or differential epigenetic factors when comparing different cell or cell subpopulations (including comparing cells that respond to a potential therapeutic agent under in vitro conditions and cells that do not respond to a potential therapeutic agent under in vitro conditions). It should be understood that "differentially expressed" genes / proteins include upregulated or downregulated genes / proteins as well as genes / proteins that are turned on or off. When referring to upregulation or downregulation, in some embodiments, such upregulation or downregulation is preferably at least two-fold, such as two-fold, three-fold, four-fold, five-fold, or more, for example, at least ten-fold, at least 20-fold, at least 30-fold, at least 40-fold, at least 50-fold, or more. Alternatively or otherwise, differential expression can be determined based on statistical tests as commonly known in the art.
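The fold-change criterion above can be illustrated with a minimal sketch. It is not the method of the disclosure: the function name, the gene labels, the expression values, and the pseudocount of 1.0 are all hypothetical, and the inputs are assumed to be mean expression values for responding and non-responding cells.

```python
import math


def differentially_expressed(responder, non_responder, min_fold=2.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes meeting the fold threshold.

    responder / non_responder: dicts mapping gene name -> mean expression.
    All names and the pseudocount are illustrative assumptions.
    """
    threshold = math.log2(min_fold)
    hits = {}
    # Only genes measured in both groups can be compared.
    for gene in responder.keys() & non_responder.keys():
        ratio = (responder[gene] + pseudocount) / (non_responder[gene] + pseudocount)
        log2fc = math.log2(ratio)
        # Keep genes that are up- or downregulated by at least min_fold.
        if abs(log2fc) >= threshold:
            hits[gene] = log2fc
    return hits


# Hypothetical mean expression values for three placeholder genes.
responder = {"GENE_A": 39.0, "GENE_B": 4.0, "GENE_C": 10.0}
non_responder = {"GENE_A": 9.0, "GENE_B": 19.0, "GENE_C": 11.0}
signature = differentially_expressed(responder, non_responder)
```

In this toy input, GENE_A is upregulated four-fold and GENE_B is downregulated four-fold, so both pass the two-fold threshold, while GENE_C does not. A statistical test, as the paragraph notes, could be used instead of or in addition to a fixed fold threshold.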
[0048] As discussed herein, differentially expressed genes / proteins or differentially expressed epigenetic factors can be differentially expressed at the single-cell level or at the cell population level. Preferably, when the cell population level is involved, differentially expressed genes / proteins or epigenetic factors (such as those constituting the genetic characteristics discussed herein) refer to genes that are differentially expressed in all or substantially all cells of the population (e.g., at least 80%, preferably at least 90%, e.g., at least 95% of individual cells). This allows for the definition of specific subpopulations of tumor cells. As mentioned herein, a cell “subpopulation” preferably refers to a specific subset of cells of a particular cell type that can be distinguished or uniquely identified and differentiated from other cells of that cell type. Cell subpopulations can be phenotypic and are preferably characterized by features as discussed herein. Cell (sub)populations as mentioned herein can constitute cell subpopulations of a particular cell type characterized by a particular cell state.
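The population-level criterion (differential expression in at least 80% of individual cells) can be sketched as follows. This is an illustrative example only: the function names, the reference level, and the per-cell values are hypothetical, and a real analysis would typically also handle dropout and normalization.

```python
def fraction_expressing(cell_values, reference_level, min_fold=2.0):
    """Fraction of cells whose expression is at least min_fold times reference_level."""
    if not cell_values:
        raise ValueError("cell_values must not be empty")
    hits = sum(1 for value in cell_values if value >= min_fold * reference_level)
    return hits / len(cell_values)


def is_population_level_marker(cell_values, reference_level, min_fraction=0.8):
    """True if the gene is upregulated in at least min_fraction of the cells."""
    return fraction_expressing(cell_values, reference_level) >= min_fraction


# Hypothetical per-cell expression of one gene in a ten-cell subpopulation,
# compared against a hypothetical reference level of 4.0 in other cells.
cells = [12.0, 15.0, 9.0, 11.0, 2.0, 14.0, 13.0, 10.0, 16.0, 12.5]
fraction = fraction_expressing(cells, reference_level=4.0)
```

Here nine of the ten cells exceed twice the reference level, so the gene passes the 80% and 90% thresholds but would fail a 95% threshold.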
[0049] Various aspects and embodiments of the present invention may relate to the analysis of genetic, protein, and / or other genetic or epigenetic characteristics based on single-cell analysis (e.g., single-cell RNA sequencing) or, alternatively, based on cell population or bulk analysis (as defined elsewhere herein). In another aspect, the present invention relates to the genetic, protein, and / or other genetic or epigenetic characteristics of cells that are identified as responsive to therapeutic agents under in vitro conditions (as defined elsewhere herein) and that can thus predict a patient's clinical response to treatment.
[0050] The systems and methods provided herein further relate to various uses of genetic, protein, and / or other genetic or epigenetic characteristics as defined herein, and various uses of disease endotypes as defined herein. Particularly advantageous uses include methods for identifying therapeutic agents predicted to be effective against a disease endotype based on genetic, protein, and / or other genetic or epigenetic characteristics as defined herein. The systems and methods provided herein further relate to identifying therapeutic agents capable of treating patients assigned to a specific disease endotype based on genetic, protein, and / or other genetic or epigenetic characteristics as defined herein.
[0051] The genetic characteristics described herein can be used to assign patients identified as having a specific disease to a specific endotype of that disease. Furthermore, the genetic characteristics described herein can be used to predict a patient's response to a potential therapeutic agent for that specific disease endotype based on the patient's identified endotype.
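One simple way to picture assigning a patient to an endotype from a gene signature is a signed signature score. The sketch below is an assumption made for illustration, not the classifier of the disclosure: the scoring rule (mean of upregulated genes minus mean of downregulated genes), the endotype names, the gene labels, and the standardized expression values are all hypothetical.

```python
from statistics import mean


def signature_score(profile, up_genes, down_genes):
    """Mean expression of the 'up' genes minus mean expression of the 'down' genes.

    profile: dict mapping gene -> standardized expression (e.g., a z-score).
    Genes missing from the profile are ignored.
    """
    up = [profile[g] for g in up_genes if g in profile]
    down = [profile[g] for g in down_genes if g in profile]
    return (mean(up) if up else 0.0) - (mean(down) if down else 0.0)


def assign_endotype(profile, signatures):
    """Return the best-scoring endotype name and all per-endotype scores."""
    scores = {
        name: signature_score(profile, sig["up"], sig["down"])
        for name, sig in signatures.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical signatures and a hypothetical patient profile.
signatures = {
    "endotype_1": {"up": ["GENE_A", "GENE_B"], "down": ["GENE_C"]},
    "endotype_2": {"up": ["GENE_C"], "down": ["GENE_A", "GENE_B"]},
}
patient = {"GENE_A": 1.5, "GENE_B": 0.5, "GENE_C": -1.0}
label, scores = assign_endotype(patient, signatures)
```

The patient's profile matches the direction of the first signature, so `label` is `"endotype_1"`. In practice a trained machine learning model would replace this hand-written score, and a minimum score could be required before any assignment is made.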
[0052] In some embodiments, disease endotype is determined based on gene expression data derived from multiple datasets. In some embodiments, the gene expression data derived from multiple datasets is publicly available. In some embodiments, disease endotype is determined based on 2, 3, 4, 5, 6, 7, 8, 9, or 10 or more datasets (see Figure 3A). In some embodiments, systematic disease endotype matching is performed across multiple datasets (see Figure 3B). In some embodiments, each disease endotype is characterized by its gene expression signature (see Figure 3C).
[0053] In some embodiments, one, two, three, four, five, six, seven, eight, nine, or ten or more disease endotypes are determined based on gene expression data derived from multiple datasets, and subsets of the disease endotypes are validated across multiple datasets to confirm the validity of the disease endotypes. For example, four disease endotypes may be determined based on gene expression data derived from multiple datasets, and two of these four disease endotypes may be validated across all datasets (see Figure 4A). In some embodiments, the expression signatures of individual genes can be compared across multiple datasets to validate the identified disease endotype. For example, a first subset of genes (whose upregulation is associated with assigning gene expression signatures to a first disease endotype) can be compared across multiple datasets to validate the identified disease endotype, and a second subset of genes (whose downregulation is associated with assigning gene expression signatures to a second disease endotype) can be compared across multiple datasets to validate the identified disease endotype (see Figure 4B).
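Comparing the expression of individual signature genes across datasets can be sketched as a direction-concordance check. This is a simplified illustration under stated assumptions rather than the validation procedure of the disclosure: the dataset names, gene labels, and log2 fold-change values are hypothetical, and a gene counts as validated only if its direction of change matches the expected direction in every dataset.

```python
def concordant_genes(log2fc_by_dataset, expected_direction):
    """Genes whose log2 fold change has the expected sign in every dataset.

    log2fc_by_dataset: {dataset name: {gene: log2 fold change}}
    expected_direction: {gene: +1 for upregulated, -1 for downregulated}
    """
    validated = []
    for gene, direction in expected_direction.items():
        values = [dataset.get(gene) for dataset in log2fc_by_dataset.values()]
        # A gene missing from any dataset cannot be validated there.
        if all(v is not None and v * direction > 0 for v in values):
            validated.append(gene)
    return sorted(validated)


# Hypothetical per-dataset log2 fold changes for one endotype versus the rest.
log2fc_by_dataset = {
    "dataset_1": {"GENE_A": 1.8, "GENE_B": -1.2, "GENE_C": 0.9},
    "dataset_2": {"GENE_A": 2.1, "GENE_B": -0.7, "GENE_C": -0.3},
}
expected_direction = {"GENE_A": 1, "GENE_B": -1, "GENE_C": 1}
validated = concordant_genes(log2fc_by_dataset, expected_direction)
```

GENE_A (up) and GENE_B (down) keep their direction in both datasets, while GENE_C flips sign in the second dataset and is therefore not validated.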
[0054] In some embodiments, the systems and methods provided herein involve targeted nucleic acid profiling (e.g., sequencing, quantitative reverse transcription polymerase chain reaction, etc.). In some embodiments, the target nucleic acid molecule (e.g., RNA molecule) can be sequenced by any method known in the art, such as high-throughput sequencing, also known as next-generation sequencing or deep sequencing. Nucleic acid target molecules labeled with barcodes (e.g., source-specific barcodes) can be sequenced together with the barcodes to produce single reads and / or contigs containing both the target molecule and the barcode sequence, or portions thereof. Exemplary next-generation sequencing technologies include, for example, Illumina sequencing, Ion Torrent sequencing, 454 sequencing, SOLiD sequencing, and nanopore sequencing.
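Because a barcode is read together with its target sequence, reads can be grouped by source after sequencing. The sketch below assumes, purely for illustration, that each read begins with a fixed-length barcode that must match a known barcode exactly; the barcodes, sample names, and read sequences are hypothetical, and real demultiplexing tools usually also tolerate sequencing errors.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_source, barcode_length):
    """Group target sequences by the source encoded in each read's barcode prefix."""
    by_source = defaultdict(list)
    for read in reads:
        barcode, target = read[:barcode_length], read[barcode_length:]
        source = barcode_to_source.get(barcode)
        # Reads with an unrecognized barcode are skipped.
        if source is not None:
            by_source[source].append(target)
    return dict(by_source)


# Hypothetical source-specific barcodes and reads.
barcode_to_source = {"ACGT": "sample_1", "TTAG": "sample_2"}
reads = ["ACGTGGGCCC", "TTAGAAATTT", "ACGTCCCGGG", "GGGGTTTTAA"]
grouped = demultiplex(reads, barcode_to_source, barcode_length=4)
```

The last read carries no known barcode and is dropped; the remaining three reads are assigned to their two sources.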
[0055] In some embodiments, targeted nucleic acid profiling is performed on biological samples obtained from a subject. Biological samples may be, for example, peripheral blood samples, urine samples, saliva samples, cerebrospinal fluid (CSF) samples, or tissue biopsies. Biopsies may be, for example, needle aspiration biopsy, fine-needle aspiration, core needle biopsy, vacuum-assisted biopsy, image-guided biopsy, endoscopic biopsy, skin biopsy, bone marrow biopsy, or surgical biopsy.
[0056] The methods disclosed herein can be used to identify disease subtypes based on gene expression data from tissues and cell types including, but not limited to: peripheral blood mononuclear cells (PBMCs), including T cells and NK cells (TNK), monocytes and macrophages (MPh), B cells and plasma cells (BPC), neutrophils (Neut), erythrocytes (Eryth), megakaryocytes (Mega), and hematopoietic stem cells (HSC); macrophages and monocytes, including classical (CD14+, CD16-), intermediate (CD14+, CD16+), and non-classical (CD14-, CD16+) monocytes; B cells, including naive B cells (B. naive), memory B cells (MBC), and plasma cells (PC); cardiac tissue, including cardiac fibroblasts (CF), cardiomyocytes (CMC), endothelial cells (Endo), hepatocytes (Hep), macrophages (Mac), and smooth muscle cells (SMC); kidney tissue, including B cells (B), endothelial cells (Endo), epithelial cells (Epit), macrophages (Mac), and T cells and natural killer cells (TNK); and lung tissue, including adventitial cells (AC), alveolar type 1 and type 2 cells (ATC), basal cells (Basal), basophils (Baso), B cells and plasma cells (BPC), club cells (CC), endothelial cells (Endo), fibroblasts (Fibro), goblet, serous, and mucous cells (GSM), lung ciliated cells (LCC), mesothelial cells (MC), monocytes, dendritic cells, and macrophages (MPh), neutrophils (Neut), pericytes (PC), pulmonary ionocytes (PI), smooth muscle cells (SMC), and T cells and natural killer cells (TNK).
[0057] In some embodiments, the methods disclosed herein can be used to identify disease subtypes based on gene expression data in tissues including, but not limited to, the prostate, lung, pancreas, cervix, kidney, salivary glands, uterus, stomach, thyroid, sinuses, middle and inner ear, adrenal glands, appendix, hematopoietic system, bones and joints, spinal cord, mammary glands, cerebellum, connective and soft tissues, uterine body, esophagus, eye, nose, eyeball, fallopian tube, extrahepatic bile duct, mouth, intrahepatic bile duct, kidney, appendix, larynx, lips, liver, lungs and bronchi, lymph nodes, brain, spinal cord, nasal cartilage, retina, oropharynx, endocrine glands, female reproductive organs, ovary, penis and scrotum, pituitary gland, pleura, rectum, renal pelvis, ureter, peritoneum, salivary glands, skin, small intestine, testes, thymus, thyroid, tongue, unknown, bladder, uterus, vagina, labia, or vulva. In some embodiments, the sample comprises cells selected from the group consisting of: fat, adrenal cortex, adrenal gland, adrenal medulla, appendix, bladder, blood, blood vessels, bone, osteochondral, brain, mammary gland, cartilage, cervix, colon, sigmoid colon, dendritic cells, skeletal muscle, endometrium, esophagus, fallopian tube, fibroblasts, gallbladder, kidney, larynx, liver, lung, lymph nodes, melanocytes, mesothelial lining, myoepithelial cells, osteoblasts, ovary, pancreas, parotid gland, prostate, salivary gland, sinus tissue, skeletal muscle, skin, small intestine, smooth muscle, stomach, synovium, joint lining tissue, tendon, testis, thymus, thyroid gland, uterus, and uterine body. In some embodiments, tissue is collected from healthy subjects. In some embodiments, tissue is collected from subjects with known or suspected cancer. In some embodiments, tissue is collected from solid tumors or liquid tumors.
[0058] In some embodiments, the methods disclosed herein can be used to identify disease subtypes based on gene expression data for diseases including, but not limited to, autoimmune myocarditis, anti-glomerular basement membrane nephritis, lupus nephritis, interstitial cystitis, autoimmune hepatitis, primary biliary cholangitis, primary sclerosing cholangitis, antisynthetase syndrome, alopecia areata, autoimmune angioedema, autoimmune progesterone dermatitis, autoimmune urticaria, bullous pemphigoid, cicatricial pemphigoid, dermatitis herpetiformis, discoid dermatitis, systemic lupus erythematosus, epidermolysis bullosa, erythema nodosum, gestational pemphigoid, hidradenitis suppurativa, lichen planus, lichen sclerosus, linear IgA disease, morphea, pemphigus vulgaris, psoriasis, systemic scleroderma, vitiligo, Addison's disease, autoimmune polyendocrine syndrome type 1, autoimmune polyendocrine syndrome type 2, autoimmune polyendocrine syndrome type 3, autoimmune pancreatitis, type 1 diabetes mellitus, autoimmune thyroiditis, Ord's thyroiditis, Graves' disease, autoimmune oophoritis, endometriosis, autoimmune orchitis, autoimmune enteropathy, celiac disease, Crohn's disease, achalasia, microscopic colitis, ulcerative colitis, antiphospholipid syndrome, aplastic anemia, autoimmune hemolytic anemia, autoimmune lymphoproliferative syndrome, autoimmune neutropenia, autoimmune thrombocytopenic purpura, cold agglutinin disease, primary mixed cryoglobulinemia, Evans syndrome, pernicious anemia, pure red cell aplasia, thrombocytopenia, adiposis dolorosa, adult-onset Still's disease, ankylosing spondylitis, CREST syndrome, drug-induced lupus, enthesitis-related arthritis, eosinophilic fasciitis, Felty syndrome, IgG4-related disease, juvenile arthritis, mixed connective tissue disease (MCTD), palindromic rheumatism, Parsonage-Turner syndrome, relapsing polychondritis, retroperitoneal fibrosis, rheumatic fever, rheumatoid arthritis, sarcoidosis, Schnitzler syndrome, systemic lupus erythematosus (SLE), undifferentiated connective tissue disease (UCTD), dermatomyositis, fibromyalgia, inclusion body myositis, myositis, myasthenia gravis, neuromyotonia, paraneoplastic cerebellar degeneration, polymyositis, acute disseminated encephalomyelitis (ADEM), acute motor axonal neuropathy, anti-N-methyl-D-aspartate (anti-NMDA) receptor encephalitis, Balo concentric sclerosis, Bickerstaff encephalitis, chronic inflammatory demyelinating polyneuropathy, Guillain-Barré syndrome, Hashimoto's encephalopathy, idiopathic inflammatory demyelinating diseases, Lambert-Eaton myasthenic syndrome, multiple sclerosis, Oshtoran syndrome, progressive inflammatory neuropathy, restless legs syndrome, stiff-person syndrome, Sydenham's chorea, transverse myelitis, autoimmune retinopathy, autoimmune uveitis, Cogan syndrome, Graves' eye disease, intermediate uveitis, ligneous conjunctivitis, Mooren's ulcer, neuromyelitis optica, opsoclonus-myoclonus syndrome, optic neuritis, scleritis, Susac's syndrome, sympathetic ophthalmia, Tolosa-Hunt syndrome, autoimmune inner ear disease (AIED), Meniere's disease, Behçet's disease, eosinophilic granulomatosis with polyangiitis (EGPA), giant cell arteritis, granulomatosis with polyangiitis (GPA), IgA vasculitis (IgAV), Kawasaki disease, leukocytoclastic vasculitis, lupus vasculitis, rheumatoid vasculitis, microscopic polyangiitis (MPA), polyarteritis nodosa (PAN), polymyalgia rheumatica, urticarial vasculitis, vasculitis, and primary immunodeficiency.
[0059] Treatment
[0060] Those skilled in the art will understand that treatments, as mentioned herein, encompass enhancing or improving therapeutic efficacy. Treatments may include improving symptoms of a disease, slowing, halting, or reversing disease progression, or inhibiting or mitigating other harmful effects associated with the disease. The effectiveness of a treatment is determined in conjunction with any known methods used to diagnose or treat a particular disease. The systems and methods provided herein encompass therapeutic approaches that include any of the methods or uses discussed herein.
[0061] As used herein, the phrase "therapeutically effective amount" refers to a non-toxic amount of a drug, agent, or compound that is sufficient to provide the desired therapeutic effect. As used herein, "patient" means anyone who is receiving or can receive medical treatment. The therapies or treatments according to the invention can be performed alone or in combination with another therapy and can be provided at home, in a doctor's office, clinic, hospital outpatient department, or hospital. Treatment is typically initiated in a hospital so that the physician can closely monitor the effects of the therapy and make any necessary adjustments. The duration of the therapy depends on the patient's age and physical condition, the stage of the disease, and the patient's response to treatment. Furthermore, individuals at higher risk of developing the disease (e.g., those with a genetic predisposition to the disease) may receive preventative treatment to suppress or delay the onset of symptoms.
[0062] Computer implementation of the method
[0063] Figure 5 is a diagram of components of a computer system 500 that can be used to implement methods for identifying disease subtypes based on gene expression data (e.g., RNA-seq data or gene expression array data).
[0064] Computing device 500 is intended to represent various forms of digital computers (such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers). Computing device 550 is intended to represent various forms of mobile devices (such as personal digital assistants, mobile phones, smartphones, and other similar computing devices). Additionally, computing device 500 or 550 may include a Universal Serial Bus (USB) flash drive. The USB flash drive may store an operating system and other applications. The USB flash drive may include input / output components, such as a wireless transmitter or USB connector that can be plugged into a USB port of another computing device. The components shown herein, their connections and relationships, and their functions are intended to be exemplary only and are not intended to limit the implementation of the methods and compositions described and / or claimed in this document.
[0065] Computing device 500 includes a processor 502, a memory 504, a storage device 506, a high-speed interface 508 connected to the memory 504 and a high-speed expansion port 510, and a low-speed interface 512 connected to a low-speed bus 514 and the storage device 506. The various components 502, 504, 506, 508, 510, and 512 are interconnected using multiple buses and may be mounted on a shared motherboard or in other ways, as appropriate. The processor 502 can process instructions for execution within the computing device 500 (including instructions stored in the memory 504 or on the storage device 506) to display graphical information for a GUI on an external input / output device, such as a display 516 coupled to the high-speed interface 508. In other embodiments, multiple processors and / or multiple buses, as well as multiple memories and multiple memory types, may be used as appropriate. Furthermore, multiple computing devices 500 may be connected, each providing a portion of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).
[0066] The memory 504 stores information within the computing device 500. In one embodiment, the memory 504 is one or more volatile memory cells. In another embodiment, the memory 504 is one or more non-volatile memory cells. The memory 504 may also be another form of computer-readable medium (such as a magnetic disk or optical disk).
[0067] Storage device 506 provides large-capacity storage for computing device 500. In one embodiment, storage device 506 may be or contain computer-readable media, such as floppy disk devices, hard disk devices, optical disk devices, magnetic tape devices, flash memory or other similar solid-state storage devices or device arrays (including storage area networks or other configured devices). The computer program product may be tangibly embodied in an information carrier. The computer program product may also contain instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer-readable medium or a machine-readable medium (such as memory 504, storage device 506, or memory on processor 502).
[0068] High-speed controller 508 manages bandwidth-intensive operations of computing device 500, while low-speed controller 512 manages less bandwidth-intensive operations. This functional allocation is merely an example. In one embodiment, high-speed controller 508 is coupled to memory 504, display 516 (e.g., via a graphics processor or accelerator), and high-speed expansion port 510, which can accept various expansion cards (not shown). In this embodiment, low-speed controller 512 is coupled to storage device 506 and low-speed expansion port 514. The low-speed expansion port, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled, for example via a network adapter, to one or more input / output devices, such as keyboards, pointing devices, microphone / speaker pairs, scanners, or networking devices (such as switches or routers).
[0069] The computing device 500 can be implemented in a variety of different forms, as shown in the figure. For example, it can be implemented as a standard server 520, or multiple times in a group of such servers. It can also be implemented as part of a rack server system 524. Furthermore, it can be implemented in a personal computer such as a laptop computer 522. Alternatively, components from the computing device 500 can be combined with other components in a mobile device (not shown), such as device 550. Each such device can contain one or more of the computing devices 500, 550, and the entire system can consist of multiple computing devices 500, 550 communicating with each other.
[0070] The computing device 550 includes a processor 552, memory 564, input / output devices (such as a display 554), a communication interface 566, a transceiver 568, and other components. The device 550 may also be equipped with storage devices (such as microdrives or other devices) to provide additional storage. The various components 550, 552, 564, 554, 566, and 568 are interconnected using multiple buses, and several components may be mounted on a shared motherboard or otherwise, as appropriate.
[0071] Processor 552 can execute instructions within computing device 550 (including instructions stored in memory 564). The processor can be implemented as a chipset of chips that includes separate and multiple analog and digital processors. Furthermore, the processor can be implemented using any of a variety of architectures. For example, processor 552 can be a CISC (Complex Instruction Set Computer) processor, a RISC (Reduced Instruction Set Computer) processor, or a MISC (Minimal Instruction Set Computer) processor. The processor can provide, for example, coordination of the other components of device 550 (such as control of the user interface, applications run by device 550, and wireless communications performed by device 550).
[0072] Processor 552 can communicate with the user via control interface 558 and display interface 556 coupled to display 554. Display 554 can be, for example, a TFT LCD (Thin-Film-Transistor Liquid Crystal Display) or OLED (Organic Light Emitting Diode) display, or other suitable display technology. Display interface 556 can include suitable circuitry for driving display 554 to present graphics and other information to the user. Control interface 558 can receive commands from the user and convert them for submission to processor 552. Additionally, an external interface 562 can be provided in communication with processor 552, enabling device 550 to perform near-area communication with other devices. External interface 562 can provide wired communication in some embodiments or wireless communication in others, and multiple interfaces can also be used.
[0073] Memory 564 stores information within computing device 550. Memory 564 may be implemented as one or more computer-readable media, one or more volatile memory cells, or one or more non-volatile memory cells. Extended memory 574 may also be provided and connected to device 550 via an extended interface 572, which may include, for example, a SIMM (Single In-line Memory Module) card interface. Such extended memory 574 may provide additional storage space for device 550, or it may also store applications or other information for device 550. In particular, extended memory 574 may include instructions for performing or supplementing the processes described above, and may also include security information. Thus, for example, extended memory 574 may be provided as a security module for device 550 and may be programmed with instructions that allow secure use of device 550. Additionally, secure applications and other information (such as placing identification information on the SIMM card in an unbreakable manner) may be provided via a SIMM card.
[0074] The memory may include, for example, flash memory and / or NVRAM memory, as discussed below. In one embodiment, the computer program product is tangibly embodied in an information carrier. The computer program product includes instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer-readable or machine-readable medium, such as memory 564, extended memory 574, or memory on processor 552 that can be received, for example, via transceiver 568 or external interface 562.
[0075] Device 550 can communicate wirelessly via communication interface 566, which may include a digital signal processing circuit system if necessary. Communication interface 566 can provide communication under various modes or protocols, such as GSM voice calls, SMS, EMS, or MMS messaging services, CDMA, TDMA, PDC, WCDMA, CDMA2000, or GPRS. Such communication can occur, for example, via radio frequency transceiver 568. Alternatively, short-range communication can be performed using transceivers such as Bluetooth, Wi-Fi, or others (not shown). Furthermore, GPS (Global Positioning System) receiver module 570 can provide additional navigation-related and positioning-related wireless data to device 550, which can be used, as appropriate, by applications running on device 550.
[0076] Device 550 can also use audio codec 560 for audible communication, which can receive voice information from a user and convert it into usable digital information. Audio codec 560 can also generate audible sound for the user, such as through a speaker (e.g., in the handheld portion of device 550). Such sound can include sounds from voice telephone calls, recorded sounds (e.g., voice messages, music files, etc.), and sounds generated by applications operating on device 550.
[0077] The computing device 550 can be implemented in a variety of different forms, as shown in the figure. For example, it can be implemented as a mobile phone 580. It can also be implemented as a smartphone 582, a personal digital assistant, or part of another similar mobile device.
[0078] Various implementations of the systems and methods described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (Application-Specific Integrated Circuits), computer hardware, firmware, software, and / or combinations of such implementations. These implementations may include implementation in one or more computer programs that are executable and / or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0079] These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in high-level procedural languages and / or object-oriented programming languages, and / or in assembly / machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and / or apparatus (e.g., disk, optical disk, memory, programmable logic device (PLD)) used to provide machine instructions and / or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and / or data to a programmable processor.
[0080] To provide interaction with the user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including auditory, voice, or tactile input.
[0081] The systems and technologies described herein can be implemented in computing systems that include back-end components (e.g., as data servers), middleware components (e.g., application servers), or front-end components (e.g., client computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected via any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), and the Internet.
[0082] A computing system may include clients and servers. Clients and servers are typically geographically separated and usually interact via a communication network. The client-server relationship is established through computer programs running on the respective computers and having a client-server relationship with each other.
[0083] Several embodiments have been described. However, it should be understood that many changes can be made without departing from the spirit and scope of the invention. Furthermore, the logical flows depicted in the drawings do not require the specific order or sequence shown to achieve the desired results. Additionally, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to or removed from the described systems. Accordingly, other embodiments fall within the scope of the appended claims.
[0084] The embodiments disclosed herein and all functional operations described herein may be implemented in digital electronic circuitry, or in computer software, firmware, or hardware (including the structures disclosed herein and their structural equivalents), or in a combination of one or more of them. Embodiments of these methods and compositions may be implemented as one or more computer program products, such as one or more modules of computer program instructions encoded on a computer-readable medium for execution by, or to control the operation of, a data processing device. The computer-readable medium may be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them. The term "data processing device" encompasses all devices, apparatuses, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. In addition to hardware, the device may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. A propagated signal is an artificially generated signal, such as a machine-generated electrical, optical, or electromagnetic signal, generated to encode information for transmission to a suitable receiver device.
[0085] Computer programs (also known as programs, software, software applications, scripts, or code) can be written in any programming language, including compiled or interpreted languages, and can be deployed in any form, including as standalone programs or as modules, components, subroutines, or other units suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinating files (e.g., files storing one or more modules, subroutines, or portions of code). A computer program can be deployed to be executed on a single computer or on multiple computers located at a site or distributed across multiple sites and interconnected via a communication network.
[0086] The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform operations on input data and generate outputs. These processes and logic flows can also be performed by devices, and the devices can be implemented as special-purpose logic circuits, such as FPGAs (Field-Programmable Gate Arrays) or ASICs (Application-Specific Integrated Circuits).
[0087] Processors suitable for executing computer programs include, by way of example only, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Typically, a processor receives instructions and data from read-only memory or random access memory, or both. The essential components of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data (e.g., magnetic disks, magneto-optical disks, or optical disks). However, a computer does not necessarily need to have such devices. Furthermore, a computer can be embedded in another device, such as a tablet computer, mobile phone, personal digital assistant (PDA), mobile audio player, or global positioning system (GPS) receiver. Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks, such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
[0088] To provide user interaction, embodiments of this disclosure can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer. Other types of devices can also be used to provide user interaction; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including auditory, voice, or tactile input.
[0089] The embodiments disclosed herein can be implemented in computing systems that include back-end components (e.g., as a data server), middleware components (e.g., an application server), front-end components (e.g., a client computer with a graphical user interface or a web browser through which a user can interact with implementations of the method), or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected via any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (“LANs”) and wide area networks (“WANs”), such as the Internet.
[0090] A computing system may include clients and servers. Clients and servers are typically geographically separated and usually interact via a communication network. The client-server relationship is established through computer programs running on the respective computers and having a client-server relationship with each other.
[0091] While this specification contains numerous details, these should not be construed as limiting the scope of the invention or the scope that may be claimed, but rather as descriptions of features specific to particular embodiments of the invention. Certain features described herein in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, different features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments. Furthermore, although features may be described above as functioning in certain combinations and even initially claimed in this way, one or more features from a claimed combination may be removed from that combination in some cases, and the claimed combination may involve sub-combinations or variations thereof.
[0092] Similarly, although the operations are depicted in a specific order in the accompanying drawings, this should not be construed as requiring that such operations be performed in the specific order shown or in sequential order, or that all of the operations shown be performed, to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of different system components in the above embodiments should not be construed as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0093] In each instance where an HTML file is mentioned, other file types or formats can be substituted. For example, an HTML file can be replaced with XML, JSON, plain text, or other file types. Furthermore, when a table or hash table is mentioned, other data structures (such as spreadsheets, relational databases, or structured files) can be used.
[0094] Example 1 –
[0095] Figures 1A to 2B highlight the four phases of the methods and systems disclosed herein:
[0096] • Phase 1: Generate pathway activation signatures through genome-wide transcriptome profiling of in vitro or in vivo cell models treated with drugs such as pathway activators and pathway blockers (Figure 1A).
[0097] • Phase 2: Identify reproducible disease endotypes across multiple study cohorts (Figure 1B).
[0098] • Phase 3: Assess pathway activation signatures in the disease endotypes identified in Phase 2 (Figure 2A).
[0099] • Phase 4: Develop molecular classifiers for the disease endotypes prioritized in Phase 3 to guide clinical trial design (Figure 2B).
[0100] The four phases are described in detail below.
[0101] Phase 1: Generating drug response gene signatures
[0102] To generate gene signatures for drug responses, genome-wide expression profiling of drug-treated in vitro or in vivo models was used to identify pharmacodynamic markers hypothesized to indicate drug responsiveness. The experimental design is shown in Table 1 below. Profiling of samples treated at different time points was performed to identify genes that were upregulated or downregulated over time.
[0103] Table 1. Study design for genomic profiling studies used for drug characterization.
[0104]
[0105] A drug signature is defined as the set of genes that are persistently upregulated upon ligand stimulation / pathway activation and whose expression changes are reversed upon drug addition (Figure 1A). Statistical modeling methods (such as linear mixed-effects (LME) modeling) were used to assess the likelihood that each gene is affected by ligand stimulation and drug treatment, while accounting for individual donor variability. Genes were filtered using an FDR threshold of 0.05 and an effect size (fold change) threshold of 1.5 to generate the drug signature. Finally, additional analyses, such as pathway enrichment, were performed to evaluate the drug gene signature relative to pathways reported in the published literature.
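As a non-limiting illustration, the gene filtering step described above can be sketched in Python as follows. The sketch assumes that per-gene p-values and fold changes have already been obtained from a model such as the LME model, that FDR control is performed with the Benjamini-Hochberg procedure (the source does not name the procedure), and that "reversed" means a fold change of at most 1 / 1.5 after drug addition. The names `GeneStats`, `benjamini_hochberg`, and `drug_signature` are hypothetical.

```python
from dataclasses import dataclass


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


@dataclass
class GeneStats:
    gene: str
    fc_stim: float  # fold change, stimulated vs. unstimulated (assumed input)
    p_stim: float   # p-value for the stimulation effect
    fc_drug: float  # fold change, stimulated + drug vs. stimulated
    p_drug: float   # p-value for the drug effect


def drug_signature(stats, fdr=0.05, fold=1.5):
    """Keep genes upregulated by stimulation and reversed by the drug."""
    q_stim = benjamini_hochberg([s.p_stim for s in stats])
    q_drug = benjamini_hochberg([s.p_drug for s in stats])
    selected = []
    for s, qs, qd in zip(stats, q_stim, q_drug):
        upregulated = qs < fdr and s.fc_stim >= fold
        reversed_by_drug = qd < fdr and s.fc_drug <= 1.0 / fold
        if upregulated and reversed_by_drug:
            selected.append(s.gene)
    return selected
```

A time-course design, as in Table 1, would apply the same filter at each time point and retain genes that pass persistently; that extension is omitted here for brevity.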
[0106] Phase 2: Identification of reproducible disease endotypes based on genome-wide expression profiling of patient samples
[0107] For a given disease, a collection of transcriptomic profiling data is generated from multiple cohorts, and these cohorts are used to assess the reproducibility of the identified disease endotypes. In addition to transcriptomic profiling, data such as patient demographics, clinical measurements, and treatment responses are used to guide the characterization of different disease endotypes. A data-driven approach is applied to identify and characterize disease endotypes. This approach includes three steps:
[0108] Step 1 – Unbiased cluster analysis, used to identify patient clusters within each cohort (Figure 3A)
[0109] Starting with transcriptomic profiling data from diseased tissues, k-means clustering analysis incorporating the GAP statistic was used to identify patient subgroups within each cohort. The number of clusters (K) for each cohort was determined using the GAP statistic. Each dataset (or cohort) was analyzed independently to identify patient subgroups from each dataset.
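The selection of K described above can be sketched, purely for illustration, with a small standard-library implementation of k-means and the gap statistic. The sketch assumes the commonly used formulation in which reference datasets are drawn uniformly from the bounding box of the data and the smallest K satisfying Gap(K) ≥ Gap(K+1) − s(K+1) is chosen; the source does not specify these details, and the function names and default parameters are hypothetical. Each patient is represented as a tuple of expression values.

```python
import math
import random


def kmeans_wss(points, k, rng, n_iter=50, n_init=5):
    """Run Lloyd's k-means n_init times; return the lowest within-cluster sum of squares."""
    best = math.inf
    for _ in range(n_init):
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
                clusters[nearest].append(p)
            # Recompute centroids; an empty cluster keeps its previous center.
            new_centers = [
                tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[j]
                for j, c in enumerate(clusters)
            ]
            if new_centers == centers:
                break
            centers = new_centers
        wss = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
        best = min(best, wss)
    return best


def choose_k_by_gap(points, k_max, n_refs=10, seed=0):
    """Pick the number of clusters for one cohort using the gap statistic."""
    rng = random.Random(seed)
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform over the bounding box of the observed data.
            ref = [
                tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                for _ in points
            ]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((x - mean_ref) ** 2 for x in ref_logs) / n_refs)
        gaps.append(mean_ref - log_w)
        errs.append(sd * math.sqrt(1 + 1 / n_refs))
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k
    return k_max
```

In keeping with the description above, such a function would be called once per cohort, so that each dataset receives its own value of K.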
[0110] Step 2 – Perform systematic subgroup matching across multiple cohorts to identify reproducible disease endotypes (Figure 3B)
[0111] To match patient subgroups from different study cohorts, a systematic comparison was performed between all subgroups from all cohorts, and subgroups from each cohort were grouped into subtypes (i.e., endotypes) that were stable and consistent across multiple cohorts. This systematic comparison was performed by assessing the consistency or correlation of subgroup-specific gene expression changes. Patient subgroups from different cohorts were considered matched (having similar gene expression signatures) when they were each other's best match and had a Pearson correlation coefficient greater than 0.7, and were then grouped into an endotype. An example of finding consistent disease endotypes across multiple cohorts is shown in Figure 4A. In the example of Figure 4A, subgroups A and B were consistently present in all study cohorts and were considered candidate disease endotypes.
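The matching rule described above can be illustrated with the following minimal Python sketch for two cohorts. It assumes that each subgroup is summarized by a vector of subgroup-specific expression changes over the same ordered gene list, and it interprets "best matched" as a reciprocal best match; both are assumptions rather than details stated in the source. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def match_subgroups(cohort_a, cohort_b, threshold=0.7):
    """Match subgroups of two cohorts.

    cohort_a, cohort_b: dict mapping subgroup name -> list of per-gene
    expression changes, with genes in the same order in both cohorts.
    Returns (subgroup_a, subgroup_b, r) for reciprocal best matches with
    r above the threshold.
    """
    r = {
        (a, b): pearson(va, vb)
        for a, va in cohort_a.items()
        for b, vb in cohort_b.items()
    }
    matches = []
    for a in cohort_a:
        best_b = max(cohort_b, key=lambda b: r[(a, b)])
        best_a = max(cohort_a, key=lambda a2: r[(a2, best_b)])
        if best_a == a and r[(a, best_b)] > threshold:
            matches.append((a, best_b, r[(a, best_b)]))
    return matches
```

With more than two cohorts, the same pairwise rule could be applied to every cohort pair and the matched subgroups merged into endotypes, for example as connected components of the resulting match graph.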
[0112] Step 3 – Characterizing candidate disease endotypes (Figure 3C)
[0113] Once disease endotypes are identified across multiple study cohorts, these endotypes are further characterized by associated clinical features, multiple gene modules (e.g., those related to tissue morphology, cellular function, and pathways), and responses to standard-of-care therapies, as shown in Figure 4B.
[0114] Phase 3: Assessing drug response signatures within disease endotypes
[0115] Where disease endotypes are consistently present across multiple study cohorts, establishing a link between disease endotypes and responses to treatments of interest allows hypotheses to be generated for targeted clinical trials. The hypothesis is that patients with higher activity of drug-response gene signatures are more likely to respond well to specific therapies. In the example of Figure 4B, endotype A was associated with higher drug-response gene signature activity and was predicted to be the target patient population for a specific treatment.
[0116] Phase 4: Translating Endotypes to Clinical Practice - Developing Molecular Endotype Classifiers for Targeted Therapies
[0117] In the example of Figure 4B, patients with endotype A are associated with higher activity of drug-response gene signatures and are suggested as a target population for a specific therapy of interest. Molecular classifiers can identify this target patient population based on gene expression data from disease-related tissues. As shown in Figure 6, a classifier is built using one study cohort as the training dataset (this includes parameter optimization and feature selection). The classifier is then applied to test datasets, which include other cohorts independent of the training data. The classifier's performance is evaluated using metrics such as Cohen's kappa, sensitivity, specificity, and the confusion matrix. Algorithms such as AdaBoost, elastic net, SVM, or other machine learning methods are employed for classifier development. Cross-validation and resampling methods are used to help select optimal hyperparameters and gene features for the classifier.
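For illustration only, the evaluation metrics named above can be computed for a two-class problem (target endotype versus all others) as in the following Python sketch. The function name `binary_metrics` and the label encoding are hypothetical; the formulas are the standard definitions of sensitivity, specificity, and Cohen's kappa.

```python
def binary_metrics(y_true, y_pred, positive):
    """Confusion matrix, sensitivity, specificity, and Cohen's kappa.

    y_true, y_pred: sequences of endotype labels for an independent test cohort.
    positive: the label treated as the target endotype.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n = tp + fn + fp + tn
    # Observed agreement and the agreement expected by chance.
    observed = (tp + tn) / n
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    return {
        # Rows are true classes, columns are predicted classes.
        "confusion_matrix": [[tp, fn], [fp, tn]],
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "kappa": (observed - expected) / (1 - expected),
    }
```

Cohen's kappa corrects raw accuracy for chance agreement, which can be informative when the target endotype is much rarer than the remaining patients.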
[0118] By quantifying gene expression data from patients using assay platforms such as qPCR or NanoString, a classifier can be used to assign disease endotype labels to each patient. This predictive capability enables healthcare professionals to administer the most appropriate treatment to patients based on the treatment response hypotheses developed in Phase 3 of the framework.
Other embodiments
[0119] It should be understood that although the invention has been described in conjunction with its detailed description, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the appended claims.
Claims
1. A method for determining the disease endotype of a subject suffering from a disease, the method comprising: defining, based on gene expression data of multiple target genes, one or more gene expression signatures corresponding to one or more endotypes of the disease; measuring the expression level of each of multiple genes in a biological sample from the subject; generating a gene expression signature of the subject based on the expression level of each of the multiple genes in the subject's biological sample; comparing the subject's gene expression signature with the one or more gene expression signatures corresponding to the one or more endotypes of the disease; and assigning, based on the comparison of gene expression signatures, a disease endotype to the subject suffering from the disease.
2. The method of claim 1, further comprising predicting the response to exogenous treatment based on the disease endotype assigned to the subject suffering from the disease.
3. The method according to claim 1 or 2, wherein the gene expression data is RNA-seq data.
4. The method according to any one of claims 1 to 3, wherein the gene expression data of the multiple target genes are derived from two or more previously generated datasets.
5. The method according to any one of claims 1 to 4, wherein the defining further includes the use of a machine learning algorithm.
6. The method of claim 5, wherein the machine learning algorithm further includes K-means clustering.
7. The method according to any one of claims 1 to 6, wherein K-means clustering is used to generate the gene expression signatures.
8. The method according to any one of claims 1 to 7, wherein the disease is cancer.
9. The method according to any one of claims 1 to 8, wherein the disease is an autoimmune disease.
10. The method according to any one of claims 1 to 9, wherein the biological sample is a blood sample.
11. The method according to any one of claims 1 to 10, wherein the biological sample is a urine sample.
12. A method for treating a disease in a subject, the method comprising: obtaining the expression level of each of multiple genes in a biological sample obtained from the subject; assigning, based on the expression level of each of the multiple genes in the biological sample, the subject's disease to a disease endotype selected from a set of endotypes previously determined based on gene expression data derived from multiple datasets; and administering a treatment agent to the subject, wherein assigning the subject's disease to the disease endotype indicates that the treatment agent is predicted to be effective in treating the subject's disease.
13. The method of claim 12, wherein the biological sample is a blood sample.
14. The method of claim 13, wherein the biological sample is a urine sample.
15. The method according to any one of claims 12 to 14, wherein, compared to genes not assigned to the disease endotype, the expression levels of one or more of the multiple genes in the disease endotype are elevated.
16. The method according to any one of claims 12 to 15, wherein, compared to genes not assigned to the disease endotype, the expression levels of one or more of the multiple genes in the disease endotype are reduced.
17. A computer-implemented method for determining the disease endotype of a subject suffering from a disease, the method comprising: receiving, by a computing device including a processor programmed to execute software instructions in memory, a set of gene expression data that includes the expression level of each of a plurality of genes in a biological sample obtained from the subject; applying a machine learning algorithm to generate a classification model for a classification scheme, wherein the machine learning algorithm has been trained using a stratified training-test split of the set of gene expression data; applying, by the computing device, the classification scheme to classify the set of gene expression data; generating, by the computing device, a reduced set of genes that can distinguish disease endotypes; and determining, based on the expression levels of the reduced set of genes that can distinguish disease endotypes, the disease endotype of the subject suffering from the disease.
18. The computer-implemented method of claim 17, further comprising predicting a response to exogenous treatment based on the disease endotype determined for the subject suffering from the disease.
19. The computer-implemented method of claim 18, wherein the machine learning algorithm includes feature selection.
20. The computer-implemented method of claim 19, wherein the feature selection is Forward Feature Group Selection (FFGS).
21. A computing system for determining the disease endotype of a subject suffering from a disease, the computing system comprising: a data server configured to receive gene expression data of multiple target genes and measurements of the expression level of each of multiple genes from a biological sample of the subject; a computing device communicatively connected to the data server, the computing device including an application server configured to: define, based on the gene expression data, one or more gene expression signatures corresponding to one or more endotypes of the disease; generate a gene expression signature of the subject based on the expression level of each of the multiple genes in the subject's biological sample; compare the subject's gene expression signature with the one or more gene expression signatures corresponding to the one or more endotypes of the disease; and assign, based on the comparison of gene expression signatures, a disease endotype to the subject suffering from the disease; and a display communicatively connected to the computing device and configured to display a report describing the disease endotype assigned to the subject suffering from the disease.
22. The computing system of claim 21, wherein the application server is further configured to predict the response to exogenous treatment based on the disease endotype assigned to the subject suffering from the disease.
23. The computing system according to claim 21 or 22, wherein the gene expression data is RNA-seq data.
24. The computing system according to any one of claims 21 to 23, wherein the gene expression data of the multiple target genes are derived from two or more previously generated datasets.
25. The computing system according to any one of claims 21 to 24, wherein the defining further includes the use of a machine learning algorithm.
26. The computing system of claim 25, wherein the machine learning algorithm further includes K-means clustering.
27. The computing system according to any one of claims 21 to 26, wherein K-means clustering is used to generate the gene expression signatures.
28. The computing system according to any one of claims 21 to 27, wherein the disease is cancer.
29. The computing system according to any one of claims 21 to 28, wherein the disease is an autoimmune disease.
30. The computing system according to any one of claims 21 to 29, wherein the biological sample is a blood sample.
31. The computing system according to any one of claims 21 to 30, wherein the biological sample is a urine sample.