Linking variants to phenotypes using a machine-learning model
A machine-learning model for genetic testing systems addresses the limitations of existing systems by predicting phenotype-affecting variants with improved accuracy and efficiency, reducing false discovery rates and resource consumption through a single genomic sample analysis.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: ILLUMINA INC
- Filing Date: 2025-12-03
- Publication Date: 2026-06-11
AI Technical Summary
Existing genetic testing systems are limited in their ability to link variant calls to specific phenotypes because clinical annotation of human genes is incomplete. This limitation leads to inefficient and inaccurate predictions of pathogenicity and phenotype expression, requires multiple sequencing runs and consumables, and results in high false positive rates.
A machine-learning model is trained using genomic samples and clinical data to generate variant-to-phenotype scores, processing gene-level and variant-level features to predict the impact of variant nucleotides on phenotype expression, and provides an end-to-end automated framework for identifying phenotype-affecting variants.
The model improves accuracy and efficiency in identifying phenotype-affecting variants, reducing false discovery rates and consumable resource use by processing a single genomic sample, enabling comprehensive analysis of a broader range of genes without requiring multiple sequencing runs.
Description
LINKING VARIANTS TO PHENOTYPES USING A MACHINE-LEARNING MODEL
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/758,227, entitled “LINKING VARIANTS TO PHENOTYPES USING A MACHINE-LEARNING MODEL,” filed on February 13, 2025 (IP-2865-PRV2), and U.S. Provisional Patent Application No. 63/727,862, entitled “LINKING VARIANTS TO PHENOTYPES USING A MACHINE-LEARNING MODEL,” filed on December 4, 2024 (IP-2865-PRV). Each of the aforementioned applications is hereby incorporated by reference in its entirety.
BACKGROUND
[0002] In recent years, biotechnology firms and computer science institutions have improved hardware and software for genetic testing systems to determine variant calls and generate genetic diagnostics from nucleotide sequences from samples. In particular, some existing genetic testing systems generate variant calls from nucleotide reads of a genomic sample and / or run diagnostics on variant calls for a variety of purposes. For example, some existing systems perform a diagnostic application (e.g., a cancer screening assay) to test a genomic sample for genetic cancer markers by detecting, from within nucleobase calls of reads sequenced from a genomic sample, specific variants known to cause (or otherwise indicative of) certain types of cancers. But such genetic cancer markers are often limited to clinically validated variants for which a consensus of data indicates the variants cause or increase the likelihood of a particular form of cancer. Some existing genetic testing systems perform other diagnostics as well, such as deoxyribonucleic acid (DNA) sequencing to detect genetic variants known to cause (or otherwise associated with) cystic fibrosis, Huntington’s disease, von Willebrand disease, and other genetic conditions (or propensities for developing genetic conditions) or for determining other genetic traits.
[0003] Despite these recent advances, existing genetic testing systems continue to exhibit technical drawbacks or disadvantages. For example, many existing genetic testing systems are limited in their scope and utility because only a subset of human genes have been annotated with their corresponding clinical impact or phenotype based on a consensus of medical data. Indeed, many existing systems can only perform genetic diagnostics on genes that have been clinically annotated with corresponding phenotypes. Because approximately 5,000 - 6,000 genes of the approximately 20,000 human genes have been linked to particular phenotypes in genetic phenotype databases, such as Online Mendelian Inheritance in Man (OMIM) or Orphanet Rare Disease Ontology (ORDO), existing systems cannot link variant calls or reference calls for most human genes in a sample to a particular phenotype. While some existing computational models can generate and utilize biological network data to capture relationships among human genes and their homologs beyond clinical annotations (e.g., by quantifying pathogenicity impact of genetic variants), these models nevertheless struggle (or fail entirely) to link the genetic relationships or the measures of pathogenicity to particular diseases (or other phenotypes). To leverage such computational models with variant calls from a sample, however, existing genetic testing systems often require substantial clinical analysis on a case-by-case basis while still suffering from low sensitivity in results due to high false positive rates.
Attorney Docket No. IP-2865-PCT, PCT Patent Application
[0004] In addition and in part due to the limits of identifying variants that affect phenotype expression, some existing genetic testing systems run isolated gene panels and rely on disconnected computational models that result in inefficient use of computational and consumable resources on a sequencing device. For example, targeted gene panels or assays for such genetic testing systems are generally limited to a set of targeted genes or a single gene to identify potential phenotype-affecting variants, sometimes requiring that multiple targeted gene panels and / or assays be implemented when attempting to detect variants outside the scope of a given targeted gene panel or assay. A round of multiple sequencing runs on a sequencing device for different, targeted gene panels can, therefore, consume unnecessary computer processing, time, and consumables (e.g., biochemical reagents) of the sequencing device. To exacerbate the limits of targeted gene panels, some existing genetic testing systems would need to rely on disconnected and disparate models or genetic phenotype databases to determine whether the variants detected from the reads of a sample affect the phenotype expressed by the organism from which the sample was taken.
[0005] Given the limits of genetic phenotype databases and targeted gene panels of existing genetic testing systems, scores and metrics from existing models that predict pathogenicity, aberrant splicing, or other measures of protein function or gene expression have more limited applicability. For instance, some existing pathogenicity prediction models generate predictions that estimate a degree to which amino-acid variants or variant nucleotides are benign or pathogenic — but without a target phenotype for measurement of benign-ness or pathogenicity. Such pathogenicity predictions can indicate whether an amino-acid variant or a variant nucleotide is likely to cause some diseases, such as certain cancers, developmental disorders, or heart conditions. But pathogenicity prediction models cannot provide targeted predictions for whether amino-acid variants or variant nucleotides corresponding to approximately 15,000 human genes cause a specific disease or phenotype.
[0006] Due at least in part to their inflexible nature and lack of data concerning genes and phenotypes, some existing genetic testing systems exhibit limitations in application and / or inaccuracies / inabilities in linking pathogenic variants to particular diseases. For example, because some existing genetic testing systems or pathogenicity prediction models are limited to determining pathogenicity and / or phenotypes for human genes that have been clinically annotated, some systems or models inaccurately determine (or cannot determine) correlations between variant nucleotides in a genomic sample and diseases. Indeed, without determining whether a particular gene impacted by a pathogenic variant is associated with phenotypes observed in organisms (e.g., human patients) with a particular disease or other particular phenotype, existing systems or models cannot accurately determine relationships between the variant and the particular disease or other phenotype.
[0007] These, along with additional problems and issues, exist in existing sequencing systems.
SUMMARY
[0008] This disclosure describes embodiments of methods, non-transitory computer-readable media, and systems that utilize a machine-learning model to generate sample-specific predictions that variant genes or variant nucleotides affect or cause one or more of the sample’s phenotypes. In particular, the disclosed systems can utilize a trained variant-to-phenotype machine-learning model to process (a) gene-level features, (b) gene-to-phenotype scores from a gene embedding neural network, or (c) other features to generate variant-to-phenotype scores indicating respective probabilities of variant nucleotides affecting expression of phenotypes in an organism from which a genomic sample has been extracted. To train such a variant-to-phenotype machine-learning model, the disclosed systems can implement two training stages respectively utilizing (i) genomic samples and clinical data from a cohort of organisms diagnosed with genetic diseases or phenotypes and (ii) genomic samples of a cohort of organisms not diagnosed with one or more target genetic diseases. Further, the disclosed systems can determine whether a sample’s variant nucleotides affect expression of the sample’s phenotype(s) based on the variant-to-phenotype scores generated for the variant nucleotides.
[0009] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] The detailed description refers to the drawings briefly described below.
[0011] FIG. 1 illustrates an environment in which a variant-to-phenotype prediction system can operate in accordance with one or more embodiments of the present disclosure.
[0012] FIG. 2 illustrates an overview of the variant-to-phenotype prediction system generating a patient genetic report for a genomic sample in accordance with one or more embodiments of the present disclosure.
[0013] FIG. 3 illustrates the variant-to-phenotype prediction system generating variant-to-phenotype scores and identifying variant nucleotides affecting expression of one or more phenotypes (referred to herein as “phenotype-affecting variant nucleotides”) for a genomic sample in accordance with one or more embodiments of the present disclosure.
[0014] FIG. 4 illustrates an overview of training a gene embedding neural network to generate a gene-to-phenotype matrix in accordance with one or more embodiments of the present disclosure.
[0015] FIG. 5 illustrates the variant-to-phenotype prediction system determining gene-to-phenotype scores for individual genes in relation to a set of phenotype labels in accordance with one or more embodiments of the present disclosure.
[0016] FIG. 6 illustrates the variant-to-phenotype prediction system identifying a phenotype-affecting variant nucleotide based on a ranking of variant-to-phenotype scores in accordance with one or more embodiments of the present disclosure.
[0017] FIG. 7 illustrates the variant-to-phenotype prediction system providing a chatbot user interface for interacting with a patient genetic report and clinical information in accordance with one or more embodiments of the present disclosure.
[0018] FIG. 8 illustrates the variant-to-phenotype prediction system training a variant-to-phenotype machine-learning model using a two-stage training process in accordance with one or more embodiments of the present disclosure.
[0019] FIGS. 9A-9B illustrate comparative experimental results of identifying diagnostic variants associated with clinical phenotypes utilizing (i) an existing sequencing system and (ii) the variant-to-phenotype prediction system in accordance with one or more embodiments of the present disclosure.
[0020] FIG. 10 illustrates comparative experimental results of identifying diagnostic variants associated with clinical phenotypes utilizing the variant-to-phenotype prediction system to analyze (i) individual genomic samples and (ii) individual genomic samples in comparison with genomic samples of related individuals in accordance with one or more embodiments of the present disclosure.
[0021] FIG. 11 illustrates an overview of determining a false discovery rate of identifying phenotype-affecting variant nucleotides associated with clinical phenotypes utilizing the variant-to-phenotype prediction system in accordance with one or more embodiments of the present disclosure.
[0022] FIGS. 12 and 13 illustrate comparative experimental results of identifying phenotype-affecting variant nucleotides associated with human diseases utilizing (i) an existing genetic testing system and (ii) the variant-to-phenotype prediction system in accordance with one or more embodiments of the present disclosure.
[0023] FIG. 14 illustrates a flowchart of a series of acts for generating variant-to-phenotype scores utilizing a variant-to-phenotype machine-learning model in accordance with one or more embodiments of the present disclosure.
[0024] FIG. 15 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.
DETAILED DESCRIPTION
[0025] This disclosure describes embodiments of a variant-to-phenotype prediction system that uses one or more machine-learning models to generate variant-to-phenotype scores indicating whether a genomic sample’s variant nucleotides affect expression of one or more phenotypes associated with an organism from which the genomic sample was extracted. By processing gene-level features for genes in a reference genome, gene-to-phenotype scores for such genes, and / or variant-level features for the genomic sample’s variant nucleotides, a disclosed machine-learning model can generate variant-to-phenotype scores indicating respective probabilities of variant nucleotides from such genes affecting phenotype expression. The disclosed variant-to-phenotype prediction system can further generate a patient genetic report for the genomic sample that includes such variant-to-phenotype scores, corresponding variants, and / or clinical information concerning the subject organism’s sequenced genome or targeted genes. In some cases, the disclosed variant-to-phenotype prediction system can also leverage the patient genetic report to provide a chatbot user interface through which a user can input questions and receive answers concerning a subject organism’s sequenced genome or targeted genes.
[0026] To determine variants that affect phenotype expression for a subject organism (e.g., a patient), in some embodiments, the disclosed variant-to-phenotype prediction system (i) processes variant nucleotides identified from a sequencing data file (e.g., a variant call file) for the organism’s genomic sample and corresponding to a set of genes given phenotype labels, (ii) accesses sample-independent gene-level features for the identified genes, and (iii) processes the gene-level features and, in some cases, additional features and / or scores through a trained variant-to-phenotype machine-learning model. In addition to the gene-level features, in some embodiments, the variant-to-phenotype prediction system identifies variant genes from the set of genes given phenotype labels from a disease panel and further processes gene-to-phenotype scores (e.g., generated by a gene embedding neural network) for the set of genes, and / or variant-level features for the identified variant genes. By processing such features as described herein, the variant-to-phenotype machine-learning model can generate variant-to-phenotype scores indicating whether the genomic sample’s variant nucleotides affect expression of the organism’s phenotypes. Such variant-to-phenotype scores can be output in a gene-to-phenotype matrix that indicates a ranking of variant nucleotides most likely to affect phenotype expression of the subject organism. In some cases, the variant-to-phenotype prediction system also considers inheritance patterns and / or family analysis to rank or determine variant nucleotides affecting expression of one or more phenotypes (referred to herein as “phenotype-affecting variant nucleotides”) for the patient genetic report.
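As a purely illustrative sketch, the ranking of variant nucleotides by variant-to-phenotype score described in this paragraph could be expressed as follows. The `ScoredVariant` type, the `rank_variants` helper, and all identifiers and score values are hypothetical placeholders and are not taken from this disclosure; the scores are simply assumed to be probabilities already produced by a trained model.

```python
from typing import NamedTuple, Optional


class ScoredVariant(NamedTuple):
    variant_id: str  # placeholder identifier, not a real variant
    gene: str        # placeholder gene symbol
    score: float     # variant-to-phenotype score, assumed to lie in [0, 1]


def rank_variants(scored: list[ScoredVariant],
                  top_k: Optional[int] = None) -> list[ScoredVariant]:
    """Order variants from most to least likely to affect phenotype expression."""
    ranked = sorted(scored, key=lambda v: v.score, reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical scores for three variants of a single genomic sample.
example = [
    ScoredVariant("var_1", "GENE_A", 0.12),
    ScoredVariant("var_2", "GENE_B", 0.87),
    ScoredVariant("var_3", "GENE_C", 0.45),
]
ranking = rank_variants(example)
```

In a sketch like this, the top-ranked entries would be the candidates for phenotype-affecting variant nucleotides; any adjustment for inheritance patterns or family analysis is left out.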
[0027] As mentioned, in addition to such a variant-to-phenotype machine-learning model, the variant-to-phenotype prediction system can utilize one or more machine-learning models to generate a patient genetic report that includes such variant-to-phenotype scores and other information concerning an organism’s sequenced genome or targeted genes. Based on a listing of phenotype labels (e.g., Human Phenotype Ontology (HPO) terms, International Classification of Diseases (ICD) Clinical Modification (CM)) for an organism (e.g., human patient), an accompanying sequencing data file (e.g., a variant call file), and the variant-to-phenotype scores, the variant-to-phenotype prediction system can output an automated patient genetic report indicating phenotype-affecting variant nucleotides or phenotype-affecting genes for the organism. In addition to the patient genetic report and its contents, in some embodiments, the variant-to-phenotype prediction system utilizes a machine learning-based conversational agent (e.g., a large language model configured and trained to provide data inputs to and receive data outputs from a conversational agent) to provide, via a chatbot user interface, responses to user inputs (e.g., queries) with respect to the patient genetic report and clinical information related to the organism’s phenotypes.
[0028] In addition or in the alternative to identifying phenotype-affecting variant nucleotides, in one or more embodiments, the variant-to-phenotype prediction system provides metrics measuring the confidence in output results by quantifying the variant-to-phenotype machine-learning model’s (or other target model’s) false discovery rate (FDR) using a cohort of organisms diagnosed with genetic diseases or phenotypes (e.g., a rare disease cohort of input samples) to identify a true positive rate (TPR) and using a cohort of organisms not diagnosed with a target genetic disease or phenotype (e.g., a healthy cohort of input samples) to identify a false positive rate (FPR). The disclosed method for quantifying model sensitivity can, therefore, be applied to other models for their evaluation and comparison with the results of the disclosed variant-to-phenotype prediction system.
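One way to picture how a TPR from a diagnosed cohort and an FPR from an undiagnosed cohort might combine into an FDR is the sketch below. It is not the disclosed procedure: the `prior` parameter (the assumed fraction of analyzed samples that truly carry a phenotype-affecting variant), the `rate` and `false_discovery_rate` helpers, and the per-sample hit flags are all assumptions made for illustration. The formula used is the standard definition FDR = FP / (FP + TP) with expected counts weighted by the prior.

```python
def rate(flags):
    """Fraction of samples in a cohort for which the model reported a hit."""
    flags = list(flags)
    if not flags:
        raise ValueError("cohort must contain at least one sample")
    return sum(1 for f in flags if f) / len(flags)


def false_discovery_rate(tpr, fpr, prior):
    """Expected FDR = FP / (FP + TP), weighting each cohort's rate by `prior`.

    `prior` is an assumed fraction of samples with a true phenotype-affecting
    variant; it is an illustrative parameter, not a value from the disclosure.
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must be between 0 and 1")
    expected_tp = prior * tpr
    expected_fp = (1.0 - prior) * fpr
    total = expected_tp + expected_fp
    return 0.0 if total == 0 else expected_fp / total


# Hypothetical per-sample outcomes (True = model reported a variant).
diagnosed_hits = [True, True, False, True]    # diagnosed cohort -> TPR
healthy_hits = [False, False, True, False]    # undiagnosed cohort -> FPR

tpr = rate(diagnosed_hits)
fpr = rate(healthy_hits)
fdr = false_discovery_rate(tpr, fpr, prior=0.5)
```

Because the two rates are measured on separate cohorts, the resulting FDR depends on the assumed prior; the same helper could be applied to another model's hit flags for comparison.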
[0029] As mentioned, the variant-to-phenotype prediction system provides various technical advantages over existing genetic testing systems. For instance, the variant-to-phenotype prediction system can identify variant nucleotides that affect phenotype expression with improved accuracy by utilizing a trained variant-to-phenotype machine-learning model to process a variety of features provided by a variety of databases. Such features include, for example, gene-level features, gene-to-phenotype scores, and / or variant-level features processed by the variant-to-phenotype machine-learning model; and analytical resources include, for example, various databases of genetic analysis and phenotype labels. As demonstrated by the experimental results presented herein (e.g., FIGS. 9A-9B and 12-13), the variant-to-phenotype scores generated by the improved machine-learning model exhibit improved performance in identifying variant nucleotides that affect phenotype expression relative to previous and existing models. By identifying variant nucleotides of a genomic sample from genes given phenotype labels and processing gene-level features and gene-to-phenotype scores, the variant-to-phenotype prediction system can use a first-of-its-kind machine-learning model to predict the effects of a sample’s variant nucleotides on gene function (e.g., phenotype expression caused by variant nucleotides within certain genes) and thereby exhibit increased diagnostic yield at a decreased false discovery rate relative to existing genetic testing systems.
[0030] Furthermore, by utilizing a variant-to-phenotype machine-learning model to process and analyze patterns across the aforementioned variety of features for large portions of a genome or a complete genome, the variant-to-phenotype prediction system can evaluate variant nucleotides and / or identify diagnostic variants with significantly increased accuracy and efficiency over conventional methods implemented by human experts in genomic analysis. Indeed, the human mind, including trained intellects of such experts in genomic analysis, could not practically or possibly determine the claimed variant-to-phenotype scores for a set of genes spanning a primate, mammalian, or vertebrate genome (e.g., a human genome); a primate, mammalian, or vertebrate exome (e.g., a human exome); a single chromosome from such a vertebrate genome (e.g., a single human chromosome); or any other significant portion of such a vertebrate genome. At least in part due to this comparatively greater breadth of genome analysis and the approximately 14,000 - 15,000 human genes that have not been linked to particular phenotypes, in some cases, the variant-to-phenotype prediction system can accurately identify diagnostic variants not previously annotated by genomic analysis experts and / or clinical studies, as further demonstrated by the experimental results presented herein (e.g., FIGS. 9A-9B and 12-13), with a breadth, speed, accuracy, and / or complexity that humans cannot practically implement.
[0031] Beyond improved accuracy in identifying phenotype-affecting variant nucleotides, the variant-to-phenotype prediction system exhibits improved computing efficiency and saves consumable resources relative to existing genetic testing systems. As indicated above, some existing genetic testing systems implement targeted gene panels / assays with limited applicability to phenotype expression, thereby requiring multiple sequencing runs and relatively more consumables (e.g., biochemical reagents) to generate data for relevant variant calls for a single sample and phenotype analysis of multiple samples from the same subject organism (e.g., human) or, alternatively, a larger sample volume (e.g., blood or sputum) from the subject organism. In contrast to such existing systems, the variant-to-phenotype prediction system can generate variant-to-phenotype scores for multiple variant nucleotides based on data related to a single genomic sample. Rather than requiring multiple genomic samples from an organism to perform multiple targeted gene panels of existing genetic testing systems, the variant-to-phenotype prediction system can process features derived from variant genes of a single sample from the organism to generate or access variant calls from a single whole genome sequencing (WGS) or whole exome sequencing (WES) run. By identifying variant nucleotides that affect expression of an organism’s phenotypes and processing gene-level features and gene-to-phenotype scores, without requiring multiple targeted gene panels / assays from multiple or larger-volume genomic samples extracted from the same organism, the disclosed variant-to-phenotype prediction system saves computer processing, time, and consumables relative to existing genetic testing systems.
[0032] In addition to improved computing and consumable efficiency, in some embodiments, the variant-to-phenotype prediction system improves efficiency by providing an end-to-end automated framework for identifying phenotype-affecting variant nucleotides for a target organism / patient from a single genomic sample. As indicated above, some existing genetic testing systems would need to rely on disconnected and disparate models or genetic phenotype databases to determine whether variant nucleotides detected from a sample’s nucleotide reads affect phenotype expression. By contrast, in some embodiments, the variant-to-phenotype prediction system provides a single pipeline that supports an end-to-end process of (i) sequencing a genomic sample at a sequencing device, (ii) determining variant-to-phenotype scores for variants identified by analysis of the genomic sample’s nucleotide reads generated by the sequencing device, and (iii) identifying one or more of the genomic sample’s variant nucleotides that affect phenotype expression based on variant-to-phenotype scores output by a variant-to-phenotype machine-learning model.
[0033] As noted above, the variant-to-phenotype prediction system improves the accuracy of identifying phenotype-affecting variant nucleotides. To improve such accuracy, in some embodiments, the variant-to-phenotype prediction system performs a two-stage training of a variant-to-phenotype machine-learning model (e.g., as described in detail below in relation to FIG. 8). In the first stage, the variant-to-phenotype prediction system utilizes genomic samples and clinical data from a cohort of organisms diagnosed with genetic diseases or phenotypes (e.g., rare disease genome samples) to generate verifiable variant-to-phenotype scores. In the second stage, the variant-to-phenotype prediction system utilizes genomic samples of a cohort of organisms not diagnosed with a target genetic disease or phenotype (e.g., healthy genome samples) to determine variant-to-phenotype-specific score distributions for further training of the model and to attenuate the occurrence of false positives. By performing such a two-stage training of the variant-to-phenotype model to leverage a cohort of organisms not diagnosed with a target genetic disease or phenotype (e.g., healthy genome samples) from the general population for learning to identify disease-causing variants, the variant-to-phenotype prediction system increases the accuracy of the output variant-to-phenotype scores.
[0034] On top of improving computing efficiency and phenotype-affecting variant-identifying accuracy relative to existing genetic testing systems, the variant-to-phenotype prediction system improves the efficiency and accuracy of identifying phenotype-affecting variant nucleotides relative to alternative possible models. In particular, the variant-to-phenotype prediction system can process the aforementioned gene-to-phenotype scores with increased efficiency and accuracy by combining gene-specific values of a gene-to-phenotype matrix (e.g., as output by a gene embedding neural network) into a list of gene-to-phenotype scores for individual genes. By utilizing dissimilarity weights to intelligently combine gene-specific values across a gene-to-phenotype matrix, the variant-to-phenotype prediction system can generate consolidated gene-to-phenotype scores that account for redundancy of phenotype labels (e.g., HPO terms, ICD-10-CM Data) and assign higher gene-to-phenotype scores to unique phenotypes for a given genomic sample relative to a gene-agnostic approach to the variant-to-phenotype machine-learning model processing such gene-to-phenotype scores.
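The idea of down-weighting redundant phenotype labels when collapsing a gene-to-phenotype matrix can be sketched as follows. This paragraph does not state how the dissimilarity weights are computed, so the weighting rule here (inverse of each label's total similarity to all labels, normalized to sum to one) is only an assumed stand-in. The function names, the pairwise similarity values, and the matrix entries are likewise hypothetical.

```python
def dissimilarity_weights(similarity):
    """Weight each phenotype label by the inverse of its total similarity.

    `similarity[i][j]` is an assumed pairwise similarity in [0, 1] between
    phenotype labels i and j, with 1.0 on the diagonal. Labels that resemble
    many other labels (redundant labels) receive smaller weights.
    """
    raw = [1.0 / sum(row) for row in similarity]
    total = sum(raw)
    return [w / total for w in raw]


def consolidate(gene_to_phenotype, weights):
    """Collapse each gene's per-phenotype values into one weighted score."""
    return {
        gene: sum(w * v for w, v in zip(weights, values))
        for gene, values in gene_to_phenotype.items()
    }


# Three phenotype labels: the first two are treated as fully redundant,
# the third as unique (illustrative similarity values only).
similarity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical gene-to-phenotype matrix: one row of values per gene.
matrix = {
    "GENE_A": [0.8, 0.8, 0.2],
    "GENE_B": [0.2, 0.2, 0.9],
}

weights = dissimilarity_weights(similarity)
scores = consolidate(matrix, weights)
```

With these placeholder inputs, an unweighted mean would favor GENE_A because it matches the two redundant labels, whereas the weighted combination favors GENE_B, which matches the unique label.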
[0035] As suggested by the foregoing discussion, this disclosure utilizes a variety of terms to describe features and benefits of the variant-to-phenotype prediction system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used in this disclosure, for instance, the term “genomic sample” (or “sample”) refers to a specimen, culture, or the like that is suspected of including a target nucleic acid. In some embodiments, the genomic sample comprises DNA, ribonucleic acid (RNA), peptide nucleic acid (PNA), locked nucleic acid (LNA), or chimeric or hybrid forms of nucleic acids as targets. The genomic sample can likewise include any biological, clinical, surgical, agricultural-atmospheric, or aquatic-based specimen containing one or more nucleic acids. A genomic sample also includes any isolated or extracted nucleic acid sample from an organism, such as a genomic DNA, fresh-frozen, or formalin-fixed paraffin-embedded nucleic acid specimen. In some cases, accordingly, a genomic sample can include a full genome or partial genome that is isolated or extracted (e.g., in whole or in part by a kit) from an organism and that is prepared to undergo sequencing or an assay in a sequencing device. A genomic sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or a sample from a single source that contains two distinct forms of genetic material, such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.
[0036] The genomic sample can include high molecular weight material, such as genomic DNA (gDNA). The genomic sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another implementation, low molecular weight material includes enzymatically or mechanically fragmented DNA. The genomic sample can include cell-free circulating DNA. In some implementations, the genomic sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some implementations, the sample can be an epidemiological, agricultural, forensic, or pathogenic sample. In some implementations, the genomic sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another implementation, the genomic sample can include nucleic acid molecules obtained from a non-mammalian source such as a plant, bacteria, virus, or fungus. In some implementations, the source of the nucleic acid molecules may be an archived or extinct sample or species.
[0037] In one or more embodiments, the variant-to-phenotype prediction system accesses and / or stores genomic data for a target genomic sample within one or more sequencing data files. As used herein, the term “sequencing data file” refers to a digital file that includes genetic sequencing information concerning genotype calls or nucleotide reads generated by one or more genomic sequencing procedures. Such sequencing information may include, for example, nucleotide reads, alignment and mapping information, nucleotide reads at one or more genomic coordinates, genotype calls, sequencing metrics, and so forth.
[0038] In some embodiments, for example, the variant-to-phenotype prediction system receives (or otherwise identifies) variant nucleotides and related metrics via a variant call format (VCF) file, or variant call file. As used herein, the term “variant call file” refers to a particular sequencing data file that comprises a text file format that contains information about variants at specific genomic coordinates. For instance, a variant call file can include meta-information lines, a header line, and data lines where each data line contains information about a single genotype call (e.g., a single variant).
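The three kinds of lines in a variant call file can be illustrated with a minimal reader. The sketch below relies only on the general VCF convention that meta-information lines begin with `##`, the header line begins with `#`, and data lines are tab-separated. The `parse_vcf_text` helper and the single example record are hypothetical, and a production pipeline would more likely use an established VCF library than hand-written parsing.

```python
def parse_vcf_text(text):
    """Split VCF text into meta-information lines, header columns, and records."""
    meta, header, records = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            meta.append(line[2:])           # meta-information line
        elif line.startswith("#"):
            header = line[1:].split("\t")   # header line naming the columns
        else:
            if header is None:
                raise ValueError("data line found before the header line")
            # Each data line describes a single genotype call (one variant).
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


# Minimal made-up example: one meta line, the header line, one data line.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\t.\n"
)
meta, header, records = parse_vcf_text(example_vcf)
```

Each parsed record maps column names such as `CHROM`, `POS`, `REF`, and `ALT` to the values for one variant, which is the level at which variant nucleotides would be handed to downstream scoring.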
[0039] As further used herein, the term “variant nucleotide” (or sometimes simply “variant”) refers to a nucleotide within a sequence that varies from a reference nucleotide at a corresponding genomic coordinate. For example, a variant nucleotide includes a variation (e.g., deletion, insertion, translocation, inversion, or some other variation) in an organism’s chromosome or a variation to the nucleotide sequences of the organism’s chromosome. Relatedly, as used herein, the term “variant gene” refers to a gene within a genomic sample that includes at least one variant nucleotide relative to a reference sequence or reference genome.
[0040] As used herein, the term “reference genome” refers to a digital nucleic acid sequence assembled as a representative example (or representative examples) of genes and other genetic sequences of an organism. Regardless of the sequence length, in some cases, a reference genome represents an example set of genes or a set of nucleic acid sequences in a digital nucleic acid sequence determined by scientists as representative of an organism of a particular species. For example, a linear human reference genome may be GRCh38 or other versions of reference genomes from the Genome Reference Consortium. As noted above, in some cases, a reference genome includes multi-base codes. As a further example, a reference genome may include a graph reference genome that includes both a linear reference genome and paths representing nucleic acid sequences from ancestral haplotypes, such as Illumina DRAGEN Graph Reference Genome hg19.
[0041] Further, as used herein, the term “phenotype” refers to an observable characteristic or trait of an organism resulting from the interaction of the organism’s genotype with the environment. For example, phenotypes can include physical attributes (e.g., height, eye color, blood type), biochemical characteristics (e.g., blood glucose levels, enzyme activity levels, hormone concentrations, blood lipid profiles), or behavioral characteristics (e.g., sleep patterns, social tendencies, habitual predispositions). In some cases, a phenotype includes a particular genetic disease or condition (e.g., cystic fibrosis, Huntington’s disease, von Willebrand disease) or atypical symptom thereof.
[0042] Relatedly, as used herein, the term “phenotype label” refers to a given label for a particular phenotype (e.g., a clinical label as determined by a clinician and / or a consensus of systems and / or experts). For example, a phenotype label can include a given label from a genetic phenotype database, such as but not limited to standardized labels from the Online Mendelian Inheritance in Man (OMIM); the Orphanet Rare Disease Ontology (ORDO); the Human Phenotype Ontology (HPO); or the International Classification of Diseases (ICD) Clinical Modification (CM) (e.g., Tenth Revision known as ICD-10-CM).
[0043] Also, as used herein, the term “disease panel” refers to a curated listing of genes linked (e.g., according to results of clinical studies) to specific diseases or phenotypic traits. For example, a disease panel for a particular organism can include a set of genes given labels for one or more phenotypes determined for the particular organism. In some cases, the set of genes listed within a disease panel can be identified by clinical research, via an imputation model, or can otherwise be selected for the respective phenotypes to be analyzed according to the present disclosure (e.g., as described below in relation to FIG. 3). Relatedly, as used herein, the term “disease panel imputation model” refers to a model configured to identify one or more genes associated with a particular set of one or more phenotypes. In some embodiments, for example, the variant-to-phenotype prediction system utilizes a disease panel imputation model to determine a set of target genes based on one or more phenotype labels determined for a subject organism (e.g., as described below in relation to FIG. 6).
[0044] As used herein, the term “gene-level feature” refers to a feature or a metric (or some other data) that measures, quantifies, or compares a particular gene from a reference genome without reference to expression of the particular gene. Accordingly, in some cases, gene-level features include metrics that are independent of individual genomic samples. Examples of such gene-level features include, but are not limited to, a probability that a gene is loss-of-function tolerant, a length of a gene (e.g., measured in nucleobases) within a reference genome or exome, an average level of messenger ribonucleic acid (mRNA) expression of a gene across different tissues, or a probability of a gene being recessive for inheritance. Further examples of gene-level features are described below with respect to FIG. 3.
[0045] As used herein, the term “variant-level feature” refers to a feature or a metric (or some other data) that measures, quantifies, or compares a variant nucleotide with respect to other variant nucleotides or reference nucleotides. Accordingly, in some cases, variant-level features include metrics that are specific to individual genomic samples, specific to individual variant nucleotide(s), or specific to a set of genomic samples that share or exhibit the same variant nucleotide(s). Examples of such variant-level features include, but are not limited to, a pathogenicity score, a splice-site score, an allele frequency, a de novo status of a variant nucleotide as a private variant, a genotype for a variant gene of variant genes of a genomic sample, or a loss-of-function status indicating that a variant nucleotide reduces or destroys a protein function. Further examples of variant-level features are described below with respect to FIG. 3.
[0046] As suggested above, the variant-to-phenotype prediction system can utilize one or more machine learning models to map variants of an organism to phenotypes exhibited by the organism. As used herein, the term “machine-learning model” refers to a computer algorithm or a collection of computer algorithms that automatically improve for a particular task through experience based on use of data. For example, a machine-learning model can utilize one or more learning techniques to improve in accuracy and / or effectiveness. Example machine-learning models include various types of decision trees, logistic regressions, linear regressions, random forests, support vector machines, Bayesian networks, or neural networks.
[0047] Relatedly, as used herein, the term “neural network” refers to a machine-learning model that can be trained and / or tuned based on inputs to determine classifications, metrics, or approximate unknown functions. For example, a neural network includes a model of interconnected artificial neurons (e.g., organized in layers) that communicate and learn to approximate complex functions and generate outputs (e.g., gene embeddings and / or gene-to-phenotype scores) based on a plurality of inputs provided to the neural network. In some cases, a neural network refers to an algorithm (or set of algorithms) that implements deep learning techniques to model high-level abstractions in data. For example, a neural network can include a convolutional neural network, a recurrent neural network (e.g., an LSTM), a graph neural network, a self-attention transformer neural network, or a generative adversarial neural network.
[0048] Along these lines, as used herein, the term “gene embedding” refers to a representation of a gene in a graphical space as generated, encoded, or extracted from a gene-to-gene graph. For example, a gene embedding includes a latent vector representation of a portion of a gene-to-gene graph. In some cases, a gene embedding includes a latent vector encoding features of a gene and its relationship to other genes from a gene-to-gene graph. For instance, a gene embedding includes or refers to an encoding of gene-level features, including or in addition to a gene’s name, from a feature vector, feature tensor, or other feature representation. In some cases, a gene embedding comprises a concatenation of one or more Boolean values indicating membership of an embedded gene within respective gene sets / groupings (e.g., where 0 indicates non-membership in a set / group and 1 indicates membership in the set / group), such as the gene sets in the MSigDB database. Alternatively, in some embodiments, a gene embedding is derived or generated from a concatenation of one or more Boolean values indicating membership of an embedded gene within respective gene sets / groupings (e.g., where 0 indicates non-membership in a set / group and 1 indicates membership in the set / group), such as the gene sets in the MSigDB database.
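By way of illustration only, the Boolean membership encoding mentioned above can be sketched in Python as follows. The gene set names and memberships below are hypothetical toy values (they are not actual MSigDB gene sets), and `membership_embedding` is an assumed helper name rather than a function of the disclosed system.

```python
from typing import Dict, List, Set

# Hypothetical toy gene sets; real gene sets would come from a curated database.
GENE_SETS: Dict[str, Set[str]] = {
    "ION_TRANSPORT": {"CFTR", "SCN5A"},
    "DNA_REPAIR": {"BRCA1", "BRCA2"},
    "COAGULATION": {"VWF"},
}

# A fixed ordering so that every gene's vector uses the same positions.
SET_ORDER: List[str] = sorted(GENE_SETS)


def membership_embedding(gene: str) -> List[int]:
    """Concatenate Boolean values: 1 for membership in a set, 0 otherwise."""
    return [1 if gene in GENE_SETS[name] else 0 for name in SET_ORDER]


cftr_vector = membership_embedding("CFTR")
brca1_vector = membership_embedding("BRCA1")
```

Under these assumptions, each position in the resulting vector corresponds to one gene set, so two genes sharing many sets receive similar vectors. A learned latent embedding, as described elsewhere in this disclosure, could be derived from such a vector rather than consisting of it.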
[0049] Relatedly, a “gene-to-gene graph” refers to a graph that maps genes (e.g., as nodes) to other genes using edges between gene nodes. In some cases, a gene-to-gene graph is multidimensional (e.g., three-dimensional), such as a gene co-expression graph (GTEx), a protein-protein interaction graph (STRING), a sequence similarity graph, or a genetic interactions graph (CRISPRi gene pairs).
[0050] By contrast, the term “gene embedding neural network” refers to a neural network that generates gene embeddings and / or gene-to-phenotype scores. For example, a gene embedding neural network includes a graph neural network architecture with parameters trained using a two-stage training process to encode genetic relationships in gene embeddings and to further incorporate phenotype labels to predict associations or correspondences of encoded genes to the phenotype labels (e.g., in the form of gene-to-phenotype scores). An example of a gene embedding neural network is described below with respect to FIG. 4.
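One possible in-memory representation of such a graph, offered as a sketch under stated assumptions, stores genes as nodes and keeps a separate weight for each kind of relationship on an edge. The `GeneGraph` class, the edge kind labels, and the example weights are hypothetical and chosen only to illustrate the idea of several relationship types sharing one set of gene nodes.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class GeneGraph:
    """Undirected gene-to-gene graph with typed, weighted edges (illustrative)."""

    def __init__(self) -> None:
        # gene -> neighbor gene -> edge kind -> weight
        self._adj: Dict[str, Dict[str, Dict[str, float]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def add_edge(self, a: str, b: str, kind: str, weight: float = 1.0) -> None:
        """Record an edge of the given kind in both directions."""
        self._adj[a][b][kind] = weight
        self._adj[b][a][kind] = weight

    def neighbors(self, gene: str, kind: Optional[str] = None) -> List[str]:
        """Return neighbors of a gene, optionally limited to one edge kind."""
        if gene not in self._adj:
            return []
        return sorted(
            other
            for other, kinds in self._adj[gene].items()
            if kind is None or kind in kinds
        )


# Hypothetical edges and weights for illustration only.
graph = GeneGraph()
graph.add_edge("GENE_A", "GENE_B", kind="co_expression", weight=0.7)
graph.add_edge("GENE_A", "GENE_C", kind="protein_interaction", weight=0.9)
graph.add_edge("GENE_A", "GENE_B", kind="sequence_similarity", weight=0.4)
```

Keeping the edge kind as an explicit key is one way to let a downstream model treat co-expression, protein interaction, and sequence similarity as distinct relationship types.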
[0051] Indeed, a “gene-to-phenotype score” refers to a score that indicates or reflects a probability of a gene corresponding to, being associated with, or causing the expression of a phenotype. For instance, a gene-to-phenotype score indicates a strength of a relationship between a given gene and a given phenotype, such as a rare disease or some other clinically annotated phenotype (e.g., a phenotype with a corresponding clinical label as determined by a clinician and / or by a consensus of systems and / or experts).
[0052] In some cases, the variant-to-phenotype prediction system uses a gene embedding neural network to generate a gene-to-phenotype matrix of gene-to-phenotype scores. As used herein, a “gene-to-phenotype matrix” refers to a matrix comprising a set of gene-to-phenotype scores for a set of genes. For example, a gene-to-phenotype matrix maps genes to phenotypes by indicating (e.g., using numbers or other visual markers), for each of a set of genes, a gene-to-phenotype score for each phenotype within a set of phenotypes. In some cases, a gene-to-phenotype matrix visually indicates a measure or a degree of correspondence of a gene to a phenotype by a color, a shade, and / or a size of an indicator within the gene-to-phenotype matrix.
[0053] In some embodiments, the variant-to-phenotype prediction system determines “dissimilarity weights” indicating respective measures of diversity between individual phenotypes represented within a given gene-to-phenotype matrix and applies the dissimilarity weights to the given gene-to-phenotype matrix to determine individual gene-to-phenotype scores (also referred to herein as “weighted gene-to-phenotype scores”) for respective individual genes of a respective set of genes represented within the given gene-to-phenotype matrix (e.g., as discussed below in relation to FIG. 5).
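The weighting step can be sketched as follows, with the caveat that this excerpt does not specify how the dissimilarity weights are computed. The example below assumes, purely for illustration, that a pairwise phenotype similarity matrix is available, that each phenotype's weight is its mean dissimilarity (one minus similarity) to the other phenotypes, and that weights are normalized to sum to one. The function names, phenotype ordering, and values are hypothetical.

```python
from typing import Dict, List


def dissimilarity_weights(similarity: List[List[float]]) -> List[float]:
    """Weight each phenotype by its mean dissimilarity to the other phenotypes.

    ASSUMPTION: this formula is illustrative and is not taken from the disclosure.
    """
    n = len(similarity)
    if n == 1:
        return [1.0]
    raw = [
        sum(1.0 - similarity[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    total = sum(raw)
    if total == 0:
        # All phenotypes are identical; fall back to uniform weights.
        return [1.0 / n] * n
    return [value / total for value in raw]


def weighted_gene_scores(
    matrix: Dict[str, List[float]], weights: List[float]
) -> Dict[str, float]:
    """Collapse each gene's row of gene-to-phenotype scores into one weighted score."""
    return {
        gene: sum(w * score for w, score in zip(weights, row))
        for gene, row in matrix.items()
    }


# Hypothetical similarity between three phenotypes (symmetric, 1.0 on the diagonal).
SIMILARITY = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]

# Hypothetical gene-to-phenotype matrix: one row per gene, one column per phenotype.
GENE_TO_PHENOTYPE = {
    "GENE_A": [0.9, 0.8, 0.1],
    "GENE_B": [0.1, 0.2, 0.9],
}

weights = dissimilarity_weights(SIMILARITY)
weighted_scores = weighted_gene_scores(GENE_TO_PHENOTYPE, weights)
```

Under this assumed scheme, two closely related phenotypes share their influence, so a gene that scores highly on both is not counted twice relative to a gene that scores highly on one distinct phenotype.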
[0054] As used herein, the term “variant-to-phenotype machine-learning model” refers to a machine-learning model that determines links or relationships between variant nucleotides of variant genes and phenotypes (e.g., indicated by phenotype labels determined or otherwise selected for an organism). For example, a variant-to-phenotype machine-learning model determines scores that measure or quantify a degree to which a variant nucleotide of a gene affects an expression of a phenotype. A variant-to-phenotype machine-learning model can include, for example, a logistic regression model, a random forest model, or some other model architecture that processes one or more of the aforementioned gene-level features, variant-level features, gene-to-phenotype scores, or other related inputs to generate variant-to-phenotype scores indicating probabilities of variants of a sample / organism being associated with (affecting expression of) phenotypes exhibited by (or otherwise selected for) the sample / organism (e.g., as discussed below in relation to FIGS. 3 and 6).
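As a minimal sketch of the logistic regression option named above, the following Python example combines a few gene-level features, variant-level features, and a gene-to-phenotype score into a probability-like variant-to-phenotype score. The feature names, coefficient values, and bias are hypothetical placeholders; in practice such parameters would be learned from training data, and nothing here reflects the parameters of the disclosed model.

```python
import math
from typing import Dict, List, Tuple

# ASSUMPTION: hand-picked coefficients for illustration; a trained model would learn these.
WEIGHTS: Dict[str, float] = {
    "pathogenicity_score": 3.0,      # variant-level feature
    "splice_site_score": 1.5,        # variant-level feature
    "allele_frequency": -20.0,       # variant-level feature (common variants score lower)
    "gene_pli": 1.0,                 # gene-level feature
    "gene_to_phenotype_score": 2.5,  # gene-to-phenotype score for the variant's gene
}
BIAS = -4.0


def variant_to_phenotype_score(
    features: Dict[str, float],
    weights: Dict[str, float] = WEIGHTS,
    bias: float = BIAS,
) -> float:
    """Logistic model: sigmoid of a weighted sum of the input features."""
    z = bias + sum(w * features.get(name, 0.0) for name, w in weights.items())
    return 1.0 / (1.0 + math.exp(-z))


def rank_variants(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Score each candidate variant and sort from highest to lowest score."""
    scored = [(vid, variant_to_phenotype_score(f)) for vid, f in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate variants with made-up feature values.
CANDIDATES = {
    "var_A": {
        "pathogenicity_score": 0.95, "splice_site_score": 0.1,
        "allele_frequency": 0.0001, "gene_pli": 0.9,
        "gene_to_phenotype_score": 0.8,
    },
    "var_B": {
        "pathogenicity_score": 0.2, "splice_site_score": 0.0,
        "allele_frequency": 0.05, "gene_pli": 0.1,
        "gene_to_phenotype_score": 0.1,
    },
}

ranked = rank_variants(CANDIDATES)
```

A ranking of this kind is one way the highest-scoring variants could be surfaced as candidate diagnostic variants; a random forest or other architecture could be substituted without changing the inputs.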
[0055] Relatedly, as used herein, the term “variant-to-phenotype score” refers to a score or a metric that represents or defines a relationship between a variant nucleotide and a phenotype (e.g., as indicated by a particular phenotype label). For example, a variant-to-phenotype score indicates a probability of a particular variant nucleotide of an organism corresponding to, being associated with, or causing the expression of a phenotype (e.g., a characteristic or disease) observed within, exhibited by, or otherwise determined for or assigned to the organism.
[0056] As further used herein, the term “diagnostic variant” refers to a variant nucleotide or variant gene that is diagnosed as corresponding to, or impacting the expression of, a particular phenotype. For example, a diagnostic variant includes a variant that affects the expression of a certain genetic condition or disease within an organism. Relatedly, as used herein, the term “causative variant” refers to a variant nucleotide or variant gene that is directly responsible for causing a disease or phenotype. While a diagnostic variant is identified as being associated with a particular disease or condition (e.g., based on variant-to-phenotype scores generated by embodiments of the variant-to-phenotype prediction system), a causative variant is understood to be directly causative of a given disease or phenotype (e.g., as supported by substantial clinical evidence).
[0057] As used herein, the term “conversational agent” refers to a system or application configured to engage in natural language interactions with a user. In some embodiments, for example, a conversational agent comprises a machine learning-based model trained to receive and respond to user inputs (e.g., questions, queries, requests for information). Relatedly, as used herein, the term “chatbot user interface” refers to an interface on a client device that receives user inputs and provides responses thereto, such as responses generated by a conversational agent (e.g., as discussed below in relation to FIG. 7).
[0058] As further used herein, the term “pathogenicity” refers to the ability or tendency of a biological molecule to contribute to disease within a host organism. In particular, pathogenicity can refer to the ability or tendency of a biological molecule to cause or lead to the susceptibility of disease within the host organism. For example, pathogenicity can refer to the ability or tendency of a variant nucleotide encoding an amino acid (or an amino acid itself) to cause a protein including the amino acid to function in a manner that causes disease within a host organism or leads to susceptibility of the host organism to disease. More specifically, in some cases, pathogenicity refers to the ability or tendency of a variant nucleotide or an amino-acid variant within a protein to change the function and / or structure of the protein such that the protein causes or leads to susceptibility to disease.
[0059] Relatedly, the term “pathogenicity score” refers to a score that is indicative of pathogenicity. In particular, a pathogenicity score can refer to a score that indicates the pathogenicity of a nucleotide or an amino acid within a target protein sequence. For instance, in some cases, a pathogenicity score includes a score indicating the pathogenicity of a variant nucleotide or an amino-acid variant included in the target protein sequence. In some instances, a pathogenicity score includes a numerical value where a relatively higher value indicates relatively higher pathogenicity and a relatively lower value indicates relatively lower pathogenicity or vice versa. In some embodiments, a pathogenicity score provides a direct measure of pathogenicity (e.g., the score indicates the level of pathogenicity). In some instances, however, a pathogenicity score provides an indirect measure of pathogenicity, such as by measuring some other characteristic related to pathogenicity (e.g., the depletion of observed variants).
[0060] By contrast, as used herein, the term “splice-site score” refers to a measure of splicing associated with a nucleotide sequence. In particular, a splice-site score can refer to a value or set of values that indicates splicing associated with a nucleotide sequence. For instance, a splice-site score can indicate a probability of one or more variant nucleotides being part of a splice site for one or more nucleotide sequences encoding precursor messenger ribonucleic acid (pre-mRNA) or other RNA.
[0061] The following paragraphs describe the variant-to-phenotype prediction system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a computing system 100 in which a variant-to-phenotype prediction system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes different systems and devices connected via a network 110, including server device(s) 108, a client device 102, a database 120, and a sequencing system 116. While FIG. 1 shows an embodiment of the variant-to-phenotype prediction system 106, this disclosure describes alternative embodiments and configurations below.
[0062] As shown in FIG. 1, the server device(s) 108, the database 120, the client device 102, and the sequencing system 116 can communicate with each other via the network 110. The network 110 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 15.
[0063] As indicated by FIG. 1, the sequencing system 116 comprises a device for sequencing a nucleic acid polymer (e.g., a sequencing device). In some embodiments, the sequencing system 116 uses a sequencing device to analyze nucleic acid segments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems either directly or indirectly on the sequencing device. More particularly, the sequencing system 116 receives and analyzes (e.g., via the sequencing device), within nucleotide-sample slides (e.g., flow cells), nucleic acid sequences extracted from samples. In one or more embodiments, the sequencing system 116 utilizes sequencing-by-synthesis (SBS) to sequence nucleic acid polymers into nucleotide reads. As shown, the sequencing system 116 can also analyze nucleotide reads corresponding to a genomic sample to generate a variant call file 118 (represented by “VCF”). Indeed, the sequencing system 116 can include a computing device, such as a server (e.g., an edge server), to analyze or receive sequencing data from a sequencing device to generate the variant call file 118 comprising variant calls and other variant nucleotide data. In addition or in the alternative to communicating across the network 110, in some embodiments, the sequencing system 116 bypasses the network 110 and communicates directly with the client device 102.
[0064] As further indicated by FIG. 1, the server device(s) 108 may generate, receive, analyze, store, and transmit digital data, such as data for generating variant-to-phenotype scores 114, and / or training a variant-to-phenotype machine-learning model 112 (e.g., utilizing data from the sequencing system 116 or elsewhere). As shown, the server device(s) 108 can house all or part of the variant-to-phenotype prediction system 106, including the variant-to-phenotype machine-learning model 112. Indeed, using the variant-to-phenotype machine-learning model 112, the server device(s) 108 can generate variant-to-phenotype scores 114, where the variant-to-phenotype scores 114 can be used to determine diagnostic variants in combination with sequencing data (e.g., generated by, and received from, the sequencing system 116).
[0065] As further illustrated in FIG. 1, the server device(s) 108 may generate, receive, store, and transmit digital data, such as data for generating the variant-to-phenotype scores 114. For example, the server device(s) 108 may house all or part of the variant-to-phenotype prediction system 106, including the variant-to-phenotype machine-learning model 112 for generating the variant-to-phenotype scores 114 based on gene-level features 122, variant-level features 124, gene-to-phenotype scores 126 (e.g., generated using a gene embedding neural network and / or stored within the database 120), and / or other input data described herein (e.g., in relation to FIG. 2), and the variant call file 118 (e.g., generated by, and received from, the sequencing system 116). Indeed, as shown in FIG. 1, the sequencing system 116 may send (and the server device(s) 108 may receive) call data in the form of the variant call file 118 or “VCF.” The server device(s) 108 may also communicate with the client device 102. In particular, the server device(s) 108 can send data to the client device 102, including the variant call file 118 or other information indicating variant nucleotides, the gene-level features 122, the variant-level features 124, the gene-to-phenotype scores 126, the variant-to-phenotype scores 114, diagnostic variants, or other data.
[0066] In some embodiments, the server device(s) 108 comprise distributed collections of servers where the server device(s) 108 include a number of server devices distributed across the network 110 and located in the same or different physical locations. Further, the server device(s) 108 can each comprise a content server, an application server, a communication server, a web-hosting server, or another type of server.
[0067] As mentioned, and as illustrated in FIG. 1, the variant-to-phenotype prediction system 106 analyzes data, such as variant nucleotides from the variant call file 118 output by the sequencing system 116 along with gene-level features (e.g., the gene-level features 122), variant-level features (e.g., the variant-level features 124), gene-to-phenotype scores (e.g., as described below in relation to FIGS. 4-5), and / or other data, to generate the variant-to-phenotype scores 114 to link variant nucleotides to phenotypes (e.g., diseases). In some cases, the variant-to-phenotype prediction system 106 includes a gene embedding neural network in addition to the variant-to-phenotype machine-learning model 112. In some embodiments, for example, the variant-to-phenotype prediction system 106 utilizes a gene embedding neural network to generate gene-to-phenotype scores. Additionally or alternatively, in some embodiments, the variant-to-phenotype prediction system 106 receives gene-to-phenotype scores for one or more genes from the database 120.
[0068] As further illustrated and indicated in FIG. 1, the client device 102 can generate, store, receive, and send digital data. In particular, the client device 102 can receive data representing the gene-level features 122, the variant-level features 124, the gene-to-phenotype scores 126, the variant-to-phenotype scores 114, and / or other data. Furthermore, the client device 102 may communicate with the server device(s) 108 or the sequencing system 116 to receive the variant call file 118. The client device 102 can accordingly present or display information pertaining to generating the variant-to-phenotype scores 114 within a graphical user interface to a user associated with the client device 102. The client device 102 illustrated in FIG. 1 may comprise various types of client devices. For example, in some embodiments, the client device 102 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device 102 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 102 are discussed below with respect to FIG. 15.
[0069] As further illustrated in FIG. 1, the client device 102 includes a client application 104. The client application 104 may be a web application or a native application stored and executed on the client device 102 (e.g., a mobile application, desktop application). The client application 104 can include instructions that (when executed) cause the client device 102 to receive data from the variant-to-phenotype prediction system 106 and present, for display at the client device 102, data pertaining to the variant-to-phenotype scores 114 and / or variant diagnostics (e.g., as further described below in relation to FIG. 7).
[0070] As further illustrated in FIG. 1, in some embodiments, the variant-to-phenotype prediction system 106 may be located on the client device 102 as part of the client application 104 or on the sequencing system 116 (e.g., downloaded in whole or in part). Accordingly, in some embodiments, the variant-to-phenotype prediction system 106 is implemented by (e.g., located entirely or in part on) the client device 102. In yet other embodiments, the variant-to-phenotype prediction system 106 is implemented by one or more other components of the computing system 100, such as the sequencing system 116. In particular, the variant-to-phenotype prediction system 106 can be implemented in a variety of different ways across the server device(s) 108, the network 110, the client device 102, and the sequencing system 116. For example, the variant-to-phenotype prediction system 106 can be downloaded from the server device(s) 108 to the client device 102 and / or to the sequencing system 116 where all or part of the functionality of the variant-to-phenotype prediction system 106 is performed at each respective device within the computing system 100.
[0071] As further illustrated in FIG. 1, the computing system 100 includes the database 120. The database 120 can store information, such as variant call files (e.g., the variant call file 118), gene-to-gene graphs, gene embeddings, the gene-level features 122, the variant-level features 124, the gene-to-phenotype scores 126, and / or other data described herein. In some embodiments, the server device(s) 108, the client device 102, and / or the sequencing system 116 communicate with the database 120 (e.g., via the network 110) to store and / or access information. In some cases, the database 120 also stores one or more models, such as the aforementioned gene embedding neural network and / or the variant-to-phenotype machine-learning model 112.
[0072] Though FIG. 1 illustrates the components of the computing system 100 communicating via the network 110, in certain implementations, the components of computing system 100 can also communicate directly with each other, bypassing the network 110. For instance, and as previously mentioned, in some implementations, the client device 102 communicates directly with the sequencing system 116. Additionally, in some embodiments, the client device 102 communicates directly with the variant-to-phenotype prediction system 106. Moreover, the variant-to-phenotype prediction system 106 can access one or more databases housed on or accessed by the server device(s) 108 or elsewhere in the computing system 100.
[0073] As mentioned, in some embodiments, the variant-to-phenotype prediction system 106 uses one or more machine-learning models to generate variant-to-phenotype scores based on a variety of inputs in relation to an organism and / or a genomic sample thereof. For example, FIG. 2 provides an overview of the variant-to-phenotype prediction system 106 generating a patient genetic report 216 for a subject organism (e.g., a patient) based on one or more of a variety of input features, metrics, and data sources.
[0074] As shown in FIG. 2, the variant-to-phenotype prediction system 106 generates, identifies, or otherwise accepts as input a sequencing data file 202, such as a variant call format (VCF) file, corresponding to a genomic sample for an organism (e.g., a human patient). In addition to the sequencing data file 202, the variant-to-phenotype prediction system 106 can identify (or receive) relevant information for the organism from a variety of sources, such as shown in FIG. 2, and process the respective information using a variant-to-phenotype machine-learning model 204 to generate variant-to-phenotype scores and, in some cases, determine one or more diagnostic variants to include in the patient genetic report 216. Also, in some embodiments, having generated the patient genetic report 216, the variant-to-phenotype prediction system 106 can provide automated responses to user inputs regarding the patient genetic report 216 and genetic information related thereto (e.g., clinical data) via a conversational agent 218 (e.g., as described below in relation to FIG. 7).
[0075] As illustrated in FIG. 2, the variant-to-phenotype machine-learning model 204 can be configured and trained to process inputs from one or more of numerous sources to generate variant-to-phenotype scores indicating probabilities of respective variant nucleotides, identified within the sequencing data file 202, affecting expression of one or more phenotypes or diseases determined for the organism / patient (e.g., as described below in relation to FIG. 3). For instance, the variant-to-phenotype machine-learning model 204 processes one or more gene-level features 212 for a set of genes identified for the organism, such as one or more genes associated with a disease / phenotype of disease / phenotypes 208 of the organism or otherwise selected for / assigned to the organism. In some embodiments, in addition to the one or more gene-level features 212, the variant-to-phenotype prediction system 106 uses the variant-to-phenotype machine-learning model 204 to process one or more variant-level features 214 for variant nucleotides within the set of genes, as indicated within the sequencing data file 202. Moreover, in some embodiments, the variant-to-phenotype prediction system 106 can also utilize a gene embedding neural network 206 to generate gene-to-phenotype scores 207 (e.g., as described below in relation to FIGS. 4-5) representing or indicating respective probabilities of the genes corresponding to, being associated with, or causing the expression of phenotypes (e.g., phenotypes of the disease / phenotypes 208).
[0076] As also shown in FIG. 2, in some embodiments, the variant-to-phenotype prediction system 106 utilizes sequencing information and clinical results from one or more population sequencing cohorts 210 to train the variant-to-phenotype machine-learning model 204 and / or to determine additional features / metrics to be processed by the variant-to-phenotype machine-learning model 204 to generate variant-to-phenotype scores (e.g., as described below in relation to FIG. 8).
[0077] As mentioned, in some embodiments, the variant-to-phenotype prediction system 106 identifies one or more variant nucleotides corresponding to, or impacting the expression of, an organism’s disease or phenotype(s). For example, FIG. 3 illustrates the variant-to-phenotype prediction system 106 determining one or more diagnostic variants 330 corresponding to, or impacting the expression of, a set of phenotype labels 304 determined for an organism in accordance with one or more embodiments.
[0078] As shown in FIG. 3, the variant-to-phenotype prediction system 106 identifies (or receives) a sequencing data file 302 (e.g., a VCF file) for an organism, the sequencing data file 302 indicating variant nucleotides 308 identified within a genomic sample for the organism. In some cases, the variant-to-phenotype prediction system 106 identifies, determines, or otherwise assigns the set of phenotype labels 304 for the organism and identifies a set of genes 306 associated with the set of phenotype labels 304. In some embodiments, for example, the variant-to-phenotype prediction system 106 identifies the set of genes 306 from genes given labels for certain phenotypes as a result of clinical and other studies.
[0079] Moreover, in some embodiments, the variant-to-phenotype prediction system 106 processes the set of phenotype labels 304 utilizing a disease panel imputation model to identify the set of genes 306 (e.g., as discussed below in relation to FIG. 6). Alternatively, in one or more embodiments, the variant-to-phenotype prediction system 106 can include any or all genes within the set of genes 306 without reference to any particular phenotype (e.g., user-selected genes or all genes having variant nucleotides within a given genomic region).
[0080] As further illustrated in FIG. 3, having identified the set of genes 306 for the organism, the variant-to-phenotype prediction system 106 accesses (or otherwise determines) gene-level features 316 for one or more genes of the set of genes 306 in relation to a reference genome of the organism. Based at least on the gene-level features 316, the variant-to-phenotype prediction system 106 utilizes a variant-to-phenotype machine-learning model 320 to generate variant-to-phenotype scores 322 linking one or more of the variant nucleotides 308 to phenotypes of the set of phenotype labels 304.
[0081] Examples of gene-level features for a given gene include, but are not limited to, a length of the given gene (e.g., measured by number of nucleobases, span of genomic coordinates); a probability of the given gene being Loss-of-Function Intolerant (pLI); a haploinsufficiency score (e.g., a population-based probability of a loss of one allele of the given gene, a Haploinsufficiency Prediction (HIPred) computational score, or a probability for a Growth Hormone Insensitivity Syndrome (GHIS) haploinsufficiency); a population-based probability of a gene being recessive (e.g., from gnomAD’s probability of being recessive (pRec)); a gene indispensability score indicating a degree to which the given gene is essential to cellular or organismal functions (e.g., a gene metric from Gene Variation Intolerance Rank (GeVIR)); a likelihood of tolerance score indicating a degree to which the given gene is tolerant to mutations (e.g., a LoT_Z score, a Mutation Tolerance Score (MTR), Residual Variation Intolerance Score (RVIS)); or a gene expression level for the given gene across a tissue or a set of tissues (e.g., mean and / or standard deviations as measured by the Genotype-Tissue Expression (GTEx) project).
[0082] As further discussed below, the variant-to-phenotype prediction system 106 can access and / or provide additional inputs to the variant-to-phenotype machine-learning model 320, including, but not limited to, variant-level features 318, gene-to-phenotype scores 314, and other features not categorized as the gene-level features 316, variant-level features 318, or the gene-to-phenotype scores 314. For example, such other features include but are not limited to a count (e.g., a pPatho metric) or ratio (e.g., a nPatho metric) of clinically identified pathogenic variants in an exon where a given variant nucleotide is located, an indication (e.g., an in canonical metric) of whether a given variant nucleotide is located in a reference transcript (e.g., a canonical transcript from the ENSEMBL database of the European Bioinformatics Institute) for the gene of the given variant nucleotide, a count (e.g., an AACpatho metric or an AACbenighn metric) or proportion (e.g., a pAAPath metric) of clinically identified pathogenic or benign variants in an amino acid of a given variant, a count (e.g., a nPathoNS metric, a nPathoSNV metric, or a nPathoCat metric) or proportion (e.g., a pPathoNS metric, a pPathSNV metric, or a pPathoCat metric) of clinically identified pathogenic stop-gain variants and / or pathogenic missense variants in an exon corresponding to a given variant nucleotide, an indication (e.g., a last exon metric) of whether a given variant nucleotide corresponds to a final exon in a respective gene sequence, or an indication (e.g., a PTC50 metric) of whether a given variant nucleotide corresponds to a location within a threshold number of nucleobases (e.g., 50 nucleobases) from an exon end.
[0083] As also shown in FIG. 3, in some embodiments, the variant-to-phenotype prediction system 106 identifies, from the set of genes 306, one or more variant genes of the variant genes 310 comprising variant nucleotides from the variant nucleotides 308 indicated within the sequencing data file 302. Having identified the variant genes 310, in some embodiments, the variant-to-phenotype prediction system 106 accesses gene-to-phenotype scores 314 generated by a gene embedding neural network 312, the gene-to-phenotype scores 314 indicating respective probabilities of the variant genes 310 being associated with respective phenotypes of the set of phenotype labels 304.
[0084] The gene-to-phenotype scores 314 can include one or more types of scores that are specific to a phenotype or disease or that have been weighted for phenotype labels. In some embodiments, for example, the gene-to-phenotype scores 314 for respective genes of the variant genes 310 can include one or more of (i) a phenotype-specific gene-to-phenotype score for an individual phenotype label of the set of phenotype labels 304 (e.g., Ghpo3), (ii) a disease-specific gene-to-phenotype score for a disease associated with multiple phenotype labels (e.g., Gspec), or (iii) a weighted gene-to-phenotype score corresponding to multiple phenotype labels of the set of phenotype labels 304 (e.g., Ggroup). The different types of such gene-to-phenotype scores are described further below in relation to FIG. 5.
[0085] As mentioned, in some embodiments, the variant-to-phenotype prediction system 106 also accesses (or otherwise determines) variant-level features 318 for the variant nucleotides 308 indicated within the sequencing data file 302. In some cases, the variant-level features 318 are limited to variant nucleotides located within the set of genes 306. As shown in FIG. 3, the variant-to-phenotype prediction system 106 can process the variant-level features 318 in addition to the gene-level features 316 and / or the gene-to-phenotype scores 314 using the variant-to-phenotype machine-learning model 320 to generate the variant-to-phenotype scores 322 for the organism. Based on the variant-to-phenotype scores 322, the variant-to-phenotype prediction system 106 determines the one or more diagnostic variants 330 for the organism / patient (e.g., as further described below in relation to FIG. 6).
[0086] Examples of variant-level features include, but are not limited to, a pathogenicity score (e.g., from a PrimateAI model) indicating a degree to which variant proteins corresponding to variant nucleotides are benign or pathogenic; a splice-site score (e.g., from a SpliceAI model) indicating a probability that the variant nucleotides are part of a splice site or a non-splice site for pre-messenger RNA or other RNA; an allele frequency corresponding to the variant nucleotides (e.g., from a gnomADg AF model, 1000 Genomes Project, UK BioBank, TOPMed); a genotype for a variant gene of a genomic sample; a loss-of-function status indicating that a variant nucleotide reduces or destroys protein function (e.g., from gnomAD, Clinical Variant (ClinVar) database of the National Center for Biotechnology Information (NCBI), Human Knockout Project from the MacArthur Lab); a de novo status of the variant nucleotides as private variants; a pathogenicity status as indicated by clinical data (e.g., as indicated in the ClinVar database); an enhancer score (e.g., from an Enformer model) indicating a probability of the variant nucleotides being part of an enhancer gene sequence that regulates gene expression or, alternatively, chromatin state or binding signals; a promoter sequence score (e.g., from a PromoterAI model, an Xpresso model, or a CRMnet model) indicating a probability of the variant nucleotides being part of a promoter sequence that regulates gene expression; or an untranslated region score (e.g., from an EPInformer model) indicating a degree to which a variant nucleotide within untranslated regions of respective genes is benign or pathogenic.
In some implementations, the variant-to-phenotype prediction system 106 uses Primate Artificial Intelligence (PrimateAI), Primate Artificial Intelligence 3D (PrimateAI-3D), Combined Annotation-Dependent Depletion (CADD), Rare Exome Variant Ensemble Learner (REVEL), ClinPred, Mendelian Clinically Applicable Pathogenicity (M-CAP), and / or other pathogenicity scoring systems.
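As a minimal sketch of how gene-level features, variant-level features, and a gene-to-phenotype score could be combined into one model input per variant nucleotide, the following Python example builds a flat feature row. The field names (pli, gene_length, pathogenicity, splice_score, allele_frequency), the feature ordering, and the default of 0.0 for a missing gene-to-phenotype score are illustrative assumptions and are not a feature layout specified by this disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical per-variant record derived from a sequencing data file."""
    variant_id: str
    gene: str
    pathogenicity: float      # assumed: a pathogenicity-style score in [0, 1]
    splice_score: float       # assumed: a splice-site-style score in [0, 1]
    allele_frequency: float   # population allele frequency


def build_feature_row(
    variant: Variant,
    gene_features: Dict[str, Dict[str, float]],
    gene_to_phenotype: Dict[str, float],
) -> List[float]:
    """Concatenate gene-level, variant-level, and gene-to-phenotype inputs."""
    gene = gene_features[variant.gene]
    return [
        gene["pli"],                                # gene-level feature
        gene["gene_length"],                        # gene-level feature
        variant.pathogenicity,                      # variant-level feature
        variant.splice_score,                       # variant-level feature
        variant.allele_frequency,                   # variant-level feature
        gene_to_phenotype.get(variant.gene, 0.0),   # gene-to-phenotype score
    ]


example_variant = Variant("var_1", "GENE_A", 0.92, 0.05, 0.0001)
example_row = build_feature_row(
    example_variant,
    {"GENE_A": {"pli": 0.99, "gene_length": 12000.0}},
    {"GENE_A": 0.8},
)
```

In such a sketch, each row would be one input to a variant-to-phenotype model; an actual embodiment may use many more features, different encodings, or per-phenotype scores.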
[0087] As further illustrated in FIG. 3, the variant-to-phenotype prediction system 106 can access additional information for the organism / patient as a basis for determining the one or more diagnostic variants 330. For instance, the variant-to-phenotype prediction system 106 can access and process an inheritance pattern 324 indicating a zygosity of the genomic sample for respective variants of the variant nucleotides 308 (e.g., as indicated in the sequencing data file 302) and / or a family analysis 326 for the organism / patient. In some cases, for example, the variant-to-phenotype prediction system 106 utilizes the inheritance pattern 324 and / or the family analysis 326 to identify recessive or dominant occurrences of the variant genes 310 and weighs the variant genes 310 accordingly, in addition to the variant-to-phenotype scores 322, when identifying the one or more diagnostic variants 330 amongst the variant nucleotides 308.
[0088] As indicated above, the variant-to-phenotype prediction system 106 can determine links or relationships between variant nucleotides and observed / exhibited phenotypes using the machine learning techniques described herein. In some embodiments, as at least one basis for determining that a variant affects phenotype expression, the variant-to-phenotype prediction system 106 can also determine relationships between genes and phenotypes (e.g., in the form of gene-to-phenotype scores for input to a variant-to-phenotype machine-learning model) using a gene embedding neural network. For this purpose, in certain cases, the variant-to-phenotype prediction system 106 can train or build a gene embedding neural network using a two-stage training process. FIG. 4 provides an overview of the two-stage training process for the gene embedding neural network in accordance with one or more embodiments.
[0089] As illustrated in FIG. 4, the variant-to-phenotype prediction system 106 identifies, receives, or accesses gene-to-gene graphs 402. In particular, the variant-to-phenotype prediction system 106 accesses gene-to-gene graphs 402 that visually represent relationships between genes. For instance, the variant-to-phenotype prediction system 106 accesses an external database storing gene-to-gene graphs, such as a gene co-expression graph (GTEx), a protein-protein interaction graph (STRING), a sequence similarity graph, or a genetic interactions graph (CRISPRi gene pairs).
[0090] As further illustrated in FIG. 4, the variant-to-phenotype prediction system 106 performs unsupervised training 404. More specifically, as the first stage of the two-stage training process for learning parameters of a gene embedding neural network, the variant-to-phenotype prediction system 106 utilizes an unsupervised process (without ground truth labels) to learn a graph structure of the gene-to-gene graphs 402 (e.g., of one graph or more than one graph). For instance, the variant-to-phenotype prediction system 106 inputs one or more of the gene-to-gene graphs 402 (or a combination of two or more of the graphs) into a gene embedding neural network (e.g., gene embedding neural network 312). Based on the one or more gene-to-gene graphs of the gene-to-gene graphs 402, the gene embedding neural network generates a predicted gene embedding, and the variant-to-phenotype prediction system 106 adjusts network parameters (e.g., weights and biases) based on the predicted gene embedding. As shown, the variant-to-phenotype prediction system 106 repeats the process over training iterations so that eventually the gene embedding neural network learns parameters that accurately model or encode the structure of the input encoding the gene-to-gene graphs 402. Accordingly, the variant-to-phenotype prediction system 106 generates gene embeddings 406 that represent or encode the genes within a latent space, where the distance between the gene embeddings represents the relationships between the genes (e.g., as indicated by the edges of the gene-to-gene graphs 402).
[0091] As further illustrated in FIG. 4, the variant-to-phenotype prediction system 106 performs a supervised fine-tuning 408 (with ground truth labels for phenotypes) as the second stage within the two-stage training process. To elaborate, the variant-to-phenotype prediction system 106 fine-tunes the parameters learned from the unsupervised training 404 (e.g., the first stage) to account for phenotype labels. For instance, the variant-to-phenotype prediction system 106 links genes to phenotypes by encoding or scoring gene-to-phenotype relationships through the supervised fine-tuning. Specifically, the variant-to-phenotype prediction system 106 tunes network parameters on an attention-based classification task to classify genes into respective phenotype categories or labels. For example, for a gene that has a clinically annotated phenotype label, the variant-to-phenotype prediction system 106 learns its relationships to nearby genes (e.g., given by the gene embeddings 406) and further links the nearby genes to the phenotype label (e.g., where genes farther from the labeled gene have weaker links than closer genes).
[0092] Accordingly, the variant-to-phenotype prediction system 106 trains the gene embedding neural network to generate a gene-to-phenotype matrix 410 that represents or includes gene-to-phenotype scores. Indeed, for each gene in the matrix, the variant-to-phenotype prediction system 106 determines a score for each of a set of phenotype labels. As shown, higher gene-to-phenotype scores are represented by larger dots and lower gene-to-phenotype scores are represented by smaller dots (where some dots are invisible because the scores are negligibly small). Furthermore, in some embodiments, the variant-to-phenotype prediction system 106 trains and / or implements a gene embedding neural network to generate gene-to-phenotype scores as described in “LINKING HUMAN GENES TO CLINICAL PHENOTYPES USING GRAPH NEURAL NETWORKS,” Int’l Patent Application No. PCT / US2024 / 032170, filed June 2, 2024, which is incorporated herein by reference in its entirety.
[0093] As mentioned, in some embodiments, the variant-to-phenotype prediction system 106 applies dissimilarity weights to a gene-to-phenotype matrix to determine one or more weighted gene-to-phenotype scores indicating respective probabilities of individual genes being associated with a set of phenotype labels determined for an organism. For example, FIG. 5 illustrates the variant-to-phenotype prediction system 106 determining individual weighted gene-to-phenotype scores 512 for respective individual genes of variant genes 504, each individual weighted gene-to-phenotype score of the weighted gene-to-phenotype scores 512 indicating a probability of an individual gene of the variant genes 504 being associated with phenotype labels 502 determined for an organism (e.g., a patient having the variant genes 504).
[0094] As shown in FIG. 5, the variant-to-phenotype prediction system 106 identifies the variant genes 504 for an organism and, in accordance with one or more embodiments, generates or otherwise accesses a gene-to-phenotype matrix 505 indicating phenotype-specific gene-to-phenotype scores for respective phenotypes of the phenotype labels 502. As illustrated, for example, the gene-to-phenotype matrix 505 includes a gene-to-phenotype score individually associated with one of the phenotype labels 502 and one of the variant genes 504. In this example, the gene-to-phenotype matrix 505 comprises a total of nine gene-to-phenotype scores (3 phenotype labels x 3 variant genes).
[0095] As also shown in FIG. 5, the variant-to-phenotype prediction system 106 identifies, determines, assigns, or otherwise selects the phenotype labels 502 for a subject organism / patient and determines dissimilarity weights 510 (e.g., within a dissimilarity matrix 506). The dissimilarity weights 510 indicate respective measures of diversity between the phenotype labels 502. As illustrated, for example, the phenotypes of a hearing abnormality and a mild neurosensory hearing impairment are relatively similar to one another in comparison with the phenotype of a visual impairment. Accordingly, in the illustrated example, the variant-to-phenotype prediction system 106 determines dissimilarity weights 510 of 0.26 and 0.27 for the former phenotypes (labelled “HPO1” and “HPO2”) and 0.47 for the latter phenotype (labelled “HPO3”).
[0096] As further illustrated in FIG. 5, the variant-to-phenotype prediction system 106 combines phenotype-specific gene-to-phenotype scores for each variant gene represented in the gene-to-phenotype matrix 505 while applying the respective dissimilarity weights 510 thereto to generate the weighted gene-to-phenotype scores 512 for individual genes of the variant genes 504. By applying the dissimilarity weights 510 to the gene-to-phenotype matrix 505 as shown, the variant-to-phenotype prediction system 106 generates combined scores (that is, the weighted gene-to-phenotype scores 512) that account for redundancies in the phenotype labels 502 determined for the organism. In combining such scores in the gene-to-phenotype matrix 505, however, the variant-to-phenotype prediction system 106 gives greater weight (e.g., assigns larger weights) to gene-to-phenotype scores for relatively diverse phenotype labels than to gene-to-phenotype scores for relatively less diverse phenotype labels.
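The weighting described in relation to FIG. 5 can be sketched with a small worked example in Python. The dissimilarity weights (0.26, 0.27, and 0.47 for HPO1, HPO2, and HPO3) are taken from the illustrated example; the gene names, the phenotype-specific scores in the matrix, and the choice of a weighted sum as the combining operation are assumptions made only for illustration.

```python
from typing import Dict

# Dissimilarity weights from the illustrated example of FIG. 5.
DISSIMILARITY_WEIGHTS: Dict[str, float] = {"HPO1": 0.26, "HPO2": 0.27, "HPO3": 0.47}

# Hypothetical 3 x 3 gene-to-phenotype matrix (values are not from the disclosure).
GENE_TO_PHENOTYPE: Dict[str, Dict[str, float]] = {
    "GENE_A": {"HPO1": 0.9, "HPO2": 0.8, "HPO3": 0.1},  # matches the two similar labels
    "GENE_B": {"HPO1": 0.2, "HPO2": 0.1, "HPO3": 0.9},  # matches the one diverse label
    "GENE_C": {"HPO1": 0.5, "HPO2": 0.5, "HPO3": 0.5},
}


def weighted_gene_scores(
    matrix: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Combine phenotype-specific scores per gene as a weighted sum (assumed)."""
    return {
        gene: sum(weights[label] * score for label, score in scores.items())
        for gene, scores in matrix.items()
    }


weighted = weighted_gene_scores(GENE_TO_PHENOTYPE, DISSIMILARITY_WEIGHTS)
for gene, score in weighted.items():
    print(f"{gene}: {score:.3f}")
```

With these assumed values, GENE_A scores highly on the two redundant hearing-related labels and GENE_B scores highly only on the more diverse label, yet their weighted scores are nearly equal (about 0.497 and 0.502). This illustrates how larger weights on diverse labels keep redundant labels from dominating the combined score.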
[0097] As indicated above, in some embodiments, the variant-to-phenotype prediction system 106 can identify a diagnostic variant based at least in part on variant-to-phenotype scores generated by a variant-to-phenotype machine-learning model for variant nucleotides of a genomic sample for an organism. For example, FIG. 6 illustrates the variant-to-phenotype prediction system 106 generating a ranking 624 of variant nucleotides for a target organism (e.g., a patient) based on variant-to-phenotype scores 622 and optionally identifying a diagnostic variant 626 for the target organism based on the ranking 624.
[0098] As shown in FIG. 6, the variant-to-phenotype prediction system 106 identifies (or receives) a sequencing data file 608 (e.g., a VCF file) indicating variant nucleotides 612 identified within a genomic sample of the subject organism / patient. Further, the variant-to-phenotype prediction system 106 identifies (or receives) phenotype labels 602 determined for an organism and generates, utilizing a disease panel imputation model 604, a disease panel 606 comprising a set of genes given labels for the phenotype labels 602. As illustrated, the disease panel imputation model 604 identifies genes given labels for (e.g., associated with, clinically or otherwise) one or more of the phenotype labels 602 and includes the identified genes within the disease panel 606. Accordingly, the variant-to-phenotype prediction system 106 can identify one or more variant genes of the variant genes 610 comprising the variant nucleotides 612 for the organism / patient.
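The panel-building and variant-gene identification steps described in relation to FIG. 6 can be sketched as follows. This Python example substitutes a plain label lookup for the disease panel imputation model 604, which is a simplifying assumption; the gene names, phenotype labels, and variant identifiers are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def impute_disease_panel(
    phenotype_labels: Iterable[str],
    gene_phenotype_labels: Dict[str, Set[str]],
) -> Set[str]:
    """Return genes labeled with at least one of the given phenotype labels."""
    wanted = set(phenotype_labels)
    return {
        gene
        for gene, labels in gene_phenotype_labels.items()
        if wanted & labels
    }


def find_variant_genes(
    panel: Set[str],
    variant_calls: List[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Group variant identifiers by gene, keeping only genes in the panel.

    variant_calls holds (variant_id, gene) pairs, e.g. parsed from a VCF file.
    """
    variant_genes: Dict[str, List[str]] = {}
    for variant_id, gene in variant_calls:
        if gene in panel:
            variant_genes.setdefault(gene, []).append(variant_id)
    return variant_genes


gene_labels = {
    "GENE_A": {"HPO1", "HPO2"},
    "GENE_B": {"HPO3"},
    "GENE_C": {"HPO9"},
}
panel = impute_disease_panel(["HPO1", "HPO3"], gene_labels)
calls = [("var_1", "GENE_A"), ("var_2", "GENE_C"), ("var_3", "GENE_A")]
variant_genes = find_variant_genes(panel, calls)
```

Here GENE_C is excluded because none of its labels match the patient's phenotype labels, so var_2 is not carried forward for scoring.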
[0099] As also described above (e.g., in relation to FIG. 3), the variant-to-phenotype prediction system 106 determines the variant-to-phenotype scores 622 for the variant genes 610 by processing, utilizing a variant-to-phenotype machine-learning model 620, one or more of gene-to-phenotype scores 614, gene-level features 616, or variant-level features 618 for the organism. Based on the variant-to-phenotype scores 622, the variant-to-phenotype prediction system 106 determines the ranking 624 of one or more of the variant nucleotides 612 or, alternatively, the variant genes 610 (e.g., from highest variant-to-phenotype score to lowest variant-to-phenotype score). In some cases, the variant-to-phenotype prediction system 106 further selects one or more diagnostic variants, such as the diagnostic variant 626, based on the ranking 624.
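The ranking step can be illustrated with a short Python sketch that orders variants from highest to lowest variant-to-phenotype score and selects the top entries as candidate diagnostic variants. The function name, the variant identifiers, the scores, and the top_n parameter are hypothetical; an embodiment may additionally weigh inheritance patterns or family analyses before selecting a diagnostic variant.

```python
from typing import Dict, List, Tuple


def rank_variants(
    scores: Dict[str, float],
    top_n: int = 1,
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Rank variants by score (descending) and return the top_n identifiers."""
    ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    candidates = [variant_id for variant_id, _ in ranking[:top_n]]
    return ranking, candidates


example_scores = {"var_1": 0.12, "var_2": 0.91, "var_3": 0.47}
ranking, candidates = rank_variants(example_scores, top_n=1)
```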
[0100] As mentioned, in some embodiments, the variant-to-phenotype prediction system 106 provides, via a chatbot user interface of a client device, a machine learning-based conversation agent for interacting with clinical information related to one or more phenotypes determined for an organism / patient and a patient genetic report generated by the variant-to-phenotype prediction system 106. For example, FIG. 7 illustrates an overview of the variant-to-phenotype prediction system 106 utilizing a machine learning-based conversational agent 706 to generate chat responses 720 to user inputs 718 received via a chatbot user interface 704 of a client device 702.
[0101] As shown in FIG. 7, the variant-to-phenotype prediction system 106 generates, accesses, or otherwise identifies disease / phenotype-related clinical information 708 (e.g., genetic research, clinical studies, and other information related to genetic diseases and / or phenotypes) and trains the machine learning-based conversational agent 706 to formulate responses to queries and other inputs related to the disease / phenotype-related clinical information 708. In some embodiments, the variant-to-phenotype prediction system 106 also trains the machine learning-based conversational agent 706 using exemplary patient genetic reports.
[0102] As illustrated by FIG. 7, in some embodiments, the machine learning-based conversational agent 706 receives as input a patient genetic report 710 generated for a subject organism / patient (e.g., a user of the client device 702 or a representative thereof) according to one or more of the embodiments disclosed herein (e.g., as described above in relation to FIG. 2). For example, the patient genetic report includes variant-to-phenotype scores 712 and diagnostic variant(s) 714 for the subject organism / patient. In some cases, the patient genetic report 710 can also or alternatively include respective false discovery rate(s) 716 determined for the one or more diagnostic variants of the diagnostic variant(s) 714 (e.g., as discussed below in relation to FIG. 11).
[0103] As also shown in FIG. 7, the variant-to-phenotype prediction system 106 provides the chatbot user interface 704 to the client device 702 and receives the user inputs 718 via the chatbot user interface 704. Upon receiving the user inputs 718 via the chatbot user interface 704, the machine learning-based conversational agent 706 generates the chat responses 720 in accordance with the disease / phenotype-related clinical information 708 and / or the patient genetic report 710. In the illustrated example in FIG. 7, for instance, upon receiving a query of “Am I at risk for Alzheimer’s?” via the chatbot user interface 704 on the client device 702, the machine learning-based conversational agent 706 generates, based on the patient genetic report 710, a response of “Yes, you have the APOE E4 gene.”
[0104] Moreover, the variant-to-phenotype prediction system 106 can provide responses to queries related to any of the variant-to-phenotype scores 712, the diagnostic variant(s) 714, or the false discovery rate(s) 716 indicated within the patient genetic report 710, such as queries for specific data and / or queries related to a clinical interpretation of the respective information within the patient genetic report 710. Indeed, by implementing the machine learning-based conversational agent 706 to generate the chat responses 720 in reply to the user inputs 718, the variant-to-phenotype prediction system 106 can provide a user with an automated tool for interacting directly with the patient genetic report 710 and relevant data from the disease / phenotype-related clinical information 708.
[0105] As indicated above, the variant-to-phenotype prediction system 106 can determine links or relationships between variant nucleotides and observed / exhibited phenotypes using a variant-to-phenotype machine-learning model to generate variant-to-phenotype scores indicating probabilities of respective variant nucleotides affecting expression of one or more phenotypes in an organism. For this purpose, the variant-to-phenotype prediction system 106 can train or build the variant-to-phenotype machine-learning model utilizing genomic samples and respective clinical data from one or more cohorts of organisms (e.g., subject patients of prior resolved genetic analyses). FIG. 8 provides an overview of a training process for a variant-to-phenotype machine-learning model 816 to generate variant-to-phenotype scores for target organisms in accordance with one or more embodiments.
[0106] As illustrated in FIG. 8, the variant-to-phenotype prediction system 106 performs an initial training 802 of the variant-to-phenotype machine-learning model 816 utilizing genomic samples and clinical data from one or more cohorts of organisms diagnosed with genetic diseases or phenotypes. In one or more embodiments, for example, the genetic-disease cohort(s) 804 comprise one or more groups of individuals with resolved genetic analyses confirming respective diagnoses for specific genetic diseases. An example of the genetic-disease cohort(s) 804 includes, but is not limited to, a cohort of resolved probands from the Genomics England (GEL) 100,000 Genomes Project.
[0107] As shown in FIG. 8, the variant-to-phenotype prediction system 106 receives (or identifies) respective sequencing data (e.g., VCF files 808) for individual organisms of the genetic-disease cohort(s) 804, as well as respective causative variants 820 (e.g., causative mutations or pathogenic variants that are clinically identified as causing or otherwise affecting expression of a respective individual’s disease or phenotype). Moreover, the variant-to-phenotype prediction system 106 also identifies (or receives) respective disease panels 806 indicating one or more genes given labels for the disease or phenotypes of the individual organisms of the genetic-disease cohort(s) 804.
[0108] As also shown in FIG. 8, based on the disease panels 806 and / or the VCF files 808, the variant-to-phenotype prediction system 106 accesses, determines, or otherwise identifies features or metrics comprising one or more of gene-to-phenotype scores 810, gene-level features 812, or variant-level features 814 for the individual organisms of the genetic-disease cohort(s) 804 and processes the one or more features or metrics utilizing the variant-to-phenotype machine-learning model 816 to generate respective variant-to-phenotype scores 818. Having generated the variant-to-phenotype scores 818, the variant-to-phenotype prediction system 106 performs a model fitting 821 based on a comparison 819 of the variant-to-phenotype scores 818 with the causative variants 820 identified for respective individual organisms of the genetic-disease cohort(s) 804. For example, the variant-to-phenotype prediction system 106 modifies or adjusts parameters of the variant-to-phenotype machine-learning model 816 (e.g., weights, biases, number of trees, maximum depth, and so forth, depending on the architecture of the variant-to-phenotype machine-learning model 816) to reduce a measure of loss indicated by the comparison 819 and uses the adjusted parameters on a subsequent training iteration.
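The iterative fitting loop described above (generate scores, compare them with the clinically identified causative variants, and adjust parameters to reduce a loss) can be sketched with a deliberately simple stand-in model. The Python example below uses logistic regression trained by gradient descent, which is a swapped-in technique chosen for brevity and is not the architecture of the variant-to-phenotype machine-learning model 816. The two input features, the training rows, the learning rate, and the number of epochs are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Return a score in (0, 1) for one feature row."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit(
    rows: List[Sequence[float]],
    labels: List[int],
    learning_rate: float = 0.5,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit weights and a bias by gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Comparison step: predicted score versus causative label (1 or 0).
            error = predict(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Adjust parameters; the updated values are used on the next iteration.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical rows: [pathogenicity-style score, gene-to-phenotype score].
training_rows = [[0.9, 0.8], [0.8, 0.9], [0.95, 0.7], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
causative_labels = [1, 1, 1, 0, 0, 0]  # 1 = clinically identified causative variant

fitted_weights, fitted_bias = fit(training_rows, causative_labels)
high_score = predict([0.9, 0.9], fitted_weights, fitted_bias)
low_score = predict([0.1, 0.1], fitted_weights, fitted_bias)
```

For tree-based architectures such as a random forest, fitting would instead involve choosing settings such as the number of trees and maximum depth and growing the trees on the same labeled rows.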
[0109] As mentioned above, in some embodiments, the variant-to-phenotype prediction system 106 performs a further training 822 in addition to the initial training 802. As shown in FIG. 8, for example, the variant-to-phenotype prediction system 106 utilizes, during the further training 822, one or more control cohort(s) 824 of organisms not diagnosed with a target genetic disease. In one or more embodiments, for example, the control cohort(s) 824 comprise one or more groups of individuals with resolved genetic analyses that are unaffected and / or not diagnosed with a particular phenotype or disease. The control cohort(s) 824 include, but are not limited to, a control cohort of resolved genomes from the UK Biobank (UKBB) database and / or unaffected family members of diagnosed probands and / or control specimens from the Genomics England (GEL) 100,000 Genomes Project.
[0110] As with the genetic-disease cohort(s) 804, the variant-to-phenotype prediction system 106 receives (or identifies) respective sequencing data (e.g., VCF files 828) for individual organisms of the control cohort(s) 824. As illustrated in FIG. 8, the variant-to-phenotype prediction system 106 also determines respectively assigned disease panels 826 for individual organisms of the control cohort(s) 824. In some embodiments, for example, the assigned disease panels 826 are based on one or more target phenotypes (e.g., phenotypes associated with a target disease) assigned to individual organisms of the control cohort(s) 824 for purposes of the further training 822.
[0111] Accordingly, based on the VCF files 828 and / or the assigned disease panels 826, the variant-to-phenotype prediction system 106 accesses, determines, or otherwise identifies features or metrics comprising one or more of gene-to-phenotype scores 830, gene-level features 832, or variant-level features 834 (e.g., consistent with the features or metrics considered during the initial training 802) and utilizes the variant-to-phenotype machine-learning model 816 to generate variant-to-phenotype-specific score distributions 836 of variant-to-phenotype scores for the target phenotypes respectively assigned to the individual organisms of the control cohort(s) 824. Based on the variant-to-phenotype-specific score distributions 836, the variant-to-phenotype prediction system 106 further performs the model fitting 821. In embodiments comprising a random forest architecture, for example, the variant-to-phenotype prediction system 106 can implement the variant-to-phenotype-specific score distributions 836 as an additional feature (e.g., an additional tree or branch). Additionally or alternatively, in some embodiments, the variant-to-phenotype prediction system 106 utilizes the variant-to-phenotype machine-learning model 816 to process the variant-to-phenotype-specific score distributions 836 as an additional input feature for generating variant-to-phenotype scores.
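One way a control-cohort score distribution could be turned into an additional input feature is sketched below: for a candidate score, compute the fraction of control-cohort scores for the same target phenotype that are at least as high. This empirical tail fraction is an assumption made for illustration; the disclosure does not specify how the score distributions 836 are encoded, and the control scores shown are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def control_tail_fraction(score: float, control_scores: Sequence[float]) -> float:
    """Fraction of control-cohort scores greater than or equal to `score`.

    A small value means that unaffected individuals rarely reach this score
    for the assigned target phenotype.
    """
    ordered = sorted(control_scores)
    below = bisect_left(ordered, score)  # count of control scores strictly below
    return (len(ordered) - below) / len(ordered)


# Hypothetical control-cohort scores for one target phenotype.
control_distribution = [0.05, 0.10, 0.12, 0.20, 0.35, 0.40, 0.55, 0.60, 0.72, 0.90]

feature_for_candidate = control_tail_fraction(0.72, control_distribution)
```

In this sketch, a candidate score of 0.72 is matched or exceeded by two of the ten control scores, giving a feature value of 0.2 that could be appended to the feature row for that variant.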
[0112] As mentioned above, in certain embodiments, the variant-to-phenotype prediction system 106 identifies variant nucleotides that affect expression of phenotypes of target organisms with increased accuracy relative to existing genetic testing systems. To illustrate, FIGS. 9A and 9B show comparative experimental results of the variant-to-phenotype prediction system 106 identifying variant nucleotides that affect expression of phenotypes with increased accuracy over an existing system. In particular, FIGS. 9A-9B show comparative results of identifying diagnostic variants utilizing (i) the variant-to-phenotype prediction system 106 (labelled as “VPP System 106” in FIGS. 9A-9B) and (ii) the existing system known as “Exomiser” (labelled as “Existing System” in FIGS. 9A-9B).
[0113] For instance, FIG. 9A illustrates comparative experimental results in a bar graph portraying the accuracy of identifying the top 1 to 10 diagnostic variants for target organisms (e.g., the N most likely variants to be causative (or otherwise affect expression) of the respective organism’s phenotype or disease). More specifically, the bar graph of FIG. 9A indicates respective levels of concordance, expressed as a percentage, with a phenotype-affecting nucleotide variant identified by clinical genetic analysis for each of the top 1, top 2, top 3, top 4, top 5, and top 10 variants (across the horizontal axis) most likely to be causative of or otherwise affect expression of the respective organisms’ phenotypes (labelled as “Top N accuracy (%)” in FIG. 9A). For the results achieved utilizing the variant-to-phenotype prediction system 106, for example, the top N variants correspond to the N variants with the highest variant-to-phenotype scores.
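As a non-limiting illustration of the top N accuracy metric described above, the following Python sketch ranks each organism's variants by score and checks whether the clinically identified variant falls within the N highest-scoring variants. The function name, the variant identifiers, and the scores are hypothetical and do not reproduce the experimental data of FIG. 9A.

```python
def top_n_accuracy(cases, n):
    """Percentage of cases whose clinically confirmed variant is among
    the n variants with the highest scores."""
    hits = 0
    for scores, confirmed in cases:
        # Rank variant identifiers from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if confirmed in ranked[:n]:
            hits += 1
    return 100.0 * hits / len(cases)


# Each case pairs hypothetical per-variant scores with the variant
# identified by clinical genetic analysis for that organism.
cases = [
    ({"v1": 0.9, "v2": 0.4, "v3": 0.1}, "v1"),
    ({"v1": 0.2, "v2": 0.8, "v3": 0.5}, "v3"),
    ({"v1": 0.3, "v2": 0.6, "v3": 0.7}, "v1"),
    ({"v1": 0.1, "v2": 0.2, "v3": 0.95}, "v3"),
]

top1 = top_n_accuracy(cases, 1)  # 50.0
top2 = top_n_accuracy(cases, 2)  # 75.0
top3 = top_n_accuracy(cases, 3)  # 100.0
```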
[0114] Relatedly, FIG. 9B shows comparative experimental results in respective plots corresponding to (i) specific genetic diseases and (ii) groups of related genetic diseases (e.g., diseases with similar phenotypic profiles). More specifically, the plots of FIG. 9B indicate comparative levels of concordance, expressed as a percentage, with a phenotype-affecting nucleotide identified by clinical genetic analysis for the variant most likely to be causative (or otherwise affect expression) of the respective organisms’ phenotypes (labelled as “Top 1 Accuracy (%)” in FIG. 9B). As indicated in FIG. 9B, the size of each circle corresponds to a number of probands analyzed, whereas the orthogonal lines accompanying each circle represent a 95% error bar for each plot entry.
[0115] As demonstrated by the experimental results portrayed in FIGS. 9A-9B, the variant-to-phenotype prediction system 106 more accurately identifies variant nucleotides that affect expression of phenotypes (e.g., diagnostic variants) compared to the existing “Exomiser” system. As shown in FIG. 9A, for example, the variant-to-phenotype prediction system 106 identified the top 3 diagnostic variants in 98.39% concordance with clinical genetic analysis results, whereas the existing system only achieved a concordance of 90.1%. Furthermore, as indicated in both FIGS. 9A and 9B, the variant-to-phenotype prediction system 106 consistently outperforms the existing system in terms of the percent of phenotype-affecting variant nucleotides accurately identified by the respective systems.
[0116] Moreover, in some embodiments, the accuracy of the variant-to-phenotype prediction system 106 is further improved when implementing genetic information of organisms related to the target organism (e.g., by the family analysis 326 as discussed above in relation to FIG. 3). To illustrate, FIG. 10 shows comparative experimental results of the variant-to-phenotype prediction system 106 determining diagnostic variants for target organisms based on (i) analysis of individual organisms’ samples without genetic information from relatives of the respective organisms (labelled “singleton” in FIG. 10) and (ii) analysis of the individual organisms’ samples with additional genetic information from relatives of the respective organisms (labelled “family” in FIG. 10), for both dominant variant genes and recessive variant genes. Indeed, as illustrated in FIG. 10, the variant-to-phenotype prediction system 106 can achieve further increased accuracy in identifying diagnostic variants or other phenotype-affecting variant nucleotides by considering genetic information for related organisms (e.g., based on one or more sets of variant-to-phenotype scores for one or more organisms related to the target organism in addition to variant-to-phenotype scores for the target organism).
[0117] As mentioned, in some embodiments, the variant-to-phenotype prediction system 106 provides metrics measuring the confidence in output results by quantifying a target model’s false discovery rate (FDR) using a cohort of organisms diagnosed with genetic diseases or phenotypes (e.g., a rare disease cohort of input samples) to identify a true positive rate (TPR) and using a cohort of organisms not diagnosed with a target genetic disease or phenotype (e.g., a healthy cohort of input samples) to identify a false positive rate (FPR). For example, FIG. 11 illustrates a method of determining a false discovery rate 1112 of determining diagnostic variants utilizing the variant-to-phenotype prediction system 106. While FIG. 11 shows a method of determining a false discovery rate for the variant-to-phenotype prediction system 106, the methods described in relation to FIG. 11 can be implemented to determine a false discovery rate for systems or models other than the variant-to-phenotype prediction system 106, including models or systems supplemented or aided by human genetic experts.
[0118] As shown in FIG. 11, for example, the variant-to-phenotype prediction system 106 identifies diagnostic variants for one or more genetic-disease cohort(s) 1102 of organisms diagnosed with one or more target genetic diseases according to one or more embodiments (e.g., as described above in relation to FIG. 6). In some embodiments, for example, the genetic-disease cohort(s) 1102 comprise one or more groups of individuals with resolved genetic analyses confirming respective diagnoses for specific genetic diseases, such as, but not limited to, a cohort of resolved probands from the Genomics England (GEL) 100,000 Genomes Project. Further, in the same or other embodiments, one or more of the genetic-disease cohort(s) 1102 comprise individuals with confirmed diagnoses for specific genetic diseases but do not necessarily have resolved genetic analysis identifying causative genes. As illustrated, the variant-to-phenotype prediction system 106 classifies diagnostic variants identified for organisms of the genetic-disease cohort(s) 1102 as true positives to determine a true-positive rate 1106 for the model.
[0119] As also shown in FIG. 11, the variant-to-phenotype prediction system 106 further identifies diagnostic variants for one or more control cohort(s) 1104 of organisms not diagnosed with the one or more target genetic diseases (e.g., organisms confirmed as undiagnosed but assigned target diseases as described above in relation to FIG. 8). In one or more embodiments, for example, the control cohort(s) 824 comprise one or more groups of individuals with resolved genetic analyses that are unaffected and / or not diagnosed with a particular phenotype or disease. The control cohort(s) 824 include, but are not limited to, a cohort of resolved genomes from the UK Biobank (UKBB) database and / or unaffected family members of diagnosed probands and / or control specimens from the GEL 100,000 Genomes Project. Such a control cohort of resolved genomes from the UKBB database could include, for instance, a random sampling of individuals from a general population (e.g., with data stored in the UKBB database) not diagnosed with a particular phenotype or disease (e.g., a target rare disease). As illustrated, the variant-to-phenotype prediction system 106 classifies the diagnostic variants identified for organisms of the control cohort(s) 1104 as false positives to determine a false-positive rate 1108 for the model.
[0120] As further illustrated in FIG. 11, the variant-to-phenotype prediction system 106 determines the false discovery rate 1112 by determining a ratio of the true-positive rate 1106 and the false-positive rate 1108. However, in some embodiments, the variant-to-phenotype prediction system 106 filters diagnostic variants according to a diagnosis score threshold 1110 (e.g., removing diagnostic variants with variant-to-phenotype scores below the diagnosis score threshold 1110) when determining the true-positive rate 1106 and / or the false-positive rate 1108 for the false discovery rate 1112. For example, in some cases, the variant-to-phenotype prediction system 106 estimates the true-positive rate 1106 at the diagnosis score threshold 1110 based on the proportion of individuals from the genetic-disease cohort(s) 1102 that exhibit at least one variant that scores above the diagnosis score threshold 1110. Subsequently, in certain embodiments, at the same value for the diagnosis score threshold 1110, the variant-to-phenotype prediction system 106 estimates the false-positive rate 1108 by calculating the proportion of population controls (e.g., a cohort of resolved genomes from UKBB) that exhibit at least one variant that scores above the diagnosis score threshold 1110.
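As a non-limiting illustration, the following Python sketch estimates a true-positive rate and a false-positive rate at a given diagnosis score threshold as the proportion of individuals in each cohort with at least one variant scoring above the threshold. The description above does not specify the exact form of the ratio used for the false discovery rate, so the formula FPR / (FPR + TPR) in this sketch is an assumption that weights the two cohorts equally; the function names and example scores are likewise hypothetical.

```python
def rate_above_threshold(cohort, threshold):
    """Proportion of individuals with at least one variant scoring above the threshold."""
    flagged = sum(1 for scores in cohort if any(s > threshold for s in scores))
    return flagged / len(cohort)


def estimate_fdr(disease_cohort, control_cohort, threshold):
    """Return (TPR, FPR, FDR) at the given diagnosis score threshold."""
    tpr = rate_above_threshold(disease_cohort, threshold)
    fpr = rate_above_threshold(control_cohort, threshold)
    if tpr + fpr == 0:
        # No individual in either cohort exceeds the threshold; FDR is undefined.
        return tpr, fpr, None
    # ASSUMPTION: equal weighting of the two cohorts; the source only states
    # that the FDR is based on a ratio of the TPR and the FPR.
    return tpr, fpr, fpr / (fpr + tpr)


# Hypothetical per-individual variant-to-phenotype scores.
disease = [[0.9, 0.2], [0.7], [0.3, 0.1], [0.95, 0.6]]
control = [[0.1], [0.85], [0.2, 0.3], [0.05]]

tpr, fpr, fdr = estimate_fdr(disease, control, threshold=0.5)
```

Raising the threshold in this sketch generally lowers both rates, which corresponds to trading diagnostic yield against false discoveries.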
[0121] As mentioned, in certain embodiments, the variant-to-phenotype prediction system 106 exhibits improved diagnostic performance relative to previous models in terms of false discovery rates of identifying diagnostic variants or other phenotype-affecting variant nucleotides. To illustrate, FIGS. 12-13 show comparative experimental results of identifying diagnostic variants for respective genetic diseases (e.g., by analyzing genomic samples for organisms diagnosed with the respective genetic diseases). For instance, FIG. 12 provides various graph plots corresponding to different genetic diseases, each plot indicating a diagnostic yield in relation to a false discovery rate of identifying diagnostic variants utilizing (i) the variant-to-phenotype prediction system 106 (labelled as “VPP System 106” in FIG. 12) and (ii) the existing system known as “Exomiser” (labelled as “Existing System” in FIG. 12). Indeed, the variant-to-phenotype prediction system 106 consistently exhibits an increased diagnostic yield at a decreased false discovery rate relative to the existing system, as indicated by the graph plots of FIG. 12. As shown in the graph plots of FIG. 12, the y-axes show diagnostic yield in terms of percentages (e.g., 0%, 25%, 50%, 75%) and the x-axes show false discovery rates in terms of percentages (e.g., 0%, 10%, 30%, 40%, 50%).
[0122] To further illustrate, FIG. 13 shows additional experimental results of identifying diagnostic variants utilizing the variant-to-phenotype prediction system 106. Specifically, FIG. 13 includes (i) a bar graph indicating a number of cases for which the variant-to-phenotype prediction system 106 newly identified diagnostic variants, (ii) a graph plot of false discovery rates of using the variant-to-phenotype prediction system 106 to identify diagnostic variants in previously unsolved cases (labelled as “GEL Unsolved” in FIG. 13) in relation to false discovery rates of using the variant-to-phenotype prediction system 106 to identify diagnostic variants in cases already solved by GEL (labelled as “GEL Solved” in FIG. 13), and (iii) a bar graph of diagnostic yields exhibited by GEL and / or the variant-to-phenotype prediction system 106 (labelled as “VPP System” in FIG. 13).
[0123] As shown in FIG. 13, the variant-to-phenotype prediction system 106 identifies a significant number of diagnostic variants in previously unsolved cases. As indicated in the bar graph on the left side and the graph plot in the middle of FIG. 13, for example, the variant-to-phenotype prediction system 106 solved approximately 4,500 cases for which diagnostic variants have not been identified by clinical genetic research with a false discovery rate of 5% and under 2,000 cases with a false discovery rate of less than 1%. Further, as shown in the bar graph on the right side of FIG. 13, the variant-to-phenotype prediction system 106 exhibits an increased diagnostic yield relative to the number of GEL solved cases. Indeed, as demonstrated by the experimental results indicated in FIG. 13, the variant-to-phenotype prediction system 106 exhibits improved diagnostic performance relative to previous models.
[0124] Turning now to FIG. 14, this figure illustrates an example flowchart of a series of acts for generating variant-to-phenotype scores for variant nucleotides of a genomic sample according to one or more embodiments. While FIG. 14 illustrates acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and / or modify any of the acts shown in FIG. 14. The acts of FIG. 14 can be performed as part of a method. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIG. 14. In still further embodiments, a system can comprise at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIG. 14.
[0125] As shown in FIG. 14, the series of acts 1400 includes an act 1402 of identifying, from a set of genes given labels for one or more phenotypes, variant genes comprising variant nucleotides within a genomic sample of an organism, an act 1404 of accessing gene-level features determined for one or more genes of the set of genes in relation to a reference genome of the organism, an act 1406 of accessing gene-to-phenotype scores indicating respective probabilities of the set of genes being associated with the one or more phenotypes, and an act 1408 of generating, utilizing a variant-to-phenotype machine-learning model, variant-to-phenotype scores indicating respective probabilities of the variant nucleotides affecting expression of the one or more phenotypes in the organism of the genomic sample.
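By way of illustration only, the acts 1402-1408 can be sketched in Python as follows, using a logistic function as a stand-in for the variant-to-phenotype machine-learning model. The gene names, variant identifiers, feature values, weights, and bias are hypothetical placeholders chosen for this sketch and are not learned parameters or data from the described embodiments.

```python
import math


def identify_variant_genes(sample_variants, labeled_genes):
    """Act 1402 (sketch): keep variants that fall in genes labeled for the phenotype(s)."""
    return {v: g for v, g in sample_variants.items() if g in labeled_genes}


def variant_to_phenotype_scores(variant_genes, gene_features, gene_to_phenotype,
                                weights, bias):
    """Acts 1404-1408 (sketch): combine gene-level features and gene-to-phenotype
    scores into one probability-like score per variant via a logistic function."""
    scores = {}
    for variant, gene in variant_genes.items():
        inputs = gene_features[gene] + [gene_to_phenotype[gene]]
        z = bias + sum(w * x for w, x in zip(weights, inputs))
        scores[variant] = 1.0 / (1.0 + math.exp(-z))
    return scores


# Hypothetical inputs: variant -> gene, and the genes labeled for the phenotype(s).
sample_variants = {"var_1": "GENE_A", "var_2": "GENE_B", "var_3": "GENE_C"}
labeled_genes = {"GENE_A", "GENE_B"}

# Hypothetical gene-level features, e.g. [loss-of-function intolerance, normalized length].
gene_features = {"GENE_A": [0.9, 0.2], "GENE_B": [0.1, 0.5]}
gene_to_phenotype = {"GENE_A": 0.8, "GENE_B": 0.1}

# Placeholder parameters standing in for a trained model.
weights, bias = [2.0, -0.5, 3.0], -2.0

variant_genes = identify_variant_genes(sample_variants, labeled_genes)
scores = variant_to_phenotype_scores(variant_genes, gene_features,
                                     gene_to_phenotype, weights, bias)
```

A logistic form is used here only because it is the simplest of the model types contemplated; a decision tree, neural network, or random forest could occupy the same position in the sketch.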
[0126] For example, the series of acts 1400 can include acts to perform any of the operations described in the following clauses:
CLAUSE 1. A method comprising: identifying, from a set of genes given labels for one or more phenotypes, variant genes comprising variant nucleotides within a genomic sample of an organism; accessing gene-level features determined for one or more genes of the set of genes in relation to a reference genome of the organism; accessing gene-to-phenotype scores indicating respective probabilities of the set of genes being associated with the one or more phenotypes; and generating, utilizing a variant-to-phenotype machine-learning model to process the gene-level features and the gene-to-phenotype scores, a set of variant-to-phenotype scores indicating respective probabilities of the variant nucleotides affecting expression of the one or more phenotypes in the organism of the genomic sample.
CLAUSE 2. The method of clause 1, further comprising accessing variant-level features associated with the variant nucleotides identified within the variant genes of the genomic sample; and generating the set of variant-to-phenotype scores further based on the variant-to-phenotype machine-learning model processing the variant-level features.
CLAUSE 3. The method of clause 2, wherein the variant-level features comprise one or more of: pathogenicity scores indicating a degree to which the variant nucleotides or variant proteins corresponding to the variant nucleotides are benign or pathogenic; splice-site scores indicating a probability of one or more of the variant nucleotides being part of a splice site for nucleotide sequences encoding precursor messenger ribonucleic acid (pre-mRNA) or other RNA; allele frequencies corresponding to the variant nucleotides; genotypes for the variant genes of the genomic sample; loss-of-function statuses indicating that the variant nucleotides reduce or destroy protein function; de novo statuses of the variant nucleotides as private variants; enhancer scores indicating a probability of one or more of the variant nucleotides being part of an enhancer gene sequence that regulates gene expression; promoter sequence scores indicating a probability of one or more of the variant nucleotides being part of a promoter sequence that regulates gene expression; or untranslated region scores indicating a degree to which variant nucleotides within untranslated regions of respective genes are benign or pathogenic.
CLAUSE 4. The method of any of clauses 1-3, wherein the gene-level features for the one or more genes of the set of genes comprise one or more of: probabilities of the one or more genes being loss-of-function intolerant; lengths of the one or more genes within the reference genome; average levels of messenger ribonucleic acid (mRNA) expression of the one or more genes across different tissues; or probabilities of the one or more genes being recessive for inheritance.
CLAUSE 5. The method of any of clauses 1-4, wherein the given labels for the one or more phenotypes are determined for, assigned to, or otherwise associated with the organism.
CLAUSE 6. The method of any of clauses 1-5, further comprising: determining the one or more phenotypes for the organism based on identifiable characteristics of the organism; and identifying the set of genes given labels for the one or more phenotypes determined for the organism utilizing a disease panel imputation model to process the one or more phenotypes.
CLAUSE 7. The method of any of clauses 1-6, further comprising determining, based on the set of variant-to-phenotype scores, the genomic sample of the organism comprises one or more diagnostic variants associated with a phenotype of the one or more phenotypes.
CLAUSE 8. The method of clause 7, further comprising: determining an inheritance pattern of the variant nucleotides within the genomic sample of the organism; and determining the genomic sample of the organism comprises the one or more diagnostic variants based on the set of variant-to-phenotype scores and the inheritance pattern.
CLAUSE 9. The method of clause 7, further comprising determining the genomic sample of the organism comprises the one or more diagnostic variants based on the set of variant-to-phenotype scores for the organism and one or more additional sets of variant-to-phenotype scores for one or more additional organisms related to the organism.
CLAUSE 10. The method of any of clauses 1-9, wherein the variant-to-phenotype machine-learning model comprises a logistic regression model, a linear regression model, a decision tree, a neural network, or a random forest model.
CLAUSE 11. The method of any of clauses 1-10, wherein the variant-to-phenotype machine-learning model is trained utilizing genomic samples and clinical data from a cohort of organisms diagnosed with genetic diseases or phenotypes.
CLAUSE 12. The method of clause 11, wherein the variant-to-phenotype machine-learning model is further trained by: generating, for genomic samples of a cohort of organisms not diagnosed with a target genetic disease, variant-to-phenotype-specific score distributions of variant-to-phenotype scores indicating respective probabilities of variant nucleotides within the genomic samples affecting expression of one or more phenotypes associated with the target genetic disease within the cohort of organisms; and adjusting parameters of the variant-to-phenotype machine-learning model based on the variant-to-phenotype-specific score distributions.
CLAUSE 13. The method of any of clauses 1-12, further comprising: accessing, for the one or more phenotypes, distribution features for one or more variant-to-phenotype-specific score distributions indicating respective distributions of variant-to-phenotype scores generated for a cohort of organisms not diagnosed with target genetic diseases; and generating the set of variant-to-phenotype scores further based on the variant-to-phenotype machine-learning model processing the distribution features for the one or more variant-to-phenotype-specific score distributions.
CLAUSE 14. The method of any of clauses 1-13, further comprising: determining a true positive rate of identifying a diagnostic variant utilizing the variant-to-phenotype machine-learning model based on processing samples from a cohort of organisms diagnosed with one or more target genetic diseases; determining a false positive rate of identifying the diagnostic variant utilizing the variant-to-phenotype machine-learning model based on processing samples from a cohort of organisms not diagnosed with the one or more target genetic diseases; and determining a false discovery rate for the diagnostic variant based on the true positive rate and the false positive rate for the diagnostic variant.
CLAUSE 15. The method of any of clauses 1-14, further comprising generating a patient genetic report comprising one or more of: the set of variant-to-phenotype scores for the genomic sample; one or more diagnostic variants associated with a phenotype of the one or more phenotypes and determined based on the set of variant-to-phenotype scores; or one or more false discovery rates associated with the one or more diagnostic variants.
CLAUSE 16. The method of clause 15, further comprising: providing, via a chatbot user interface of a client device, a machine learning-based conversational agent for interacting with the patient genetic report and clinical information related to the one or more phenotypes; receiving, from the client device through the chatbot user interface, an input requesting data related to one or more of the patient genetic report or the clinical information; generating, utilizing the machine learning-based conversational agent, a response to the input; and providing the response to the client device through the chatbot user interface.
CLAUSE 17. The method of any of clauses 1-16, wherein the gene-to-phenotype scores are generated for the set of genes by a gene embedding neural network comprising parameters learned from one or more gene-to-gene graphs.
CLAUSE 18. The method of any of clauses 1-17, further comprising generating a gene-to-phenotype matrix indicating probabilities of the set of genes being associated with individual phenotype labels corresponding to the one or more phenotypes.
CLAUSE 19. The method of clause 18, further comprising: determining dissimilarity weights indicating respective measures of diversity between individual phenotype labels corresponding to the one or more phenotypes; and applying the dissimilarity weights to the gene-to-phenotype matrix to determine individual gene-to-phenotype scores of the gene-to-phenotype scores for respective individual genes of the set of genes.
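As one hypothetical illustration of clauses 18 and 19, the following Python sketch derives dissimilarity weights from the mean Jaccard distance between phenotype labels and applies them to a gene-to-phenotype matrix as a weighted sum. The use of Jaccard distance, the representation of each label as a non-empty set of ontology-like terms, the normalization of the weights, and all names and values are assumptions made for this sketch; the clauses do not prescribe a particular dissimilarity measure.

```python
def jaccard_distance(a, b):
    """1 minus the overlap of two non-empty term sets (assumed dissimilarity measure)."""
    return 1.0 - len(a & b) / len(a | b)


def dissimilarity_weights(label_terms):
    """Weight each phenotype label by its mean distance to the other labels,
    normalized so the weights sum to 1."""
    labels = list(label_terms)
    if len(labels) == 1:
        return {labels[0]: 1.0}
    raw = {}
    for label in labels:
        dists = [jaccard_distance(label_terms[label], label_terms[other])
                 for other in labels if other != label]
        raw[label] = sum(dists) / len(dists)
    total = sum(raw.values())
    if total == 0:
        # All labels identical: fall back to uniform weights.
        return {label: 1.0 / len(labels) for label in labels}
    return {label: value / total for label, value in raw.items()}


def gene_scores(matrix, weights):
    """Apply the weights to each row of the gene-to-phenotype matrix."""
    return {gene: sum(weights[label] * p for label, p in row.items())
            for gene, row in matrix.items()}


# Hypothetical phenotype labels, each described by a set of related terms.
label_terms = {
    "seizure": {"nervous", "seizure"},
    "epilepsy": {"nervous", "seizure", "epilepsy"},
    "short_stature": {"growth", "short_stature"},
}
# Hypothetical gene-to-phenotype matrix (gene -> label -> probability).
matrix = {
    "GENE_A": {"seizure": 0.9, "epilepsy": 0.9, "short_stature": 0.0},
    "GENE_B": {"seizure": 0.0, "epilepsy": 0.0, "short_stature": 0.9},
}

weights = dissimilarity_weights(label_terms)
per_gene = gene_scores(matrix, weights)
```

In this sketch, the two closely related labels each receive a smaller weight than the distinct label, so that redundant phenotype labels do not dominate the per-gene score.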
[0127] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.
[0128] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
[0129] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).
[0130] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
[0131] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) "A sequencing method based on real-time pyrophosphate." Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of nucleotides at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
[0132] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04 / 018497 and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91 / 06678 and WO 07 / 123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label can be cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.
[0133] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
[0134] In particular embodiments some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators / cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and / or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.
[0135] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Patent Application Publication No. 2007 / 0166705, U.S. Patent Application Publication No. 2006 / 0188901, U.S. Pat. No. 7,057,026, U.S. Patent Application Publication No. 2006 / 0240439, U.S. Patent Application Publication No. 2006 / 0281109, PCT Publication No. WO 05 / 065814, U.S. Patent Application Publication No. 2005 / 0100900, PCT Publication No. WO 06 / 064199, PCT Publication No. WO 07 / 010,251, U.S. Patent Application Publication No. 2012 / 0270305 and U.S. Patent Application Publication No. 2013 / 0260372, the disclosures of which are incorporated herein by reference in their entireties.
[0136] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U.S. Patent Application Publication No. 2013 / 0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and / or second excitation wavelength) and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
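By way of illustration only, the two-channel configuration described above can be sketched in Python as a lookup from per-channel detection to a base call. The function names, the normalized intensity inputs, and the 0.5 detection threshold are assumptions made for this sketch and do not come from the incorporated materials; only the channel assignments follow the dATP / dCTP / dTTP / dGTP example given above.

```python
# Illustrative sketch only: a hypothetical two-channel base caller.
# The names and the 0.5 threshold are assumptions, not part of the disclosure.
from typing import Iterable, Tuple

# (detected in first channel, detected in second channel) -> called base,
# following the exemplary dATP / dCTP / dTTP / dGTP assignment in the text.
CHANNEL_CODE = {
    (True, False): "A",   # dATP: detected in the first channel only
    (False, True): "C",   # dCTP: detected in the second channel only
    (True, True): "T",    # dTTP: detected in both channels
    (False, False): "G",  # dGTP: unlabeled, not detected in either channel
}


def call_base(ch1: float, ch2: float, threshold: float = 0.5) -> str:
    """Call one base from normalized intensities in the two channels."""
    return CHANNEL_CODE[(ch1 >= threshold, ch2 >= threshold)]


def call_read(intensities: Iterable[Tuple[float, float]],
              threshold: float = 0.5) -> str:
    """Call a read from one (channel 1, channel 2) intensity pair per cycle."""
    return "".join(call_base(a, b, threshold) for a, b in intensities)


if __name__ == "__main__":
    cycles = [(0.9, 0.1), (0.1, 0.8), (0.9, 0.9), (0.05, 0.02)]
    print(call_read(cycles))
```

In this sketch a feature that is dark in both channels is called as G, so a practical implementation would also need some way to distinguish an unlabeled incorporation from a missing signal; that step is outside the scope of the sketch.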
[0137] Further, as described in the incorporated materials of U.S. Patent Application Publication No. 2013 / 0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.
[0138] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U.S. Pat. No. 6,969,488, U.S. Pat. No. 6,172,218, and U.S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.
[0139] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis". Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope" Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.
[0140] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U.S. Pat. No. 7,329,492 and U.S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U.S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U.S. Pat. No. 7,405,281 and U.S. Patent Application Publication No. 2008 / 0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1526-1528 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.
[0141] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009 / 0026082 A1; US 2009 / 0127589 A1; US 2010 / 0137143 A1; or US 2010 / 0282617 A1, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.
[0142] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids are typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle, or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.
[0143] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features / cm², 100 features / cm², 500 features / cm², 1,000 features / cm², 5,000 features / cm², 10,000 features / cm², 50,000 features / cm², 100,000 features / cm², 1,000,000 features / cm², 5,000,000 features / cm², or higher.
[0144] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and / or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like. A flow cell can be configured and / or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010 / 0111768 A1 and US Ser. No. 13 / 273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13 / 273,666, which is incorporated herein by reference. The sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device.
[0145] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and / or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA-containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of, DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.
[0146] The components of the variant-to-phenotype prediction system 106 can include software, hardware, or both. For example, the components of the variant-to-phenotype prediction system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 102 or the server device(s) 108). When executed by the one or more processors, the computer-executable instructions of the variant-to-phenotype prediction system 106 can cause the computing devices to perform the methods described herein. Alternatively, the components of the variant-to-phenotype prediction system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the variant-to-phenotype prediction system 106 can include a combination of computer-executable instructions and hardware.
[0147] Furthermore, the components of the variant-to-phenotype prediction system 106 performing the functions described herein with respect to the variant-to-phenotype prediction system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and / or as a cloud-computing model. Thus, components of the variant-to-phenotype prediction system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the variant-to-phenotype prediction system 106 may be implemented in any application that provides sequencing services including, but not limited to, Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight” are either registered trademarks or trademarks of Illumina, Inc. in the United States and / or other countries.
[0148] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and / or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions from a non-transitory computer-readable medium (e.g., a memory, etc.) and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.
[0149] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.
[0150] Non-transitory computer-readable storage media (devices) include RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.
[0151] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and / or modules and / or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and / or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.
[0152] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and / or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.
[0153] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and / or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the described features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
[0154] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
[0155] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.
[0156] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.
[0157] FIG. 15 illustrates a block diagram of a computing device 1500 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1500 may implement the variant-to-phenotype prediction system 106 and the sequencing system 116. As shown by FIG. 15, the computing device 1500 can comprise a processor 1502, a memory 1504, a storage device 1506, an I / O interface 1508, and a communication interface 1510, which may be communicatively coupled by way of a communication infrastructure 1515. In certain embodiments, the computing device 1500 can include fewer or more components than those shown in FIG. 15. The following paragraphs describe components of the computing device 1500 shown in FIG. 15 in additional detail.
[0158] In one or more embodiments, the processor 1502 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1502 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1504, or the storage device 1506 and decode and execute them. The memory 1504 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1506 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.
[0159] The I / O interface 1508 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1500. The I / O interface 1508 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I / O devices or a combination of such I / O interfaces. The I / O interface 1508 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I / O interface 1508 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and / or any other graphical content as may serve a particular implementation.
[0160] The communication interface 1510 can include hardware, software, or both. In any event, the communication interface 1510 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1500 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1510 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.
[0161] Additionally, the communication interface 1510 may facilitate communications with various types of wired or wireless networks. The communication interface 1510 may also facilitate communications using various communication protocols. The communication infrastructure 1515 may also include hardware, software, or both that couples components of the computing device 1500 to each other. For example, the communication interface 1510 may use one or more networks and / or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.
[0162] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.
[0163] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps / acts or the steps / acts may be performed in differing orders. Additionally, the steps / acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps / acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Claims
We Claim:
1. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: identify, from a set of genes given labels for one or more phenotypes, variant genes comprising variant nucleotides within a genomic sample of an organism; access gene-level features determined for one or more genes of the set of genes in relation to a reference genome of the organism; access gene-to-phenotype scores indicating respective probabilities of the set of genes being associated with the one or more phenotypes; and generate, utilizing a variant-to-phenotype machine-learning model to process the gene-level features and the gene-to-phenotype scores, a set of variant-to-phenotype scores indicating respective probabilities of the variant nucleotides affecting expression of the one or more phenotypes in the organism of the genomic sample.
2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: access variant-level features associated with the variant nucleotides identified within the variant genes of the genomic sample; and generate the set of variant-to-phenotype scores further based on the variant-to-phenotype machine-learning model processing the variant-level features.
3. The system of claim 2, wherein the variant-level features comprise one or more of: pathogenicity scores indicating a degree to which the variant nucleotides or variant proteins corresponding to the variant nucleotides are benign or pathogenic; splice-site scores indicating a probability of one or more of the variant nucleotides being part of a splice site for nucleotide sequences encoding precursor messenger ribonucleic acid (pre-mRNA) or other RNA; allele frequencies corresponding to the variant nucleotides; genotypes for the variant genes of the genomic sample; loss-of-function statuses indicating that the variant nucleotides reduce or destroy protein function; de novo statuses of the variant nucleotides as private variants; enhancer scores indicating a probability of one or more of the variant nucleotides being part of an enhancer gene sequence that regulates gene expression; promoter sequence scores indicating a probability of one or more of the variant nucleotides being part of a promoter sequence that regulates gene expression; or untranslated region scores indicating a degree to which variant nucleotides within untranslated regions of respective genes are benign or pathogenic.
4. The system of claim 1, wherein the gene-level features for the one or more genes of the set of genes comprise one or more of: probabilities of the one or more genes being loss-of-function intolerant; lengths of the one or more genes within the reference genome; average levels of messenger ribonucleic acid (mRNA) expression of the one or more genes across different tissues; or probabilities of the one or more genes being recessive for inheritance.
5. The system of claim 1, wherein the given labels for the one or more phenotypes are determined for, assigned to, or otherwise associated with the organism.
6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine the one or more phenotypes for the organism based on identifiable characteristics of the organism; and identify the set of genes given labels for the one or more phenotypes determined for the organism utilizing a disease panel imputation model to process the one or more phenotypes.
7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine, based on the set of variant-to-phenotype scores, the genomic sample of the organism comprises one or more diagnostic variants associated with a phenotype of the one or more phenotypes.
8. The system of claim 7, further comprising instructions that, when executed by the at least one processor, cause the system to: determine an inheritance pattern of the variant nucleotides within the genomic sample of the organism; and determine the genomic sample of the organism comprises the one or more diagnostic variants based on the set of variant-to-phenotype scores and the inheritance pattern.
9. The system of claim 7, further comprising instructions that, when executed by the at least one processor, cause the system to determine the genomic sample of the organism comprises the one or more diagnostic variants based on the set of variant-to-phenotype scores for the organism and one or more additional sets of variant-to-phenotype scores for one or more additional organisms related to the organism.
10. The system of claim 1, wherein the variant-to-phenotype machine-learning model comprises a logistic regression model, a linear regression model, a decision tree, a neural network, or a random forest model.
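As an illustration of the logistic regression option named in claim 10, the following minimal sketch combines gene-level features and a gene-to-phenotype score into a single variant-to-phenotype probability. The function name, the feature names, and the weight values are hypothetical placeholders chosen for the example; they are not taken from the application and do not represent trained parameters.

```python
import math


def variant_to_phenotype_score(features, weights, bias=0.0):
    """Return a probability in (0, 1) from a weighted sum of features.

    `features` and `weights` are dicts keyed by feature name. The names
    used below are illustrative assumptions, not the application's schema.
    """
    # Linear combination of the input features (the logit).
    z = bias + sum(weights[name] * value for name, value in features.items())
    # Logistic (sigmoid) function maps the logit to a probability.
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights; a real model would learn these from cohort data.
example_weights = {
    "gene_to_phenotype_score": 3.0,
    "lof_intolerance": 1.5,
    "pathogenicity": 2.0,
}

strong_candidate = {
    "gene_to_phenotype_score": 0.9,
    "lof_intolerance": 0.8,
    "pathogenicity": 0.95,
}
weak_candidate = {
    "gene_to_phenotype_score": 0.1,
    "lof_intolerance": 0.2,
    "pathogenicity": 0.05,
}

high = variant_to_phenotype_score(strong_candidate, example_weights, bias=-3.0)
low = variant_to_phenotype_score(weak_candidate, example_weights, bias=-3.0)
```

Under these assumed weights, the variant with stronger gene-level and variant-level evidence receives the higher score. Any of the other model families listed in the claim could stand in for the logistic function here.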
11. The system of claim 1, wherein the variant-to-phenotype machine-learning model is trained utilizing genomic samples and clinical data from a cohort of organisms diagnosed with genetic diseases or phenotypes.
12. The system of claim 11, wherein the variant-to-phenotype machine-learning model is further trained by: generating, for genomic samples of a cohort of organisms not diagnosed with a target genetic disease, variant-to-phenotype-specific score distributions of variant-to-phenotype scores indicating respective probabilities of variant nucleotides within the genomic samples affecting expression of one or more phenotypes associated with the target genetic disease within the cohort of organisms; and adjusting parameters of the variant-to-phenotype machine-learning model based on the variant-to-phenotype-specific score distributions.
13. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: access, for the one or more phenotypes, distribution features for one or more variant-to-phenotype-specific score distributions indicating respective distributions of variant-to-phenotype scores generated for a cohort of organisms not diagnosed with target genetic diseases; and generate the set of variant-to-phenotype scores further based on the variant-to-phenotype machine-learning model processing the distribution features for the one or more variant-to-phenotype-specific score distributions.
14. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine a true positive rate of identifying a diagnostic variant utilizing the variant-to-phenotype machine-learning model based on processing samples from a cohort of organisms diagnosed with one or more target genetic diseases; determine a false positive rate of identifying the diagnostic variant utilizing the variant-to-phenotype machine-learning model based on processing samples from a cohort of organisms not diagnosed with the one or more target genetic diseases; and determine a false discovery rate for the diagnostic variant based on the true positive rate and the false positive rate for the diagnostic variant.
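One way to derive a false discovery rate from a true positive rate and a false positive rate, as recited in claim 14, is sketched below. The claim does not state a formula, so this uses the standard definition FDR = FP / (FP + TP). The `prevalence` parameter (the assumed fraction of tested organisms that carry a diagnostic variant) is an assumption added for the example, as is the function name; the default of 0.5 corresponds to diagnosed and undiagnosed cohorts of equal size.

```python
def false_discovery_rate(tpr, fpr, prevalence=0.5):
    """Estimate FDR = FP / (FP + TP) from cohort-level rates.

    tpr: true positive rate measured on the diagnosed cohort.
    fpr: false positive rate measured on the undiagnosed cohort.
    prevalence: assumed fraction of true cases among tested organisms
        (an illustrative parameter, not specified in the claims).
    """
    if not (0.0 <= tpr <= 1.0 and 0.0 <= fpr <= 1.0):
        raise ValueError("tpr and fpr must lie in [0, 1]")
    if not (0.0 < prevalence < 1.0):
        raise ValueError("prevalence must lie strictly between 0 and 1")

    # Expected share of all tested organisms that are true / false positives.
    expected_true_positives = tpr * prevalence
    expected_false_positives = fpr * (1.0 - prevalence)

    total_positives = expected_true_positives + expected_false_positives
    if total_positives == 0.0:
        # No positive calls at all, so no false discoveries.
        return 0.0
    return expected_false_positives / total_positives
```

With equal-sized cohorts the expression reduces to FPR / (FPR + TPR). A lower assumed prevalence raises the estimated false discovery rate for the same pair of rates.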
15. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate a patient genetic report comprising one or more of: the set of variant-to-phenotype scores for the genomic sample; one or more diagnostic variants associated with a phenotype of the one or more phenotypes and determined based on the set of variant-to-phenotype scores; or one or more false discovery rates associated with the one or more diagnostic variants.
16. The system of claim 15, further comprising instructions that, when executed by the at least one processor, cause the system to: provide, via a chatbot user interface of a client device, a machine learning-based conversational agent for interacting with the patient genetic report and clinical information related to the one or more phenotypes; receive, from the client device through the chatbot user interface, an input requesting data related to one or more of the patient genetic report or the clinical information; generate, utilizing the machine learning-based conversational agent, a response to the input; and provide the response to the client device through the chatbot user interface.
17. The system of claim 1, wherein the gene-to-phenotype scores are generated for the set of genes by a gene embedding neural network comprising parameters learned from one or more gene-to-gene graphs.
18. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate a gene-to-phenotype matrix indicating probabilities of the set of genes being associated with individual phenotype labels corresponding to the one or more phenotypes.
19. The system of claim 18, further comprising instructions that, when executed by the at least one processor, cause the system to: determine dissimilarity weights indicating respective measures of diversity between individual phenotype labels corresponding to the one or more phenotypes; and apply the dissimilarity weights to the gene-to-phenotype matrix to determine individual gene-to-phenotype scores of the gene-to-phenotype scores for respective individual genes of the set of genes.
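The weighting step of claim 19 can be pictured with the following minimal sketch. The claims do not specify how dissimilarity weights are computed, so the scheme here is an assumption made for illustration: each phenotype label is weighted by the inverse of its total similarity to all labels, so that near-duplicate labels share weight and distinct labels count more. The pairwise similarity matrix, the function names, and the example values are all hypothetical.

```python
def dissimilarity_weights(similarity):
    """Weight each phenotype label by how distinct it is from the others.

    similarity[j][k] is an assumed pairwise similarity in [0, 1] between
    phenotype labels j and k, with similarity[j][j] == 1.
    Returns weights that sum to 1.
    """
    # A label that resembles many others gets a small raw weight.
    raw = [1.0 / sum(row) for row in similarity]
    total = sum(raw)
    return [value / total for value in raw]


def gene_to_phenotype_scores(matrix, weights):
    """Collapse a genes-by-labels probability matrix to one score per gene."""
    return [
        sum(prob * weight for prob, weight in zip(row, weights))
        for row in matrix
    ]


# Labels 0 and 1 are treated as redundant; label 2 is distinct.
label_similarity = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Rows are genes, columns are phenotype labels (illustrative values).
gene_label_matrix = [
    [0.8, 0.8, 0.0],  # gene A matches the two redundant labels
    [0.0, 0.0, 0.8],  # gene B matches the single distinct label
]

weights = dissimilarity_weights(label_similarity)
scores = gene_to_phenotype_scores(gene_label_matrix, weights)
```

In this example the two redundant labels together carry the same weight as the one distinct label, so a gene matching both redundant labels is not scored above a gene matching the distinct label equally well.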
20. A computer-implemented method: identifying, from a set of genes given labels for one or more phenotypes, variant genes comprising variant nucleotides within a genomic sample of an organism; accessing gene-level features determined for one or more genes of the set of genes in relation to a reference genome of the organism; accessing gene-to-phenotype scores indicating respective probabilities of the set of genes being associated with the one or more phenotypes; and generating, utilizing a variant-to-phenotype machine-learning model to process the gene-level features and the gene-to-phenotype scores, a set of variant-to-phenotype scores indicating respective probabilities of the variant nucleotides affecting expression of the one or more phenotypes in the organism of the genomic sample.
21. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: identify, from a set of genes given labels for one or more phenotypes, variant genes comprising variant nucleotides within a genomic sample of an organism; access gene-level features determined for one or more genes of the set of genes in relation to a reference genome of the organism; access gene-to-phenotype scores indicating respective probabilities of the set of genes being associated with the one or more phenotypes; and generate, utilizing a variant-to-phenotype machine-learning model to process the gene-level features and the gene-to-phenotype scores, a set of variant-to-phenotype scores indicating respective probabilities of the variant nucleotides affecting expression of the one or more phenotypes in the organism of the genomic sample.