Calling somatic variants from sequencing data

The somatic-variant calling system uses genetic thresholds and phasing factors to accurately detect somatic variants in fluid samples, overcoming limitations of existing systems by reducing DNA input and improving detection accuracy and speed.

WO2026128752A1 · PCT designated stage · Publication Date: 2026-06-18 · ILLUMINA INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: ILLUMINA INC
Filing Date: 2025-12-11
Publication Date: 2026-06-18


Abstract

This disclosure describes methods, non-transitory computer readable media, and systems that use an improved somatic variant calling approach to detect somatic variants in a sample. For example, the disclosed system can generate variant calls for a genomic sample and identify candidate variants satisfying thresholds for variant allele fractions (VAFs). The disclosed system can further identify, from phased reads, candidate variants that segregate into a portion of one parental haplotype. The disclosed system can further evaluate such candidate somatic variants by determining that the VAFs of the potential somatic variants deviate from VAFs of a subset of germline variants. Based on a candidate somatic variant exhibiting a VAF that satisfies threshold VAFs, phasing that segregates into one parental haplotype, and the candidate somatic variant's VAF deviating from a VAF of local germline variants, the disclosed system can more accurately detect a somatic variant.

Description

CALLING SOMATIC VARIANTS FROM SEQUENCING DATA

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/733,377, filed on December 12, 2024, entitled “CALLING SOMATIC VARIANTS FROM SEQUENCING DATA” (IP-2891-PRV), which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining nucleobase calls for reads and variant calls for genomic samples. For instance, some existing sequencing machines and sequencing-data-analysis software (together “existing sequencing systems”) predict individual nucleobases within sequences by using conventional Sanger sequencing or sequencing-by-synthesis (SBS) methods. When using SBS, existing sequencing systems can monitor many millions to billions of oligonucleotides being synthesized in parallel from templates to predict nucleobase calls for growing nucleotide reads. In many existing sequencing systems, a camera captures images of irradiated fluorescent tags incorporated into oligonucleotides. After capturing such images, some existing sequencing systems determine nucleobase calls for nucleotide reads corresponding to the oligonucleotides and send base-call data to a computing device with sequencing-data-analysis software, which aligns nucleotide reads with a reference genome. Based on differences between the aligned nucleotide reads and the reference genome, existing systems can further utilize a variant caller to identify variants of a genomic sample, such as single nucleotide variants (SNVs), insertions or deletions (indels), or other variants within the genomic sample.

[0003] In addition to improved genomic sequencing, existing sequencing systems have also improved methods of identifying somatic and germline variants. Germline variants, inherited from parents, are typically present in every (or nearly every) cell and typically remain consistent across tissue types. In contrast to germline variants, somatic variants often arise in a post-zygotic event or later in life and are often confined to specific tissues or cell types. Somatic variants are often associated with cancers or other diseases. Some existing systems differentiate between somatic and germline variants by comparing genetic data from different sample types and leveraging different bioinformatic algorithms to identify and distinguish between germline and somatic variants. For example, some existing sequencing systems identify somatic variants by comparing tumor or affected tissue samples with matched normal tissue from the same individual. By comparing and contrasting datasets from affected tissue samples versus normal tissue samples, some existing sequencing systems can isolate somatic variants exclusive to affected tissues while excluding germline variants that are present in both types of samples.

[0004] Despite these recent advances, existing sequencing systems face several technical shortcomings in accurately and efficiently identifying somatic variants. In particular, many existing sequencing systems face challenges in accurately identifying somatic variants in fluid samples due to technical and biological limitations. Fluid samples, such as blood and saliva, often contain mixtures of deoxyribonucleic acid (DNA) from both normal and potentially mutated cells, thereby making it difficult to isolate germline DNA from a fluid sample for comparison with tumor DNA from the fluid sample. Many existing sequencing systems cannot accurately identify somatic variants when fluid samples contain a substantial amount of germline DNA from normal blood cells, which dilute the somatic DNA from mutated or cloned cells and can diminish a somatic-DNA-data signal. Furthermore, the number of clone-derived cells can be relatively low in some samples, particularly in early-stage cancers, thereby leading to relatively lower variant allele fractions (VAFs) for candidate somatic variants for which many existing sequencing systems lack models to detect. Failures of existing sequencing systems in identifying somatic variants from cells in early-stage conditions, such as blood cells in a state of clonal hematopoiesis (CH), can significantly impact clinical outcomes for an individual and lead to delayed or less effective treatment.

[0005] Due in part to difficulties in differentiating normal cells from mutated cells that are mixed together in fluid samples, some existing sequencing systems lack models to accurately detect somatic variants present in smaller-sized fluid samples or low-purity or impure samples in which pre-tumorous cells are dispersed. For example, somatic variants often occur in only a subset of cells, thereby leading to a low variant allele fraction (VAF) within a blood, liquid, or other low-purity sample. Somatic variants presenting low VAFs often result in missed variants (e.g., false negatives) or misinterpretations of a variant’s significance. While some existing systems can successfully identify somatic variants that occur in several individuals, these same existing systems may fail to call somatic variants that occur in small numbers (e.g., one to three) of individuals within a population. Many existing sequencing systems further fail to classify variants as somatic if the VAFs of the variants are relatively higher (e.g., > 0.35 or 0.40 VAF) or resemble germline variant VAFs because such somatic variants can seem indistinguishable from germline variants in data generated by existing sequencing systems. For example, some deleterious somatic variants may trigger mutated cells to rapidly proliferate, leading to relatively higher VAFs (e.g., 0.3-0.5) that resemble germline variant VAFs. A relatively higher VAF for a candidate variant may cause many existing systems to erroneously classify variants as germline variants. Accordingly, many existing systems struggle to accurately detect somatic variants within small sample sizes or with VAFs that mimic germline variants.

[0006] To overcome challenges in accurately detecting somatic variants in liquid or impure solid samples, some existing sequencing systems require relatively higher DNA sample material to achieve sufficient sequencing depth to yield more accurate results or to run targeted gene panels that target specific genes known to increase the likelihood of CH. In liquid or impure samples, somatic variants tend to exist on a limited number of nucleotide reads. While many existing systems can successfully identify somatic variants using solid or pure samples, they often struggle to identify somatic variants in liquid or impure samples. Accordingly, many existing sequencing systems fail to identify somatic variants from impure samples without additional DNA from a sample or sample volume. The requirement of higher sample material often results in inflexible and sometimes inaccurate operation while detecting somatic variants. For instance, many existing sequencing systems analyzing DNA from liquid samples require relatively more sample volume to process sufficient genetic material in a sequencing device to ensure that the sequencing device produces nucleotide reads that map to genes or target genomic regions at higher depths, thereby increasing the accuracy or sensitivity of VAF at particular regions. Further, in some cases, existing sequencing systems require two fluid samples (or a larger fluid sample) sufficient to perform different sequencing runs and corresponding variant-call analyses. First, an existing sequencing system may perform one sequencing run on a sequencing device with DNA from a fluid sample (e.g., blood) of a subject and, based on reads from the run, perform germline variant calling for the sample. Second and additionally, an existing sequencing system may perform another sequencing run on a sequencing device with DNA from a different fluid sample (e.g., saliva or sputum) of the subject and, based on reads for the additional run, perform somatic variant calling for targeted regions of the genome. Such targeted regions may include the genes for DNA methyltransferase 3 alpha (DNMT3A) and/or Tet methylcytosine dioxygenase 2 (TET2), as part of a targeted gene panel for genes in which variants are known to increase the likelihood of CH. However, the higher DNA sample required as an input by existing sequencing systems to detect somatic variants in fluid samples can significantly limit the number of samples eligible for somatic detection and requires slow turnaround times for somatic variant detection relative to other assays.

[0007] These along with additional problems and issues exist in existing sequencing systems.

SUMMARY

[0008] This disclosure describes one or more embodiments of systems, methods, and non-transitory computer readable storage media that solve one or more of the problems described above or provide other advantages over the art. In particular, the disclosed systems can use an improved somatic-variant calling approach to detect somatic variants in a sample across candidate genes, such as candidate genes with known variants resulting in clonal hematopoiesis (CH). For example, the disclosed system can generate variant calls for a genomic sample, such as samples from liquid tissue or impure solid tissue, and enrich such variant-call data to identify somatic variants using a series of genetic thresholds and phasing factors. As part of such a series of genetic thresholds, the disclosed system can identify candidate somatic variants satisfying thresholds for variant allele fractions (VAFs), such as median or mode VAFs determined across multiple samples. The disclosed system can further perform read-based phasing according to germline variants to identify candidate somatic variants that segregate into a portion of one parental haplotype. The disclosed system can further evaluate such candidate somatic variants by determining that the VAFs of the potential somatic variants deviate from VAFs of a subset of germline variants. Based on a candidate somatic variant exhibiting a VAF that satisfies threshold VAFs, phasing that segregates into a portion of phased nucleotide reads for one parental haplotype, and the candidate somatic variant’s VAF deviating from a VAF of local germline variants, the disclosed system can more accurately detect a somatic variant relative to existing systems.

[0009] Additional features and advantages of one or more embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The detailed description refers to the drawings briefly described below.

[0011] FIG. 1 illustrates an environment in which a somatic-variant calling system can operate in accordance with one or more embodiments of the present disclosure.

[0012] FIG. 2 illustrates an overview of the distribution of potential somatic variants across nucleotide reads in accordance with one or more embodiments of the present disclosure.

[0013] FIGS. 3A-3C illustrate the somatic-variant calling system performing multiple evaluations to determine whether a candidate variant is a somatic variant in accordance with one or more implementations of the present disclosure.

[0014] FIG. 4 illustrates the somatic-variant calling system determining that a candidate variant satisfies one or more threshold VAFs in accordance with one or more embodiments of the present disclosure.

[0015] FIG. 5 illustrates the somatic-variant calling system determining whether a candidate variant is a somatic variant based on a threshold pathogenicity score in accordance with one or more embodiments of the present disclosure.

[0016] FIG. 6 illustrates the somatic-variant calling system determining, from phased nucleotide reads, that the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype in accordance with one or more embodiments of the present disclosure.

[0017] FIG. 7 illustrates the somatic-variant calling system determining that a candidate variant is a somatic variant based on determining that the VAF of the candidate variant deviates from VAFs of neighboring germline variants in accordance with one or more embodiments of the present disclosure.

[0018] FIG. 8 illustrates an example decision flowchart by which the somatic-variant calling system determines a candidate variant is a somatic variant in accordance with one or more embodiments of the present disclosure.

[0019] FIG. 9 illustrates a box plot portraying how the somatic-variant calling system accurately identifies individuals that carry CH somatic variants as demonstrated by the ages of the identified carriers in accordance with one or more embodiments of the present disclosure.

[0020] FIG. 10 illustrates a bar graph demonstrating that the somatic-variant calling system accurately produces expected results in identifying CH somatic variants associated with key genes in accordance with one or more implementations of the present disclosure.

[0021] FIG. 11 illustrates a bubble plot portraying increased risk of certain diseases and conditions that the somatic-variant calling system associates with specific gene variants in accordance with one or more embodiments of the present disclosure.

[0022] FIG. 12 illustrates a stacked bar chart demonstrating that the somatic-variant calling system can accurately identify somatic variants of various clone size.

[0023] FIG. 13 illustrates a flowchart of a series of acts for classifying a candidate variant as a somatic variant in accordance with one or more embodiments of the present disclosure.

[0024] FIG. 14 illustrates a block diagram of an example computing device for implementing one or more embodiments of the present disclosure.

DETAILED DESCRIPTION

[0025] This disclosure describes embodiments of a somatic-variant calling system that can use an improved somatic-variant calling approach that applies genetic thresholds and phasing factors to detect somatic variants in a sample’s variant-call and read data. By analyzing nucleotide reads generated by a sequencing device for a target genomic region of a sample, for example, the somatic-variant calling system can generate variant calls for the target genomic region of the sample. From among the variant calls of such variant-call data, the somatic-variant calling system determines that a candidate variant satisfies one or more threshold variant allele fractions (VAFs) based on a variant allele fraction (VAF) of the candidate variant among the nucleotide reads. By further phasing the nucleotide reads for the target genomic region, the somatic-variant calling system can determine the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype. As a further threshold, the somatic-variant calling system can further determine that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold. Such a subset of germline variants can be located within a threshold number of nucleobases of the candidate variant. Based on the VAF of the candidate variant satisfying the one or more threshold VAFs, the candidate variant segregating into the portion of the phased nucleotide reads, and the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold, the somatic-variant calling system can further classify the candidate variant as a somatic variant.

[0026] As indicated above, in some embodiments, the somatic-variant calling system uses different genetic thresholds for evaluating candidate variants. In some embodiments, the somatic-variant calling system determines whether a candidate variant satisfies one or more threshold VAFs. In some examples, the somatic-variant calling system can rely on the assumption or statistical observation that heterozygous germline variants likely have around a 0.5 VAF and homozygous germline variants likely have around a 1.0 VAF. Based on the statistical observations for heterozygous and homozygous germline variants, the somatic-variant calling system can set a lower threshold VAF (e.g., VAF of 0.30, 0.35, 0.40) for heterozygous candidate variants more likely being somatic. By determining that a candidate variant’s VAF is less than the lower threshold VAF, the somatic-variant calling system can determine such a candidate variant is more likely somatic.
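The lower-threshold check in paragraph [0026] can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, the VAF is assumed to be computed from read counts at a single site, and the default of 0.35 is simply one of the example threshold values listed above.

```python
def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads covering a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must be between 0 and total_reads")
    return alt_reads / total_reads


def below_lower_vaf_threshold(vaf: float, lower_threshold: float = 0.35) -> bool:
    """Return True when the VAF falls below the lower threshold VAF.

    Heterozygous germline variants are expected near 0.5 and homozygous
    germline variants near 1.0, so a VAF below the lower threshold marks the
    candidate as more likely somatic. The default value is an assumption
    taken from the example range (0.30, 0.35, 0.40).
    """
    return vaf < lower_threshold


# Example: 6 of 40 reads carry the variant allele.
vaf = variant_allele_fraction(6, 40)
more_likely_somatic = below_lower_vaf_threshold(vaf)
```

A candidate near 0.5, such as 20 of 40 reads, would not pass this check and would need one of the other evaluations described below to be considered somatic.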

[0027] In addition to a VAF threshold and as indicated above, in some embodiments, the somatic-variant calling system evaluates candidate variants by determining whether a candidate variant segregates into a portion of phased nucleotide reads for one parental haplotype. For instance, the somatic-variant calling system can use known germline variants (e.g., heterozygous single nucleotide variants) to phase nucleotide reads or group nucleotide reads of a sample according to parental haplotype. Because somatic variants arise in a post-zygotic event or otherwise later in life, the somatic-variant calling system can determine that somatic variants likely present (i) within nucleotide reads phased according to a single parental haplotype (e.g., maternal or paternal haplotype) and (ii) within only a portion of such phased nucleotide reads for the single parental haplotype. Accordingly, the somatic-variant calling system can identify a candidate variant that segregates into a portion of phased nucleotide reads for a single parental haplotype as more likely a somatic variant.
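As a rough sketch of the two phasing conditions in paragraph [0027], the following Python assumes each nucleotide read has already been assigned to haplotype 1 or 2 (or left unphased) using heterozygous germline variants. The `PhasedRead` type, the function name, and the `min_alt_reads` parameter are illustrative assumptions; a production caller would also need to tolerate sequencing and phasing errors rather than require a perfectly clean split.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PhasedRead:
    haplotype: Optional[int]  # 1 or 2 after phasing; None if unphased
    carries_variant: bool     # True if the read supports the candidate allele


def segregates_into_portion_of_one_haplotype(
    reads: Sequence[PhasedRead], min_alt_reads: int = 2
) -> bool:
    """Check conditions (i) and (ii) for a candidate variant.

    (i)  every variant-supporting phased read lies on a single haplotype;
    (ii) only a portion of that haplotype's reads carry the variant.
    """
    counts = {1: [0, 0], 2: [0, 0]}  # haplotype -> [variant reads, all reads]
    for read in reads:
        if read.haplotype in counts:  # unphased reads are ignored
            counts[read.haplotype][1] += 1
            if read.carries_variant:
                counts[read.haplotype][0] += 1

    carriers = [hap for hap, (alt, _) in counts.items() if alt > 0]
    if len(carriers) != 1:
        return False  # no support, or support on both haplotypes
    alt, total = counts[carriers[0]]
    return alt >= min_alt_reads and alt < total
```

Under these assumptions, a heterozygous germline variant fails condition (ii) because it appears on every read of its haplotype, while a variant seen on both haplotypes fails condition (i).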

[0028] Independent of such haplotype phasing, in one or more embodiments, the somatic-variant calling system further evaluates candidate variants by determining whether the VAF of the candidate variant deviates from VAFs of neighboring or local germline variants. Such local germline variants can be located within tens, hundreds, or thousands of nucleobases of the candidate variant. In particular, some variants may be somatic but have VAFs that are similar to germline variant VAFs. For instance, a somatic variant may coincidentally have a VAF approximating 0.5, which would normally indicate a heterozygous germline variant. By determining whether a candidate variant’s VAF deviates from the VAFs of local germline variants within a deviation threshold, in some examples, the somatic-variant calling system determines whether the candidate variant is more likely somatic.
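One way to picture the local-germline comparison in paragraph [0028] is the Python sketch below. It makes several assumptions that the text does not fix: germline variants are given as (position, VAF) pairs on the same chromosome, the local VAFs are summarized by their median, and the window of 1,000 nucleobases and deviation threshold of 0.10 are placeholder values chosen only for illustration.

```python
from statistics import median
from typing import Optional, Sequence, Tuple


def deviates_from_local_germline(
    candidate_pos: int,
    candidate_vaf: float,
    germline: Sequence[Tuple[int, float]],  # (position, VAF), same chromosome
    window: int = 1000,                     # assumed threshold number of nucleobases
    deviation_threshold: float = 0.10,      # assumed deviation threshold
) -> Optional[bool]:
    """Compare a candidate's VAF with the VAFs of nearby germline variants.

    Returns None when no germline variant lies within the window, because
    the comparison cannot be made in that case.
    """
    local_vafs = [
        vaf for pos, vaf in germline if abs(pos - candidate_pos) <= window
    ]
    if not local_vafs:
        return None
    return abs(candidate_vaf - median(local_vafs)) > deviation_threshold


germline_variants = [(1000, 0.48), (1500, 0.52), (9000, 0.95)]
deviates = deviates_from_local_germline(1200, 0.30, germline_variants)
```

In this example only the germline variants at positions 1000 and 1500 fall inside the window, so the candidate's VAF of 0.30 is compared against a local median of 0.50.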

[0029] In addition or in the alternative to VAF thresholds, in some embodiments, the somatic-variant calling system performs additional evaluations to determine whether a candidate variant is somatic. For example, the somatic-variant calling system can determine that a candidate variant is more likely somatic based on a pathogenicity score of the candidate variant. When a candidate variant exhibits a pathogenicity score (e.g., PrimateAI score, PrimateAI-3D score, CADD score) that satisfies a threshold pathogenicity score, the somatic-variant calling system can determine that a candidate variant is more likely a somatic variant.

[0030] To apply the genetic thresholds and phasing factors noted above, in some cases, the somatic-variant calling system uses a unique tagging approach to identify candidate somatic variants as somatic variants in genes known to include mutations resulting in clonal hematopoiesis (CH). After identifying CH genes, the somatic-variant calling system can identify which candidate variants qualify as somatic variants using a number of tags or labels. In particular, the somatic-variant calling system can assign or omit (1) a central-tendency-VAF tag or label based on the candidate variant satisfying one or more threshold VAFs, such as a mean, median, or mode VAF across multiple samples; (2) a phasing tag indicating that the candidate variant is more likely somatic based on determining that the candidate variant segregates into a portion of phased nucleotide reads according to one parental haplotype; and (3) a local-germline-VAF tag based on determining that the VAF of the candidate variant deviates from VAFs of local germline variants by a deviation threshold. In some cases, the somatic-variant calling system further assigns (4) a pathogenicity tag based on a candidate variant exhibiting a pathogenicity score satisfying a threshold pathogenicity score.

[0031] Based on one or more of the tags (1)-(4), the somatic-variant calling system classifies candidate variants as somatic variants. For instance, if the somatic-variant calling system labels a candidate variant with each of (1) a central-tendency-VAF tag, (2) a phasing tag, (3) a pathogenicity tag, and (4) a local-germline-VAF tag, the candidate variant qualifies as a somatic variant. But the somatic-variant calling system can also apply certain tags as “rescue” tags. If, for example, the somatic-variant calling system does not label a candidate variant with the (1) central-tendency-VAF tag or the (3) pathogenicity tag, but labels the candidate variant with the (4) local-germline-VAF tag, the somatic-variant calling system still labels the candidate variant as a somatic variant. Alternative examples of such “rescue” tags are described further below.
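The tag-based decision in paragraph [0031] might be sketched as follows. This Python example encodes only the two cases spelled out in that paragraph (all four tags assigned, and rescue by the local-germline-VAF tag), and it reads the rescue case literally as not depending on the other tags, which is an interpretive assumption. The tag strings and the function name are hypothetical, and other rescue combinations described elsewhere in the disclosure are not modeled.

```python
from typing import Iterable, Tuple

ALL_TAGS = frozenset(
    {"central_tendency_vaf", "phasing", "pathogenicity", "local_germline_vaf"}
)


def classify_candidate(tags: Iterable[str]) -> Tuple[bool, str]:
    """Return (is_somatic, reason) for the set of tags assigned to a candidate."""
    assigned = frozenset(tags)
    unknown = assigned - ALL_TAGS
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")

    if assigned == ALL_TAGS:
        return True, "all four tags assigned"
    # Rescue case: the local-germline-VAF tag is treated as sufficient even
    # when the central-tendency-VAF or pathogenicity tag is missing.
    if "local_germline_vaf" in assigned:
        return True, "rescued by local-germline-VAF tag"
    return False, "tags insufficient under the rules modeled here"


is_somatic, reason = classify_candidate({"phasing", "local_germline_vaf"})
```

Returning the reason alongside the decision keeps the rescue path visible to downstream reporting, since the full-tag rule is logically a special case of the rescue rule in this simplified reading.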

[0032] As indicated above, the somatic-variant calling system provides several technical advantages relative to existing sequencing systems by, for example, improving identification of somatic variants in fluid samples, improving the accuracy of somatic variant detection, and decreasing the required amount of DNA input for a sample and turnaround times. In particular, the somatic-variant calling system can more accurately detect somatic variants using fluid samples relative to existing sequencing systems. In contrast to many existing systems that rely on the collection of both tumor samples and non-tumor samples to identify shared germline variants, the somatic-variant calling system can evaluate candidate variants using a single sample. Such a single sample may include a blood sample, other liquid sample, or an impure solid sample in which pre-tumorous cells are dispersed and cannot be easily dissected. Indeed, the somatic-variant calling system can apply the various genetic thresholds and phasing factors noted above and below to variant-call data that can be derived from a blood sample. To illustrate, the somatic-variant calling system can use known local germline variants as points for comparison for candidate variants. For instance, the somatic-variant calling system can determine whether a candidate variant segregates into a portion of phased nucleotide reads for one parental haplotype based on known parental germline variants and identify such a candidate variant as somatic. Furthermore, the somatic-variant calling system can identify local germline variants within a threshold number of bases of candidate variants, evaluate whether a candidate variant’s VAF deviates significantly from VAFs of the local germline variants, and identify as somatic such a candidate variant that deviates more than a deviation threshold. Because the somatic-variant calling system can detect somatic variants regardless of somatic cell levels or clone size within a sample that limit some existing sequencing systems, the somatic-variant calling system can also detect early somatic variants. As further shown by FIG. 11 and described below, the somatic-variant calling system can flag somatic-variant-bearing samples for clinicians, display an indicator when a somatic variant shows a subject has clonal hematopoiesis of indeterminate potential (CHIP), and alert such clinicians to enact prophylactic measures.

[0033] In addition to improving somatic-variant detection in fluid samples, the somatic-variant calling system can more accurately detect somatic variants relative to existing sequencing systems by incorporating multiple genetic thresholds and phasing factors. In contrast to existing systems that sometimes fail to detect somatic variants because the variants’ VAFs are too low or resemble germline variant VAFs, or that classify a variant as germline or somatic based on a preset VAF, the somatic-variant calling system can analyze multiple factors to determine whether a candidate variant is somatic or germline. For instance, the somatic-variant calling system can identify candidate variants as somatic when a candidate variant exhibits a VAF that satisfies one or more threshold VAFs, phases into a portion of nucleotide reads for one parental haplotype, and/or exhibits a VAF that deviates from a VAF of local germline variants. Such factors may likewise “rescue,” or detect, somatic candidate variants missed by existing systems. By rescuing such candidates, the somatic-variant calling system can more accurately detect both common and rare somatic variants. If a candidate variant segregates into a portion of phased nucleotide reads for a parental haplotype, for instance, the somatic-variant calling system can classify the candidate variant as somatic, even when the candidate variant does not satisfy VAF thresholds or pathogenicity-score thresholds. When the VAF of a candidate variant deviates significantly from VAFs of local germline variants, on the other hand, the somatic-variant calling system can classify the candidate variant as somatic, even when the candidate variant does not satisfy other VAF, phasing, or pathogenicity-score thresholds. As further shown by FIGS. 9, 10, and 12 and described below, the somatic-variant calling system improves the reliability and sensitivity of detecting somatic variants in a sample relative to existing sequencing systems.

[0034] Beyond improved accuracy in detecting somatic variants, in some cases, the somatic-variant calling system can decrease the sample input relative to existing sequencing systems. As mentioned, some existing sequencing systems cannot accurately identify somatic variants from liquid samples or impure solid samples without additional input DNA. Unlike existing sequencing systems that require relatively higher sample volume from a liquid or impure sample to determine or evaluate a candidate variant, the somatic-variant calling system is adept at detecting and distinguishing germline or somatic variants that exist on relatively few nucleotide reads and, thus, requires less sample volume relative to existing sequencing systems, even from liquid or impure samples. The somatic-variant calling system can leverage relatively less DNA from a liquid sample or impure solid sample to detect somatic variants for a subject and/or determine a likelihood of the presence of a somatic variant in the subject. In some embodiments, for instance, the somatic-variant calling system can leverage whole genome sequencing (WGS) or whole exome sequencing (WES) to identify somatic variants, even if they exist on a limited number of nucleotide reads. In particular, the somatic-variant calling system can sequence nucleotide reads, determine variant calls in variant-call data for the sample, and determine which samples exhibit somatic variants based on one or more of the genetic thresholds or phasing factors described above or below. Based on such nucleotide reads and variant-call data from a relatively DNA-lite process, for example, the somatic-variant calling system can determine a candidate variant is somatic or germline when such a candidate variant exhibits a VAF that satisfies threshold VAFs, phases into a portion of nucleotide reads for one parental haplotype, and/or exhibits a VAF that deviates from a VAF of local germline variants. Because the somatic-variant calling system can detect somatic variants regardless of somatic cell levels or clone size, the somatic-variant calling system does not require additional sample volume for liquid or impure samples. The relatively smaller genetic material required for input in detecting somatic variants for an initial sample (and/or genetic material required by samples for performing separate sequencing runs and variant calling for germline variants and, separately, somatic variants as part of a targeted gene panel) increases the number of eligible samples for somatic variant testing relative to existing sequencing systems.

[0035] By leveraging the speed of WGS or WES, in some embodiments, the somatic-variant calling system can rely on a high-throughput sequencing device to identify supporting nucleotide reads more quickly from a sample that exhibits somatic variants relative to existing sequencing systems. For example, the somatic-variant calling system can directly utilize the nucleotide reads generated by WGS or WES without going through the lengthy and resource-heavy design, manufacture, and validation processes utilized by existing sequencing systems. Further, and by contrast, the somatic-variant calling system can also perform somatic variant calling as part of a targeted gene panel using the disclosed genetic thresholds or phasing factors instead of relying on WGS or WES.

[0036] As suggested by the foregoing discussion, this disclosure utilizes a variety of terms to describe features and benefits of the somatic-variant calling system. Additional detail is hereafter provided regarding the meaning of these terms as used in this disclosure. As used in this disclosure, for instance, the term “genomic sample” (or simply “sample”) refers to a specimen, culture, or the like that is suspected of including a target nucleic acid. In some embodiments, the sample comprises DNA, ribonucleic acid (RNA), peptide nucleic acid (PNA), locked nucleic acid (LNA), or chimeric or hybrid forms of nucleic acids as targets. The genomic sample can likewise include any biological, clinical, surgical, agricultural-atmospheric, or aquatic-based specimen containing one or more nucleic acids. A genomic sample also includes any isolated or extracted nucleic acid sample from an organism, such as a genomic DNA, fresh-frozen, or formalin-fixed paraffin-embedded nucleic acid specimen. In some cases, accordingly, a genomic sample can include a full genome or partial genome that is isolated or extracted (e.g., in whole or in part by a kit) from an organism and that is prepared to undergo sequencing or an assay in a sequencing device. A genomic sample can be from a single individual, a collection of nucleic acid samples from genetically related members, nucleic acid samples from genetically unrelated members, nucleic acid samples (matched) from a single individual such as a tumor sample and normal tissue sample, or sample from a single source that contains two distinct forms of genetic material, such as maternal and fetal DNA obtained from a maternal subject, or the presence of contaminating bacterial DNA in a sample that contains plant or animal DNA. In some embodiments, the source of nucleic acid material can include nucleic acids obtained from a newborn, for example as typically used for newborn screening.

[0037] The genomic sample can include high molecular weight material, such as genomic DNA (gDNA). The sample can include low molecular weight material such as nucleic acid molecules obtained from FFPE or archived DNA samples. In another implementation, low molecular weight material includes enzymatically or mechanically fragmented DNA. The sample can include cell-free circulating DNA. In some implementations, the sample can include nucleic acid molecules obtained from biopsies, tumors, scrapings, swabs, blood, mucus, urine, plasma, semen, hair, laser capture micro-dissections, surgical resections, and other clinical or laboratory obtained samples. In some implementations, the sample can be an epidemiological, agricultural, forensic, or pathogenic sample. In some implementations, the sample can include nucleic acid molecules obtained from an animal such as a human or mammalian source. In another implementation, the sample can include nucleic acid molecules obtained from a non-mammalian source, such as a plant, bacteria, virus, or fungus. In some implementations, the source of the nucleic acid molecules may be an archived or extinct sample or species.

[0038] As further used herein, the term “genomic coordinate” (or sometimes simply “coordinate”) refers to a particular location or position of a nucleobase within a genome (e.g., an organism’s genome or a reference genome). In some cases, a genomic coordinate includes an identifier for a particular chromosome of a genome and an identifier for a position of a nucleobase within the particular chromosome. For instance, a genomic coordinate or coordinates may include a number, name, or other identifier for a somatic or sex chromosome (e.g., chr1 or chrX) and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570 or chr1:1234570-1234870). In some cases, a genomic coordinate refers to a genomic coordinate on a sex chromosome (e.g., chrX or chrY). Consequently, the somatic-variant calling system can determine genotype probabilities for a genotype call (e.g., a variant call) for a genomic coordinate on a sex chromosome. Further, in certain implementations, a genomic coordinate refers to a source of a reference genome (e.g., mt for a mitochondrial DNA reference genome or SARS-CoV-2 for a reference genome for the SARS-CoV-2 virus) and a position of a nucleobase within the source for the reference genome (e.g., mt:16568 or SARS-CoV-2:29001). By contrast, in certain cases, a genomic coordinate refers to a position of a nucleobase within a reference genome without reference to a chromosome or source (e.g., 29727).
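
As a minimal sketch of how coordinate strings of the kinds listed above might be handled in software, the following Python snippet splits a coordinate into a source identifier and a position range. The function name `parse_coordinate` and the returned tuple layout are assumptions made purely for illustration and are not part of this disclosure.

```python
from typing import Optional, Tuple


def parse_coordinate(text: str) -> Tuple[Optional[str], int, int]:
    """Parse 'source:pos', 'source:start-end', or a bare 'pos' string.

    Returns (source, start, end). The source is None when no chromosome or
    reference source is given, and end equals start for a single position.
    """
    # Split on the LAST colon so that sources containing hyphens
    # (e.g., SARS-CoV-2) are not confused with a position range.
    source, sep, positions = text.strip().rpartition(":")
    start_text, dash, end_text = positions.partition("-")
    start = int(start_text)
    end = int(end_text) if dash else start
    if end < start:
        raise ValueError(f"end precedes start in {text!r}")
    return (source if sep else None), start, end


print(parse_coordinate("chr1:1234570-1234870"))  # ('chr1', 1234570, 1234870)
print(parse_coordinate("SARS-CoV-2:29001"))      # ('SARS-CoV-2', 29001, 29001)
print(parse_coordinate("29727"))                 # (None, 29727, 29727)
```

Splitting on the last colon rather than the first hyphen is a design choice of this sketch; it keeps hyphenated source names intact while still recognizing a position range.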

[0039] As further used herein, the term “target genomic region” refers to a particular location or sequence within a genomic sample. In particular, a target genomic region comprises a specific segment of a genomic sample that is selected for detailed analysis. A target genomic region may include genes, variants, or other DNA sequences of interest. For example, a target genomic region can comprise one or more variants, including a candidate variant for evaluation. In some cases, a target genomic region includes a range of genomic coordinates. Like genomic coordinates, in certain implementations, a target genomic region may be identified by an identifier for a chromosome and a particular position or positions, such as numbered positions following the identifier for a chromosome (e.g., chr1:1234570-1234870). In various implementations, a target genomic coordinate includes a position within a reference genome. In some cases, a target genomic coordinate is specific to a particular reference genome.

[0040] As used herein, the term “sequencing device” refers to an instrument or platform used to perform a sequencing process. In particular, a sequencing device refers to an instrument or platform used to perform a sequencing process based on sequencing by synthesis (SBS) technology, single-molecule real-time sequencing (SMRT) technology using magnetic beads or nanopores or other suitable medium. For example, a sequencing device may comprise components including, but not limited to, flow cell receptacle, fluidics systems, lasers, imaging systems, and computational capabilities for acquiring, processing, and analyzing image data during a sequencing run.

[0041] As used herein, the term “variant call” refers to a nucleobase call comprising a mutation or a variant at a particular genomic coordinate or genomic region with respect to a reference. In particular, a variant call includes a determination or prediction that a genomic sample comprises a particular nucleobase (or sequence of nucleobases) at a genomic coordinate or region that differs from a reference nucleobase (or sequence of reference nucleobases) at the same genomic coordinate or region within a reference genome.

[0042] As used herein, the term “variant allele fraction” (or simply “VAF”) refers to a measure of a proportion of a specific variant present within a genomic sample relative to nucleotide sequences at the same genomic location. In particular, a VAF refers to a proportion of nucleotide reads that contains a specific variant compared to the total number of nucleotide reads from a genomic sample covering or spanning a target genomic region. VAF can help indicate the presence and proportion of a variant within a genomic sample, aiding in distinguishing between somatic and germline mutations or variants. In one example, if a genomic sample comprises 100 nucleotide reads at a target genomic region, and 25 of the nucleotide reads contain a given variant, the VAF of the given variant would be 0.25 (or 25%).
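
The arithmetic in the example above can be sketched in a few lines of Python. The function name `variant_allele_fraction` and its argument names are hypothetical and chosen only for illustration; the snippet simply divides the count of variant-supporting reads by the total reads covering the region.

```python
def variant_allele_fraction(variant_reads: int, total_reads: int) -> float:
    """Return the fraction of reads at a region that carry the variant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= variant_reads <= total_reads:
        raise ValueError("variant_reads must be between 0 and total_reads")
    return variant_reads / total_reads


# The example from the text: 25 variant-supporting reads out of 100.
print(variant_allele_fraction(25, 100))  # 0.25
```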

[0043] As further used herein, the term “nucleotide read” (or simply “read”) refers to an inferred sequence of one or more nucleobases (or nucleobase pairs) from all or part of a sample nucleotide sequence (e.g., a sample genomic sequence, complementary DNA). In particular, a nucleotide read includes a determined or predicted sequence of nucleobase calls for a nucleotide sequence (or group of monoclonal nucleotide sequences) from a sample library fragment corresponding to a genomic sample. For example, in some cases, a sequencing device determines a nucleotide read by generating nucleobase calls for nucleobases passed through a nanopore of a nucleotide-sample slide, determined via fluorescent tagging, or determined from a cluster in a flow cell. In some cases, a nucleotide read can refer to a particular type of read, such as a nucleotide read synthesized from sample library fragments that are shorter than a threshold number of nucleobases (e.g., SBS reads). In these or other cases, another type of nucleotide read can refer to (i) assembled nucleotide reads that have been assembled from shorter nucleotide reads to form a contiguous sequence (e.g., assembled nucleotide reads) satisfying a threshold number of nucleobases, (ii) circular consensus sequencing (CCS) reads satisfying the threshold number of nucleobases, or (iii) nanopore long reads satisfying the threshold number of nucleobases.

[0044] As further used herein, the term “phasing” (or “haplotype phasing”) refers to a process of separating nucleotide reads of one or more samples into respective parental haplotypes. For instance, phasing can occur by identifying unique alleles or variants (e.g., SNPs, indels) on nucleotide reads in a genomic region, organizing such nucleotide reads in the genomic region according to the unique alleles or variants, and identifying subsets of nucleotide reads according to a maternal haplotype or paternal haplotype based on the organization or grouping. In some cases, haplotype phasing is performed by a haplotype phasing model that uses a hidden Markov model (HMM) or another algorithm, such as Segmented HAPlotype Estimation and Imputation Tool (SHAPEIT), BEAGLE, Eagle2, or WhatsHap.
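
The grouping step described above can be illustrated with a deliberately simplified Python sketch. It assumes a toy read representation (a dictionary mapping genomic position to observed base) and a single known heterozygous site; the names `Read`, `phase_reads`, `haplotype_1`, and `haplotype_2` are hypothetical. This is not how SHAPEIT, BEAGLE, Eagle2, or WhatsHap operate, since those tools combine many sites and tolerate read errors, and the sketch does not determine which group is maternal or paternal.

```python
from typing import Dict, List

# Hypothetical toy read: genomic position -> observed base.
Read = Dict[int, str]


def phase_reads(reads: List[Read], het_position: int,
                allele_a: str, allele_b: str) -> Dict[str, List[int]]:
    """Group read indices by the allele each read carries at one heterozygous site."""
    groups: Dict[str, List[int]] = {
        "haplotype_1": [], "haplotype_2": [], "unphased": []
    }
    for index, read in enumerate(reads):
        base = read.get(het_position)  # None if the read does not span the site
        if base == allele_a:
            groups["haplotype_1"].append(index)
        elif base == allele_b:
            groups["haplotype_2"].append(index)
        else:
            groups["unphased"].append(index)
    return groups


reads = [{100: "A", 150: "T"}, {100: "G"}, {100: "A"}, {150: "T"}]
print(phase_reads(reads, het_position=100, allele_a="A", allele_b="G"))
# {'haplotype_1': [0, 2], 'haplotype_2': [1], 'unphased': [3]}
```

Reads that do not cover the heterozygous site, or that carry neither allele, are kept in a separate group rather than discarded so that a later step could attempt to place them using other sites.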

[0045] Relatedly, as further used herein, the term “phased nucleotide read” refers to nucleotide reads that have been assigned to parental haplotypes or chromosomes by phasing. In particular, phased nucleotide reads have been phased by identifying alleles (or variants) from each parent and organizing the nucleotide reads to show which specific variants are inherited together from each parent. Phased nucleotide reads make it possible to distinguish between maternal and paternal haplotypes of each chromosome by grouping variants that are inherited together on one chromosome.

[0046] As also used herein, the term “haplotype” refers to nucleotide sequences that are present in an organism (or present in organisms from a population) and inherited from one or more ancestors. In particular, a haplotype can include alleles or other nucleotide sequences present in organisms of a population and inherited together by such organisms respectively from a single parent. In one or more embodiments, haplotypes include a set of SNPs on the same chromosome that tend to be inherited together. In some cases, data representing a haplotype or a set of different haplotypes are stored or otherwise accessible on a haplotype database.

[0047] Relatedly, as used herein, the term “parental haplotype” refers to a group of genes within an organism that was inherited together from a single parent. In particular, parental haplotypes include distinct sets of variants or alleles present in organisms of a population and inherited together by such organisms respectively from a single parent. For example, each diploid organism inherits one paternal haplotype and one maternal haplotype for each chromosome pair. In some examples, sequencing systems can phase nucleotide reads to identify whether variants segregate into nucleotide reads for a paternal haplotype or a maternal haplotype.

[0048] As used herein, the term “germline variant” refers to a variant or mutation inherited by a sample organism from biological parents or present within germ cells. In particular, a germline variant is a heritable variant that tends to be present in every somatic and germline cell of offspring. Germline variants are accordingly inherited variants and are theoretically present in every cell of an individual’s body. A well-known example of a germline variant is a mutation in the BRCA1 or BRCA2 genes, which can be inherited from either parent and is associated with an increased risk of breast and ovarian cancers.

[0049] As used herein, the term “somatic variant” refers to a variant or mutation that is acquired, rather than inherited, as part of a post-zygotic event. In particular, a somatic variant often results from environmental factors, cellular replication errors, or aging, and can influence cell behavior in ways that may contribute to diseases. Accordingly, a somatic variant includes a variant or mutation that was introduced after zygote formation during cell development or later in life, but is not inherited from a sample organism’s biological parents. In some embodiments, a somatic variant comprises a clonal variant that leads to the expansion of a cell clone. For example, a somatic variant can result in clonal hematopoiesis (CH), a condition in which blood stem cells acquire mutations that give rise to genetically distinct populations of blood cells, often increasing with age and associated with an elevated risk of blood cancers and cardiovascular disease.

[0050] As further used herein, the term “pathogenicity” refers to the ability or tendency of a biological molecule to contribute to disease within a host organism. In particular, pathogenicity can refer to the ability or tendency of a biological molecule to cause or lead to the susceptibility of disease within the host organism. For example, pathogenicity can refer to the ability or tendency of a nucleotide variant encoding an amino acid (or an amino acid) to cause a protein including the amino acid to function in a manner that causes disease within a host organism or leads to susceptibility of the host organism to disease. More specifically, in some cases, pathogenicity refers to the ability or tendency of a nucleotide variant or an amino-acid variant within a protein to change the function and / or structure of the protein such that the protein causes or leads to susceptibility to disease.

[0051] Relatedly, the term “pathogenicity score” refers to a score that is indicative of pathogenicity. In particular, a pathogenicity score can refer to a score that indicates the pathogenicity of a nucleotide or an amino acid within a target protein sequence. For instance, in some cases, a pathogenicity score includes a score indicating the pathogenicity of a nucleotide variant or an amino-acid variant included in the target protein sequence. In some instances, a pathogenicity score includes a numerical value where a relatively higher value indicates relatively higher pathogenicity and a relatively lower value indicates relatively lower pathogenicity or vice versa. In some embodiments, a pathogenicity score provides a direct measure of pathogenicity (e.g., the score indicates the level of pathogenicity). In some instances, however, a pathogenicity score provides an indirect measure of pathogenicity, such as by measuring some other characteristic related to pathogenicity (e.g., the depletion of observed variants).

[0052] The following paragraphs describe the somatic-variant calling system with respect to illustrative figures that portray example embodiments and implementations. For example, FIG. 1 illustrates a schematic diagram of a computing system 100 in which a somatic-variant calling system 106 operates in accordance with one or more embodiments. As illustrated, the computing system 100 includes a sequencing device 102 connected to a local device 108 (e.g., a local server device), one or more server device(s) 110, and a client device 114. As shown in FIG. 1, the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114 can communicate with each other via a network 118. The network 118 comprises any suitable network over which computing devices can communicate. Example networks are discussed in additional detail below with respect to FIG. 14. While FIG. 1 shows an embodiment of the somatic-variant calling system 106, this disclosure describes alternative embodiments and configurations below.

[0053] As indicated by FIG. 1, the sequencing device 102 comprises a computing device and a sequencing device system 104 for sequencing a genomic sample or other nucleic-acid polymer. In some embodiments, by executing the sequencing device system 104 using a processor, the sequencing device 102 analyzes nucleotide fragments or oligonucleotides extracted from genomic samples to generate nucleotide reads or other data utilizing computer implemented methods and systems either directly or indirectly on the sequencing device 102. More particularly, the sequencing device 102 receives nucleotide-sample slides (e.g., flow cells) comprising nucleotide fragments extracted from samples and further copies and determines the nucleobase sequence of such extracted nucleotide fragments.

[0054] In one or more embodiments, the sequencing device 102 utilizes sequencing-by-synthesis (SBS) techniques to sequence nucleotide fragments into nucleotide reads and determine nucleobase calls for the nucleotide reads. In addition, or in the alternative to communicating across the network 118, in some embodiments, the sequencing device 102 bypasses the network 118 and communicates directly with the local device 108 and / or the client device 114. By executing the sequencing device system 104, the sequencing device 102 can further store the nucleobase calls as part of a base-call data file that is formatted as a binary base call (BCL) file and / or a FASTQ file and send the BCL file and / or FASTQ file to the local device 108 and / or the server device(s) 110.

[0055] As further indicated by FIG. 1, the local device 108 is located at or near a same physical location of the sequencing device 102. Indeed, in some embodiments, the local device 108 and the sequencing device 102 are integrated into a same computing device. The local device 108 may run the sequencing device system 104 and / or the somatic-variant calling system 106 to generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As shown in FIG. 1, the sequencing device 102 may send (and the local device 108 may receive) base-call data generated during a sequencing run of the sequencing device 102. The local device 108 may also communicate with the client device 114. In particular, the local device 108 can send data to the client device 114, including a binary alignment map (BAM) file, a variant call format (VCF) file, or other information indicating nucleobase calls, sequencing metrics, error data, or other metrics. In some implementations, the local device 108 can send, to the client device 114, an indication that a subject corresponding with a genomic sample has a condition based on sequencing data.

[0056] As further indicated by FIG. 1, the server device(s) 110 are located remotely from the local device 108 and the sequencing device 102. Like the local device 108, in some embodiments, the server device(s) 110 include a version of (or are otherwise able to access or implement) the somatic-variant calling system 106. For example, the server device(s) 110 can implement the somatic-variant calling system 106 as part of a sequencing system 112. Accordingly, the server device(s) 110 may generate, receive, analyze, store, and transmit digital data, such as by receiving base-call data or determining variant calls based on analyzing such base-call data. As indicated above, the sequencing device 102 may send (and the server device(s) 110 may receive) base-call data from the sequencing device 102. The server device(s) 110 may also communicate with the client device 114. In particular, the server device(s) 110 can send data to the client device 114, including BAM files, VCF files, or other sequencing related information.

[0057] In some embodiments, the server device(s) 110 comprise a distributed collection of servers where the server device(s) 110 include a number of server devices distributed across the network 118 and located in the same or different physical locations. Further, the server device(s) 110 can comprise a content server, an application server, a communication server, a web-hosting server, or another type of server.

[0058] As further illustrated and indicated in FIG. 1, by executing a sequencing application 116, the client device 114 can generate, store, receive, and send digital data. In particular, the client device 114 can receive sequencing data from the local device 108 or receive base-call data files (e.g., BCL and FASTQ) and sequencing metrics from the sequencing device 102. Furthermore, the client device 114 may communicate with the local device 108 or the server device(s) 110 to receive a VCF comprising genotype or variant calls and / or other metrics, such as base-call-quality metrics or pass-filter metrics. The client device 114 can accordingly present or display information pertaining to variant calls or other genotype calls within a graphical user interface of the sequencing application 116 to a user associated with the client device 114. For example, the client device 114 can present nucleobase calls, genotype calls, variant calls, and / or sequencing metrics for a sequenced genomic sample within a graphical user interface of the sequencing application 116.

[0059] Although FIG. 1 depicts the client device 114 as a desktop or laptop computer, the client device 114 may comprise various types of client devices. For example, in some embodiments, the client device 114 includes non-mobile devices, such as desktop computers or servers, or other types of client devices. In yet other embodiments, the client device 114 includes mobile devices, such as laptops, tablets, mobile telephones, or smartphones. Additional details regarding the client device 114 are discussed below with respect to FIG. 14.

[0060] As further illustrated in FIG. 1, the client device 114 includes the sequencing application 116. The sequencing application 116 may be a web application or a native application stored and executed on the client device 114 (e.g., a mobile application, desktop application). The sequencing application 116 can include instructions that (when executed) cause the client device 114 to receive data from the somatic-variant calling system 106 and present, for display at the client device 114, base-call data or data from an alignment data file or VCF. Furthermore, the sequencing application 116 can instruct the client device 114 to display summaries for multiple sequencing runs. Additionally, the sequencing application 116 can include instructions that (when executed) cause the client device 114 to present, for display at the client device 114, information pertaining to a subject corresponding with a genomic sample, such as an indicator that a subject corresponding with the genomic sample has clonal hematopoiesis of indeterminate potential (CHIP).

[0061] As further illustrated in FIG. 1, a version of the somatic-variant calling system 106 may be located and / or implemented (e.g., entirely or in part) on the client device 114 or the sequencing device 102. In yet other embodiments, the somatic-variant calling system 106 is implemented by one or more other components of the computing system 100, such as the local device 108. In particular, the somatic-variant calling system 106 can be implemented in a variety of different ways across the sequencing device 102, the local device 108, the server device(s) 110, and the client device 114. For example, the somatic-variant calling system 106 can be downloaded from the server device(s) 110 to the client device 114 and / or the local device 108 where all or part of the functionality of the somatic-variant calling system 106 is performed at each respective device within the computing system 100.

[0062] As mentioned previously, the somatic-variant calling system 106 can detect somatic variants based on certain traits of candidate variants. FIG. 2 provides an overview of the distribution of potential somatic variants across nucleotide reads in accordance with one or more embodiments of the present disclosure.

[0063] In some embodiments, the somatic-variant calling system 106 identifies somatic variants that are detected through whole exome sequencing (WES) or whole genome sequencing (WGS) of DNA from fluid samples (e.g., blood, saliva, etc.). For example, the somatic-variant calling system 106 can identify variants resulting in clonal hematopoiesis (CH) within blood-derived DNA. CH is a disease that leads to proliferations of hematopoietic stem cells. CH variants often confer a growth advantage to mutated cells, resulting in more growth relative to other cells and a population of blood cells with identical genetic changes. These CH variants are typically somatic variants in genes related to cell growth and survival. Accordingly, CH variants tend to permeate throughout blood, which makes the mutations detectable through the whole exome of DNA in blood. CH mutated cells can lead to several clinical outcomes, particularly as they expand and increase their influence within the blood cell population. For example, subjects having CH mutated cells exhibit clonal hematopoiesis of indeterminate potential (CHIP). While CHIP may not initially cause symptoms or immediate health problems, CHIP is considered a preclinical state because it increases the risk of developing hematologic cancers, such as leukemia, and is associated with an elevated risk of cardiovascular disease and overall mortality.

[0064] As mentioned, existing sequencing systems often face challenges in identifying somatic variants, such as CH variants. One primary challenge that existing sequencing systems face in identifying somatic variants from fluid samples is that somatic variants are often clonal and exist in a small proportion of blood. Thus, somatic variants may be less concentrated and more challenging to capture within a fluid sample relative to solid tumor samples. Furthermore, somatic variants found in blood can be difficult to differentiate from germline variants. In contrast to solid tumors that form localized masses that can be sampled and tested, single fluid samples cannot easily be divided into mutated and non-mutated cells. Therefore, genetic comparisons between mutated and non-mutated cells in fluid samples or impure solid samples are challenging. The somatic-variant calling system 106 can overcome challenges in detecting somatic variants in fluid samples by taking sequencing data generated for germline sequencing and comparing candidate variants with known germline variants. FIG. 2 illustrates a distribution of traits of germline variants that the somatic-variant calling system 106 can distinguish from potential somatic variants.

[0065] As illustrated by FIG. 2, nucleotide reads 202 comprise nucleotide reads that cover or partially cover the same target genomic region of a genomic sample. As suggested by FIG. 2, the somatic-variant calling system 106 can identify, from the nucleotide reads 202, reads that carry a germline variant 204. Because each cell typically has two copies of each gene, a normal inherited mutation has a VAF of around 50% (or 0.5). To illustrate, the germline variant 204 exists in approximately 50% of the nucleotide reads 202. The germline variant 204 illustrated in FIG. 2 comprises a heterozygous germline variant. Homozygous germline variants have VAFs that approximate 100% because the variant is present in nearly all of the nucleotide reads 202. But germline variants may have VAFs that deviate from 50% and 100% because of contaminated or mixed samples, sequencing errors, coverage variability, or other sequencing artifacts. Thus, existing sequencing systems that simply identify somatic variants as those deviating from 50% and 100% VAFs may be inaccurate and misidentify somatic variants.
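
To give a rough sense of how much coverage variability alone can move an observed VAF away from 50%, the following Python sketch computes the standard deviation of the observed VAF under a simple binomial read-sampling model. The binomial model and the function name `expected_vaf_spread` are assumptions introduced for illustration; the model ignores contamination, sequencing errors, and other artifacts mentioned above.

```python
import math


def expected_vaf_spread(true_fraction: float, depth: int) -> float:
    """Standard deviation of the observed VAF if each of `depth` reads
    independently carries the variant with probability `true_fraction`."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    if not 0.0 <= true_fraction <= 1.0:
        raise ValueError("true_fraction must be between 0 and 1")
    return math.sqrt(true_fraction * (1.0 - true_fraction) / depth)


# A heterozygous germline variant (true fraction 0.5) at two read depths.
for depth in (30, 100):
    spread = expected_vaf_spread(0.5, depth)
    print(f"depth={depth}: standard deviation of VAF is about {spread:.3f}")
```

Under this assumed model, a heterozygous variant at a depth of 100 reads has a standard deviation of about 0.05, so observed VAFs of 40% or 60% are not unusual even without any somatic process.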

[0066] As further illustrated in FIG. 2, the somatic-variant calling system 106 can identify a sequencing artifact 206 having a small VAF and distinguish such sequencing artifacts from germline or somatic variants. In particular, sequencing artifacts typically occur randomly and at low frequency across nucleotide reads, rather than consistently at the same position (e.g., genomic coordinate). Because sequencing artifacts are not true genetic variants but are instead introduced by the sequencing process, such as errors introduced during template preparation or base calling on a sequencing device, sequencing artifacts appear only sporadically among the total nucleotide reads.

[0067] FIG. 2 further illustrates two potential somatic variants, namely a potential somatic variant 208 and a potential somatic variant 210. As shown, potential somatic variants can have VAFs that typically range from very low percentages (e.g., 1%-10%) to moderate levels (e.g., 30%-50%). For example, the potential somatic variant 208 has a low VAF and is found in a small proportion of the nucleotide reads 202. The potential somatic variant 208 can be challenging to detect and may indicate the presence or emergence of a clonal variant. In contrast, the potential somatic variant 210, which has a moderate VAF, is representative of variants that are established in clonal populations. For example, if the majority of cells in the sample have the potential somatic variant 210, the VAF will be close to 50% (as the potential somatic variant 210 is likely heterozygous). Additionally, some somatic variants may have relatively higher VAFs (between 50% and 70%), especially if, for example, the somatic variant is part of an allele that is overly amplified during sequencing or is homozygous or hemizygous.

[0068] While somatic variants have a wide range of potential VAFs, so do variants that are not likely somatic. For example, FIG. 2 illustrates an unlikely somatic variant 212 with a VAF of around 0.75. Generally, VAFs around 0.75 are considered unlikely somatic because the high VAF suggests that the variant is present in nearly all cells and on both copies of the chromosome in a significant proportion of them. In diploid cells, a 75% VAF could imply either a copy number variation or that it is a homozygous or compound heterozygous germline variant.

[0069] As suggested by FIG. 2, simply relying on VAF to detect somatic variants can lead to inaccurate results. Thus, the somatic-variant calling system 106 utilizes a number of genetic thresholds and phasing factors, in addition to analyzing common VAFs, to accurately detect somatic variants. FIGS. 3A-3C illustrate the somatic-variant calling system 106 utilizing such genetic thresholds and phasing factors to determine whether a candidate variant is a somatic variant in accordance with one or more implementations of the present disclosure.

[0070] As shown in FIG. 3A, the somatic-variant calling system 106 performs an act 302 of generating variant calls from nucleotide reads. In this example, the somatic-variant calling system 106 uses a sequencing device to generate nucleotide reads 306. For instance, the somatic-variant calling system 106 can use whole genome sequencing (WGS) or whole exome sequencing (WES) to generate the nucleotide reads 306. As explained above, the nucleotide reads 306 represent fragments of a genomic sample. The somatic-variant calling system 106 further aligns the nucleotide reads 306 to a reference genome 304. The reference genome 304 comprises a standardized DNA sequence that serves as a benchmark for comparison. More particularly, the somatic-variant calling system 106 maps each read to the corresponding location in the reference genome 304. The somatic-variant calling system 106 can compare the aligned nucleotide reads with the reference genome to identify differences between the nucleotide reads 306 and the reference genome 304.

[0071] As further indicated by FIG. 3A, the somatic-variant calling system 106 identifies differences between nucleobases of the nucleotide reads 306 and the reference genome 304 as variants and thereby determines variant calls. In some implementations, the somatic-variant calling system 106 generates variant calls from variants by assigning confidence scores (e.g., Phred-scaled QUAL scores) to the variants and filtering out likely sequencing errors. For example, and as depicted in FIG. 3A, the somatic-variant calling system 106 generates variant calls (depicted as dark segments) within the nucleotide reads 308 based on identified variants having confidence scores that satisfy a threshold confidence score value.
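
A Phred-scaled score Q corresponds to an error probability of 10^(-Q/10), so a confidence-score filter of the kind described above can be sketched as follows. The dictionary-based variant records, the function names, and the threshold value of 30 are all hypothetical choices for illustration; the disclosure does not specify a particular threshold confidence score value.

```python
from typing import Dict, List


def phred_to_error_probability(qual: float) -> float:
    """Convert a Phred-scaled score Q to an error probability 10^(-Q/10)."""
    return 10.0 ** (-qual / 10.0)


def filter_by_qual(variants: List[Dict], min_qual: float) -> List[Dict]:
    """Keep variants whose QUAL score satisfies the threshold."""
    return [v for v in variants if v["qual"] >= min_qual]


# Placeholder records for illustration only, not real calls.
variants = [
    {"id": "var_a", "qual": 12.0},
    {"id": "var_b", "qual": 45.0},
    {"id": "var_c", "qual": 30.0},
]
kept = filter_by_qual(variants, min_qual=30.0)  # assumed threshold
print([v["id"] for v in kept])  # ['var_b', 'var_c']
```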

[0072] In some implementations, the somatic-variant calling system 106 generates variant call files comprising variant calls and corresponding data. In particular, in some implementations, the somatic-variant calling system 106 generates variant call files comprising candidate variants or variants that are potentially somatic. Additionally, in some implementations, the somatic-variant calling system 106 further filters variants to compensate for lack of comparative normal tissue and identify candidate variants. For example, in some implementations, the somatic-variant calling system 106 filters variants using a germline resource Variant Call Format (VCF) file and / or a Panel of Normals (PoN). The germline resource VCF comprises a file that contains information on germline variants for a genomic sample. The germline resource VCF can serve as a reference resource. Furthermore, the somatic-variant calling system 106 can use a PoN comprising a collection of DNA samples from healthy individuals to help identify and filter out technical artifacts and recurrent sequencing errors from variant calling analyses.

[0073] As further shown in FIG. 3A, the somatic-variant calling system 106 performs an optional act 310 of determining that the candidate variant is in a known CH-associated region. In some cases, the somatic-variant calling system 106 filters variants to prioritize rare mutations associated with CH by narrowing down variants within a list of known CH-related regions. In particular, as part of performing the optional act 310, the somatic-variant calling system 106 determines that the candidate variant is in a known clonal-hematopoiesis-associated region in a threshold number of samples from a mutation database 314. In some embodiments, the mutation database 314 contains information about somatic variants found in cancer samples. For example, the mutation database 314 can comprise a Catalogue of Somatic Mutations in Cancer (COSMIC) database containing information about somatic mutations found in human cancer samples. In some implementations, the somatic-variant calling system 106 filters variants that are found in a threshold number of samples 316 within the mutation database 314. For instance, the somatic-variant calling system 106 can focus its analysis on variants present in 10 or more samples from the mutation database 314. Use of this filter both highlights variants that reoccur in cancerous tissues, which may further suggest pathogenic relevance, and reduces the amount of data to be analyzed. As shown in FIG. 3A, the somatic-variant calling system 106 determines that a candidate variant is found within three samples, as an example of the threshold number of samples 316. Accordingly, the somatic-variant calling system 106 determines to filter the candidate variant to be part of further analysis in detecting somatic variants.

[0074] Furthermore, and as shown in FIG. 3A, the somatic-variant calling system 106 performs the optional act 312 of determining that an allele frequency of the candidate variant satisfies a threshold population allele frequency. By performing this optional process, the somatic-variant calling system 106 can eliminate common germline variants that are unlikely to be pathogenic and instead focus on rare variants that may be more clinically relevant. The somatic-variant calling system 106 can thus improve efficiency by reducing the amount of genomic data to be analyzed for candidate somatic variants. The somatic-variant calling system 106 determines a threshold population allele frequency. For example, the somatic-variant calling system 106 can determine a threshold population allele frequency to identify candidate variants having an allele frequency (AF) less than 0.0005. In some cases, the threshold population allele frequency is another value, such as 0.0020, 0.0010, 0.0003, or 0.0001. Accordingly, any variant found in more than 0.05% of a population (or a different threshold AF) would be removed from being a candidate variant.

[0075] As further shown in FIG. 3A, in some cases, the somatic-variant calling system 106 can perform the optional act 312 using data from a reference database 318. In particular, the reference database 318 may comprise the Genome Aggregation Database (gnomAD) including allele frequencies for millions of genetic variants across diverse populations. The somatic-variant calling system 106 can compare the observed AF of a candidate variant within a population to the frequency data from the reference database 318.

[0076] In some implementations, the somatic-variant calling system 106 modifies the threshold population AF (used as part of the optional act 312) based on whether the variant is in a known CH-associated region (as part of the optional act 310). In particular, if the somatic-variant calling system 106 determines that the candidate variant is in a known CH-associated region, the somatic-variant calling system 106 can increase the threshold population AF so as to capture more candidate variants in CH-associated regions. To illustrate, in some examples, if the somatic-variant calling system 106 determines that the candidate variant is in a known CH-associated region, the somatic-variant calling system 106 determines that the threshold population AF is 0.001 (or 0.1%) instead of 0.0005 (or 0.05%).
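
As a minimal sketch of the optional acts 310 and 312, the following example combines a recurrence filter with a population-AF filter whose threshold is relaxed for CH-associated regions. The function names and argument layout are hypothetical; the numeric defaults (10 samples, 0.0005, and 0.001) are the example values given above, and other values can be substituted.

```python
def passes_recurrence_filter(sample_count: int, min_samples: int = 10) -> bool:
    """Optional act 310: keep variants seen in at least `min_samples`
    samples of a mutation database (example value from the description)."""
    return sample_count >= min_samples


def population_af_threshold(in_ch_region: bool,
                            base: float = 0.0005,
                            relaxed: float = 0.001) -> float:
    """Return the threshold population AF, relaxed for CH-associated regions."""
    return relaxed if in_ch_region else base


def is_rare(population_af: float, in_ch_region: bool = False) -> bool:
    """Optional act 312: a variant is kept when its population AF is
    below the applicable threshold."""
    return population_af < population_af_threshold(in_ch_region)


# A variant with a population AF of 0.08% is removed under the default
# threshold but retained when it lies in a known CH-associated region.
outside_ch = is_rare(0.0008, in_ch_region=False)
inside_ch = is_rare(0.0008, in_ch_region=True)
```

Keeping the threshold selection in its own function makes the dependency between the two optional acts explicit without coupling either filter to a particular database.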

[0077] FIG. 3A illustrates the somatic-variant calling system 106 filtering candidate variants based on their association with a known CH-associated region. In some implementations, the somatic-variant calling system 106 can perform the optional acts 310 and 312 while targeting other conditions. For example, the somatic-variant calling system 106 can determine whether a candidate variant is in regions associated with other conditions in which genetic variants in hematopoietic cells lead to abnormal cell growth, increased risk of blood cancers, or related disorders. Examples of conditions that the somatic-variant calling system 106 can target include, but are not limited to, Myelodysplastic Syndrome (MDS), Chronic Myelomonocytic Leukemia (CMML), Myeloproliferative Neoplasms (MPNs), aplastic anemia, Paroxysmal Nocturnal Hemoglobinuria (PNH), Acute Myeloid Leukemia (AML), lymphoid clonal disorders, and other conditions.

[0078] As illustrated in FIG. 3B, the somatic-variant calling system 106 performs an act 320 of determining that a candidate variant satisfies one or more threshold VAFs. In particular, the somatic-variant calling system 106 determines that a true somatic variant is likely to exist on some but not all haplotype nucleotide reads and is likely to be heterozygous. Accordingly, the somatic-variant calling system 106 can determine one or more threshold variant allele fractions to which to compare a VAF of the candidate variant. For example, in some implementations, the somatic-variant calling system 106 can determine that a candidate variant is likely a somatic variant based on the VAF for the candidate variant for a genomic sample being less than or equal to a lower threshold VAF of 0.30, 0.35, or 0.40. In some examples, the somatic-variant calling system 106 determines that a candidate variant is somatic based on a VAF of a candidate variant satisfying a mean VAF of 0.30, 0.34, or 0.40 across multiple genomic samples. Additionally or alternatively, in certain embodiments, the somatic-variant calling system 106 determines that a candidate variant is likely a somatic variant based on a central tendency VAF for the candidate variant (i) being less than or equal to a lower threshold VAF or (ii) being greater than or equal to an upper threshold VAF (e.g., 0.60, 0.65, 0.70). FIG. 4 and the corresponding discussion further describe how the somatic-variant calling system 106 determines that a candidate variant satisfies one or more threshold VAFs in accordance with one or more embodiments of the present disclosure.

[0079] FIG. 3B further illustrates the somatic-variant calling system 106 performing an optional act 322 of determining that a pathogenicity score of the candidate variant satisfies a threshold pathogenicity score. In some implementations, the somatic-variant calling system 106 generates pathogenicity scores for candidate variants. The pathogenicity scores comprise numerical values indicating the likelihood that a variant is pathogenic. As described above, in some examples, higher scores (e.g., closer to 1) suggest a greater probability of disease association. The somatic-variant calling system 106 can generate a pathogenicity score for the candidate variant and determine if the pathogenicity score satisfies a threshold pathogenicity score. For instance, in some examples, the somatic-variant calling system 106 determines that a candidate variant is more likely somatic if the pathogenicity score of the candidate variant is above a mean, median, or mode pathogenicity score of potential variants across a gene. For example, some machine-learning models determine pathogenicity scores for each potential nucleotide variant within a gene. The somatic-variant calling system 106 may determine that if, on average, variants in a particular gene score lower for pathogenicity, then a candidate variant scoring higher than the particular gene’s average pathogenicity score for variants can be considered potentially pathogenic. FIG. 5 and the corresponding discussion further detail the somatic-variant calling system 106 determining that the pathogenicity score of the candidate variant satisfies a threshold pathogenicity score in accordance with one or more embodiments of the present disclosure.

[0080] In some examples, evaluations performed as part of the act 320 and the optional act 322 fail to detect somatic variants. For instance, some true somatic variants may have VAFs that fail to satisfy the one or more threshold VAFs. Furthermore, a true somatic variant may also fail to satisfy the threshold pathogenicity score. As mentioned previously, the somatic-variant calling system 106 performs a number of evaluations to determine whether a candidate variant is a somatic variant. In some examples, the somatic-variant calling system 106 performs additional “rescue evaluations” to identify somatic variants that were missed or not detected in initial evaluations. The following paragraphs describe the somatic-variant calling system 106 performing the following rescue evaluations: determining that the candidate variant segregates into one parental haplotype and determining that the VAF of the candidate variant deviates from VAFs of a subset of local germline variants.

[0081] As illustrated in FIG. 3B, for example, the somatic-variant calling system 106 can further perform an act 324 of determining, from phased nucleotide reads, that the candidate variant segregates into one parental haplotype. Generally, the somatic-variant calling system 106 determines that somatic variants are likely to exist on some but not all haplotype reads. More specifically, somatic variants are likely to segregate into a part of one parental haplotype. For example, and as shown in FIG. 3B, the somatic-variant calling system 106 can identify a germline variant 330 that exists in approximately half of observed nucleotide reads.

[0082] Because the germline variant 330 is heterozygous, the somatic-variant calling system 106 can use the germline variant 330 and / or other germline variants to phase the nucleotide reads according to paternal haplotypes and maternal haplotypes. For example, reads including the germline variant 330 belong to one parental haplotype, and reads excluding the germline variant 330 belong to the other parental haplotype. The somatic-variant calling system 106 determines that a somatic variant is likely to present within a single parental haplotype. For instance, the somatic-variant calling system 106 determines that a candidate variant 332 presents within a subset of phased nucleotide reads for one parental haplotype. In contrast, the somatic-variant calling system 106 determines that a candidate variant 334 is likely an artefact as it presents or exhibits on nucleotide reads belonging to both parental haplotypes. FIG. 6 and the corresponding discussion further detail the somatic-variant calling system 106 determining that the candidate variant segregates into one parental haplotype in accordance with one or more embodiments of the present disclosure.

[0083] Turning now to FIG. 3C, the somatic-variant calling system 106 performs another rescue evaluation that is independent of the previous rescue evaluation. In particular, FIG. 3C illustrates the somatic-variant calling system 106 performing an act 326 of determining that the VAF of the candidate variant deviates from VAFs of local germline variants. More specifically, the somatic-variant calling system 106 determines that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, where the subset of germline variants is located within a threshold number of nucleobases (e.g., within 300 bases; 1,000 bases; 10,000 bases; 1,000,000 bases; or 10,000,000 bases) of the candidate variant.

[0084] As indicated above, in some cases, the somatic-variant calling system 106 determines that somatic variants likely have lower VAFs than surrounding or local germline variants. More particularly, the somatic-variant calling system 106 can determine that somatic variants are more likely to appear later in life and thus would only occupy a fraction of reads relative to neighboring germline variants. For example, and as shown in FIG. 3C, the somatic-variant calling system 106 determines that a candidate variant 338 is likely a somatic variant because the VAF of the candidate variant 338 is lower than and deviates from VAFs of local germline variants 336a and 336b. FIG. 7 and the corresponding discussion detail the somatic-variant calling system 106 determining that the VAF of the candidate variant deviates from VAFs of local germline variants in accordance with one or more implementations of the present disclosure.

[0085] As further shown in FIG. 3C, the somatic-variant calling system 106 performs an act 328 of classifying the candidate variant as a somatic variant. In particular, the somatic-variant calling system 106 classifies the candidate variant as a somatic variant based on the VAF of the candidate variant satisfying the one or more threshold VAFs, the candidate variant segregating into the portion of the phased nucleotide reads, and / or the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold. FIG. 8 and the corresponding discussion further detail the somatic-variant calling system 106 classifying a candidate variant as a somatic variant in accordance with one or more embodiments of the present disclosure.
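
For illustration only, the classification of act 328 can be sketched as a combination of per-evaluation tags. The description states the criteria with "and / or" and defers details to FIG. 8, so the combination rule below is an assumption: the hypothetical `require_all` flag exposes both readings (any criterion suffices, or all criteria are required). The `Evidence` record and its field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical per-candidate evaluation results."""
    vaf_tag: bool                 # VAF satisfies the one or more threshold VAFs
    phasing_tag: bool             # segregates into part of one parental haplotype
    local_germline_vaf_tag: bool  # VAF deviates from local germline VAFs


def classify(evidence: Evidence, require_all: bool = False) -> str:
    """Classify a candidate variant from its tags.

    Assumption: with require_all=False any satisfied criterion is enough
    (the "or" reading); with require_all=True every criterion must hold
    (the "and" reading).
    """
    tags = (evidence.vaf_tag, evidence.phasing_tag,
            evidence.local_germline_vaf_tag)
    satisfied = all(tags) if require_all else any(tags)
    return "somatic" if satisfied else "not_somatic"


# A candidate that misses the VAF thresholds but is recovered by phasing.
rescued = Evidence(vaf_tag=False, phasing_tag=True,
                   local_germline_vaf_tag=False)
label_any = classify(rescued)
label_all = classify(rescued, require_all=True)
```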

[0086] As mentioned, in some implementations, the somatic-variant calling system 106 detects somatic variants based on determining that a candidate variant satisfies one or more threshold VAFs based on a VAF of the candidate variant. FIG. 4 illustrates the somatic-variant calling system 106 determining that a candidate variant satisfies one or more threshold VAFs in accordance with one or more embodiments of the present disclosure.

[0087] As shown in FIG. 4, the somatic-variant calling system 106 performs an act 402 of grouping candidate variants by variant ID. In some implementations, the somatic-variant calling system 106 identifies or assigns a unique variant ID to each candidate variant from the variant calls. For example, the somatic-variant calling system 106 can identify existing variant IDs as provided in different datasets (e.g., COSMIC ID, dbSNP ID, etc.). If no existing variant ID is present, the somatic-variant calling system 106 creates a unique identifier for each candidate variant based on its genomic coordinates. The somatic-variant calling system 106 can further sort candidate variants into groups based on variant ID. For example, group 406 comprises candidate variants with a first variant ID, and group 408 comprises candidate variants with a second variant ID.
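
A minimal sketch of act 402 follows, assuming each candidate variant is represented as a dictionary. The key names, the coordinate-based identifier format, and the placeholder identifier `rs0000001` are assumptions made for this example and are not taken from the disclosure.

```python
from collections import defaultdict


def variant_id(record: dict) -> str:
    """Return an existing variant ID if present; otherwise build one from
    genomic coordinates (the chrom:pos:ref>alt format is an assumption)."""
    existing = record.get("id")
    if existing:
        return existing
    return f"{record['chrom']}:{record['pos']}:{record['ref']}>{record['alt']}"


def group_by_variant_id(records):
    """Sort candidate variants into groups keyed by variant ID."""
    groups = defaultdict(list)
    for record in records:
        groups[variant_id(record)].append(record)
    return dict(groups)


# Two samples share a placeholder database ID; a third record has no ID
# and receives a coordinate-based identifier.
records = [
    {"id": "rs0000001", "chrom": "chr2", "pos": 500, "ref": "C", "alt": "T", "sample": "S1"},
    {"id": "rs0000001", "chrom": "chr2", "pos": 500, "ref": "C", "alt": "T", "sample": "S2"},
    {"id": None, "chrom": "chr4", "pos": 900, "ref": "G", "alt": "A", "sample": "S1"},
]
groups = group_by_variant_id(records)
```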

[0088] As further shown in FIG. 4, the somatic-variant calling system 106 performs an act 404 of determining whether the VAF of the candidate variants is less than (or equal to) a lower threshold VAF. The somatic-variant calling system 106 compares the number of nucleotide reads carrying a candidate variant with the total number of nucleotide reads for a genomic sample to generate the VAF. As mentioned previously, the somatic-variant calling system 106 determines that, in contrast to heterozygous germline variants that tend to have VAFs approximating 0.5, somatic variants tend to have relatively lower VAFs (e.g., below 0.30, 0.35, 0.40). Based on the VAF of the candidate variant falling below (or equaling) a lower threshold VAF, the somatic-variant calling system 106 determines that the candidate variant is a somatic variant.

[0089] In some implementations, the somatic-variant calling system 106 determines whether a candidate variant is a somatic variant based on a central tendency of a VAF (e.g., median VAF, mean VAF, or mode VAF) across a population of genomic samples. For example, the somatic-variant calling system 106 determines the median VAF for a candidate variant across a population of genomic samples. The somatic-variant calling system 106 compares the VAF for the candidate variant with the median VAF across such genomic samples to determine whether the candidate variant is a somatic variant. The somatic-variant calling system 106 determines that a candidate variant is likely a somatic variant based on the candidate VAF being less than the lower median VAF.

[0090] In addition or in the alternative to using a lower threshold VAF, in some implementations, the somatic-variant calling system 106 optionally performs the act 404 by determining whether the VAF of the candidate variant in a genomic sample is greater than an upper threshold VAF, such as a median VAF across genomic samples. In some examples, the somatic-variant calling system 106 determines an upper threshold VAF with a median of 0.65 across genomic samples in a database. Based on determining that a candidate variant has a VAF above (or equal to) the upper threshold VAF of 0.65, the somatic-variant calling system 106 determines that the candidate variant is likely a somatic variant. In some examples, the somatic-variant calling system 106 can determine different values for the upper threshold VAF (e.g., mean, median, or mode VAF of 0.55, 0.60, 0.65, 0.70, etc.). In some embodiments, the somatic-variant calling system 106 does not determine or utilize an upper threshold VAF. Because VAFs of 50-70% can be created by over-amplification of an allele in sequencing, a homozygous event, or a hemizygous event, the somatic-variant calling system 106 can optionally not use such an upper threshold VAF as a relevant threshold for a somatic variant.

[0091] As shown in FIG. 4, the somatic-variant calling system 106 can determine a lower threshold VAF (and an upper threshold VAF). For instance, the somatic-variant calling system 106 determines a lower threshold VAF of 0.35. The somatic-variant calling system 106 determines that candidate variants having a VAF below (or equal to) the lower threshold VAF with a median of 0.35 across samples are likely somatic variants. In some such cases, the somatic-variant calling system 106 applies a central-tendency-VAF tag to a candidate variant having a VAF that satisfies one or more of the lower (or upper) VAF thresholds and does not apply a central-tendency-VAF tag to another candidate variant having a VAF that fails to satisfy one or more of the lower (or upper) VAF thresholds.

[0092] For instance, and as illustrated in FIG. 4, the somatic-variant calling system 106 determines that a candidate variant 410 is likely a somatic variant as its VAF falls below the lower threshold VAF of 0.35. The somatic-variant calling system 106 determines that a candidate variant 412 and a candidate variant 414 are likely not somatic variants. In particular, the VAF of the candidate variant 412 approximates 0.5, and the VAF of the candidate variant 414 approximates 0.4, both of which are above the lower threshold VAF of 0.35. The somatic-variant calling system 106 can determine different values for the lower threshold VAF (e.g., mean, median, or mode VAF of 0.25, 0.30, 0.35, 0.40, etc.).
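
The VAF evaluation of act 404 can be sketched as follows. The VAF is computed as the number of reads carrying the variant divided by the total number of reads, and the 0.35 lower threshold and optional 0.65 upper threshold are the example values given above. The function names and the read counts in the usage lines are hypothetical.

```python
from statistics import median
from typing import Optional


def variant_allele_fraction(alt_reads: int, total_reads: int) -> float:
    """VAF = reads carrying the variant / total reads at the site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def satisfies_vaf_thresholds(vaf: float, lower: float = 0.35,
                             upper: Optional[float] = None) -> bool:
    """True when the VAF is at or below the lower threshold or, if an
    upper threshold is supplied, at or above that upper threshold."""
    if vaf <= lower:
        return True
    return upper is not None and vaf >= upper


def central_tendency_vaf(vafs_across_samples) -> float:
    """Median VAF for one variant ID across a population of samples."""
    return median(vafs_across_samples)


# Made-up read counts mirroring candidate variants 410, 412, and 414.
vaf_410 = variant_allele_fraction(20, 100)   # 0.2, at or below 0.35
vaf_412 = variant_allele_fraction(50, 100)   # 0.5, typical heterozygous germline
vaf_414 = variant_allele_fraction(40, 100)   # 0.4, above 0.35
tags = [satisfies_vaf_thresholds(v) for v in (vaf_410, vaf_412, vaf_414)]
```

Leaving `upper` unset by default reflects the embodiments in which no upper threshold VAF is used.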

[0093] As mentioned, in some implementations, the somatic-variant calling system 106 can evaluate candidate variants by optionally determining whether the pathogenicity score of the candidate variant satisfies a threshold pathogenicity score. FIG. 5 illustrates the somatic-variant calling system 106 determining whether a candidate variant is a somatic variant based on a threshold pathogenicity score in accordance with one or more embodiments of the present disclosure.

[0094] As shown in FIG. 5, the somatic-variant calling system 106 performs an act 502 of determining a threshold pathogenicity score. Pathogenicity scores are often designed to assess whether a mutation causes a gene to lose function, which can be highly effective for identifying pathogenic germline variants. Some germline variants result in a loss of function, making pathogenicity scores suitable for germline variant analysis. In contrast, some somatic variants often lead to a gain of function, such as increased enzyme activity or altered binding regions, which traditional pathogenicity scores can be less adept at detecting. For example, some somatic variants that drive cell replication and growth may have overall low pathogenicity scores despite their clinical significance. Accordingly, rather than relying on a standard pathogenicity threshold to detect somatic variants, in some embodiments, the somatic-variant calling system 106 examines candidate variants with pathogenicity scores above a central tendency of a pathogenicity score (e.g., a mean, median, or mode pathogenicity score) for the gene, ensuring that potentially impactful somatic variants are not overlooked.

[0095] As mentioned, the somatic-variant calling system 106 performs the act 502 of determining a threshold pathogenicity score based on a mean, median, or mode pathogenicity score for a gene. In some embodiments, for example, the somatic-variant calling system 106 determines a median pathogenicity score per gene and uses the median pathogenicity score per gene as the threshold pathogenicity score. The somatic-variant calling system 106 identifies the gene corresponding with the candidate variant.

[0096] As further shown in FIG. 5, the somatic-variant calling system 106 identifies a gene 506 in which a candidate variant exists. The somatic-variant calling system 106 further accesses (e.g., from a database or look-up table) or determines pathogenicity scores for the gene across a population of genomic samples. The somatic-variant calling system 106 accesses or determines a median score based on the pathogenicity score for the gene across the population (e.g., multiple genomic samples). Because the somatic-variant calling system 106 determines the threshold pathogenicity score based on a mean, median, or mode of pathogenicity scores for the gene, the somatic-variant calling system 106 can use different types of pathogenicity scores. For example, in some implementations, the somatic-variant calling system 106 uses Primate Artificial Intelligence (PrimateAI), Primate Artificial Intelligence 3D (PrimateAI-3D), Combined Annotation-Dependent Depletion (CADD), Rare Exome Variant Ensemble Learner (REVEL), ClinPred, Mendelian Clinically Applicable Pathogenicity (M-CAP), and / or other pathogenicity scoring systems.

[0097] As further shown in FIG. 5, the somatic-variant calling system 106 performs an act 504 of determining that the pathogenicity score satisfies the threshold pathogenicity score. In particular, the somatic-variant calling system 106 compares a pathogenicity score for the candidate variant with the threshold pathogenicity score. For example, consistent with the act 502 described above, based on determining that the pathogenicity score for the candidate variant exceeds (or equals) the mean, median, or mode pathogenicity score per gene, the somatic-variant calling system 106 determines that the candidate variant is more likely a somatic variant. In some such cases, the somatic-variant calling system 106 applies a pathogenicity tag to a candidate variant having a pathogenicity score that satisfies a threshold pathogenicity score and does not apply a pathogenicity tag to another candidate variant having a pathogenicity score that fails to satisfy the threshold pathogenicity score.
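
The acts 502 and 504 can be sketched as a per-gene median comparison. This example assumes, following the description of the act 502, that a score satisfies the threshold when it is at or above the median pathogenicity score for the gene. The gene label `GENE_A`, the score values, and the function names are invented for illustration and do not come from any particular scoring system.

```python
from statistics import median


def gene_threshold(scores_by_gene: dict, gene: str) -> float:
    """Act 502: use the median pathogenicity score for the gene as the
    threshold (a mean or mode could be substituted)."""
    return median(scores_by_gene[gene])


def satisfies_pathogenicity_threshold(score: float, gene: str,
                                      scores_by_gene: dict) -> bool:
    """Act 504 (assumed direction): the score satisfies the threshold
    when it is at or above the per-gene median."""
    return score >= gene_threshold(scores_by_gene, gene)


# Made-up scores for a hypothetical gene whose variants tend to score low.
scores_by_gene = {"GENE_A": [0.2, 0.3, 0.4, 0.5, 0.9]}
threshold = gene_threshold(scores_by_gene, "GENE_A")
tagged = satisfies_pathogenicity_threshold(0.45, "GENE_A", scores_by_gene)
untagged = satisfies_pathogenicity_threshold(0.25, "GENE_A", scores_by_gene)
```

A candidate scoring 0.45 would fall below many fixed pathogenicity cutoffs, yet it receives a pathogenicity tag here because it exceeds the gene's own median.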

[0098] As mentioned, the somatic-variant calling system 106 can determine whether a candidate variant is a somatic variant based on phased nucleotide reads. FIG. 6 illustrates the somatic-variant calling system 106 determining, from phased nucleotide reads, that the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype in accordance with one or more embodiments of the present disclosure. In particular, the somatic-variant calling system 106 determines that true somatic variants are likely to exist on some but not all haplotype reads. Accordingly, the somatic-variant calling system 106 uses phased nucleotide reads to identify likely somatic variants.

[0099] As shown in FIG. 6, the somatic-variant calling system 106 performs an act 602 of identifying a set of nucleotide reads exhibiting one or more germline variants. For example, and as shown, the somatic-variant calling system 106 identifies germline variants within nucleotide reads. In some implementations, the somatic-variant calling system 106 accesses genomic datasets to identify candidate variants of interest residing in genes or genomic regions that result in CH. Example datasets include, but are not limited to, UK Biobank, Genomics England (GEL), Vanderbilt University Medical Center (VUMC), and Nashville Biosciences datasets. For example, and as shown in FIG. 6, the somatic-variant calling system 106 identifies nucleotide reads 610 that cover or include a candidate variant within a genomic region identified from a genomic dataset.

[0100] In some embodiments, the somatic-variant calling system 106 determines whether the nucleotide reads 610 also cover or include a common, heterozygous germline variant 608. To illustrate, the somatic-variant calling system 106 can determine that the heterozygous germline variant 608 satisfies a threshold number of occurrences in a germline variant database. For instance, the heterozygous germline variant 608 may comprise a germline SNP that occurs more than a threshold number of times (e.g., 5, 10, 15) within the Genome Aggregation Database (gnomAD), 1000 Genomes Project, Database of Single Nucleotide Polymorphisms (dbSNP), ClinVar, Exome Aggregation Consortium (ExAC), or other genomic databases.

[0101] As further illustrated in FIG. 6, the somatic-variant calling system 106 performs an act 604 of determining a first subset of nucleotide reads corresponding with a first parental haplotype and a second subset of nucleotide reads corresponding with a second parental haplotype. The somatic-variant calling system 106 phases the nucleotide reads 610 into a paternal haplotype and a maternal haplotype. In particular, the somatic-variant calling system 106 divides the nucleotide reads 610 into parental haplotypes based on nucleotide reads that have the heterozygous germline variant 608 and nucleotide reads that do not have the heterozygous germline variant 608. For example, and as shown in FIG. 6, the somatic-variant calling system 106 determines a first subset of nucleotide reads having the heterozygous germline variant 608 corresponding with the first parental haplotype and a second subset of nucleotide reads without the heterozygous germline variant 608 corresponding with the second parental haplotype.

[0102] After performing the act 604, as further shown in FIG. 6, the somatic-variant calling system 106 performs an act 606 of determining that the candidate variant segregates into a portion of the first subset or a portion of the second subset. In particular, the somatic-variant calling system 106 determines that the candidate variant segregates into the portion of the first subset of nucleotide reads or the portion of the second subset of nucleotide reads. As mentioned previously, the somatic-variant calling system 106 determines that, because such variants are introduced in a post-zygotic event or otherwise occur later in life, somatic variants likely exist on a portion of and not all nucleotide reads within a haplotype.

[0103] As shown in FIG. 6, the somatic-variant calling system 106 determines that a candidate variant is likely somatic based on the candidate variant segregating into a portion of the first subset of nucleotide reads corresponding with a first parental haplotype. In other words, the somatic-variant calling system 106 determines that a candidate variant is a somatic variant if the candidate variant exists in only one of the two parental haplotypes and if the candidate variant does not exist on all nucleotide reads within that parental haplotype. In some such cases, the somatic-variant calling system 106 applies a phasing tag to a candidate variant segregating into a portion of phased nucleotide reads according to one parental haplotype and does not apply a phasing tag to another candidate variant that does not segregate into a portion of phased nucleotide reads according to one parental haplotype.

[0104] For example, and as illustrated in FIG. 6, the somatic-variant calling system 106 determines that candidate variant 614 is likely a somatic variant and, in some cases, applies a phasing tag. In particular, the somatic-variant calling system 106 identifies a first subset of nucleotide reads 618 containing a germline variant 612. The first subset of nucleotide reads 618 corresponds to a first parental haplotype. The somatic-variant calling system 106 also identifies a second subset of nucleotide reads 620 that does not contain the germline variant 612. The somatic-variant calling system 106 determines that the candidate variant 614 is likely somatic because the candidate variant 614 exists on only a portion of, and not all of, the first subset of nucleotide reads 618 corresponding with the first parental haplotype. In contrast, the somatic-variant calling system 106 might identify an artefact 616 because the artefact 616 exists on both the first subset of nucleotide reads 618 and the second subset of nucleotide reads 620.
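
The phasing evaluation of acts 604 and 606 can be sketched as follows, assuming each nucleotide read is reduced to two flags. The `Read` record and function names are hypothetical, and the strict requirement of zero supporting reads on the other haplotype is an idealization for this example; an implementation handling real data would likely tolerate a small number of erroneous reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """Hypothetical read summary used only for this sketch."""
    has_germline: bool   # carries the heterozygous germline variant
    has_candidate: bool  # carries the candidate variant


def phase_reads(reads):
    """Act 604: split reads into two parental haplotypes by presence or
    absence of the heterozygous germline variant."""
    first = [r for r in reads if r.has_germline]
    second = [r for r in reads if not r.has_germline]
    return first, second


def segregates_into_one_haplotype(reads) -> bool:
    """Act 606: true when the candidate appears on some, but not all,
    reads of exactly one haplotype and on no reads of the other."""
    first, second = phase_reads(reads)
    n_first = sum(r.has_candidate for r in first)
    n_second = sum(r.has_candidate for r in second)
    on_part_of_first = 0 < n_first < len(first) and n_second == 0
    on_part_of_second = 0 < n_second < len(second) and n_first == 0
    return on_part_of_first or on_part_of_second


# Candidate on half of the first haplotype only: consistent with somatic.
somatic_like = ([Read(True, True)] * 2 + [Read(True, False)] * 2
                + [Read(False, False)] * 4)
# Candidate on reads from both haplotypes: consistent with an artefact.
artefact_like = [Read(True, True), Read(True, False),
                 Read(False, True), Read(False, False)]
# Candidate on every read of one haplotype: consistent with germline.
germline_like = [Read(True, True)] * 3 + [Read(False, False)] * 3

phasing_tags = [segregates_into_one_haplotype(x)
                for x in (somatic_like, artefact_like, germline_like)]
```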

[0105] As described previously, the somatic-variant calling system 106 can determine that a candidate variant is a germline or somatic variant based on VAFs of neighboring germline variants. FIG. 7 illustrates the somatic-variant calling system 106 determining that a candidate variant is a somatic variant based on determining that the VAF of the candidate variant deviates from VAFs of neighboring germline variants in accordance with one or more embodiments of the present disclosure.

[0106] Application of hard VAF thresholds to identify somatic variants can be problematic, especially where allelic imbalances exist within a genomic sample. For instance, allelic imbalances may arise during the polymerase chain reaction (PCR) amplification process, leading to one allele being sequenced more than another. To mitigate negative impacts of sequencing artefacts on accurate somatic variant identification, the somatic-variant calling system 106 assesses VAFs of common, heterozygous, germline variants that are within a threshold number of nucleobases of the candidate variant. For example, the somatic-variant calling system 106 can identify germline variants within a 1 megabase (Mb) (or 1 million base pairs) region of a somatic variant. The threshold number of nucleobases can be another number, however, such as 300; 1,000; 10,000; or 10,000,000 base pairs. As shown in FIG. 7, the somatic-variant calling system 106 identifies a germline variant 708a and a germline variant 708b located within a threshold number of nucleobases of candidate variant 710. The somatic-variant calling system 106 can determine the threshold number of nucleobases to comprise a region of one thousand base pairs, one hundred thousand base pairs, etc.
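
Selecting the subset of local germline variants can be sketched as a simple positional window. This example assumes all positions lie on the same chromosome and interprets the threshold number of nucleobases as a maximum distance on either side of the candidate variant; both the interpretation and the function name are assumptions for illustration.

```python
def local_germline_variants(candidate_pos: int, germline_positions,
                            window: int = 1_000_000):
    """Return germline variant positions within `window` bases of the
    candidate variant (same chromosome assumed). The 1 Mb default is one
    of the example values from the description."""
    return [pos for pos in germline_positions
            if abs(pos - candidate_pos) <= window]


# With a candidate at position 5,000,000 and a 1 Mb window, the variant at
# 7,500,000 is excluded from the local subset.
nearby = local_germline_variants(5_000_000, [4_200_000, 5_900_000, 7_500_000])
narrow = local_germline_variants(5_000_000, [4_999_800, 5_900_000], window=300)
```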

[0107] Generally, the somatic-variant calling system 106 determines that a candidate variant comprises a somatic variant if the VAF of the candidate variant deviates significantly from VAFs of neighboring germline variants. The somatic-variant calling system 106 can use various methods to determine that a VAF of the candidate variant deviates significantly from VAFs of local or neighboring germline variants. In some embodiments, the somatic-variant calling system 106 uses a density function (e.g., a cumulative density function) to determine whether the VAF of the candidate variant deviates significantly from local germline VAFs. The series of acts illustrated in FIG. 7 portray the somatic-variant calling system 106 using a cumulative density function in accordance with one or more embodiments of the present disclosure. Additionally, or alternatively, the somatic-variant calling system 106 uses different functions to determine that a candidate variant VAF deviates significantly from local germline VAFs. For example, the somatic-variant calling system 106 can use Bayesian inference, Z-score calculation, T-test, Chi-square goodness of fit test, or other statistical functions and methods to determine if a candidate variant’s VAF is significantly different from the expected range for local germline variant VAFs.

[0108] As shown in FIG. 7, the somatic-variant calling system 106 performs the act 702 of determining a distribution of VAFs of a subset of germline variants. More particularly, the subset of germline variants comprises the germline variants 708a and 708b. The somatic-variant calling system 106 determines the probability density of VAFs of the germline variants 708a and 708b across a population of genomic samples.

[0109] As further shown in FIG. 7, the somatic-variant calling system 106 performs an act 704 of determining a probability of observing the VAF of the candidate variant given the distribution of VAFs. As illustrated, the somatic-variant calling system 106 determines a probability of observing a VAF of the candidate variant 712 given the distribution of VAFs of the subset of germline variants 714. In some examples, as part of performing the act 704, the probability of observing the VAF of the candidate variant given the distribution of VAFs for the subset of germline variants comprises a p-value. The p-value represents the likelihood of observing a VAF as extreme as the candidate variant VAF given the distribution of VAFs of the subset of germline variants 714.

[0110] The somatic-variant calling system 106 further performs an act 706 of determining that the probability of observing the VAF is within the deviation threshold. For example, the somatic-variant calling system 106 determines whether the p-value falls within a deviation threshold. For instance, the somatic-variant calling system 106 can determine the deviation threshold to be a p-value of 0.01, 0.05, 0.1, etc. Low p-values often suggest that the candidate variant VAF is unlikely to come from the distribution of VAFs of the subset of germline variants 714, implying statistical significance and supporting a potential deviation, such as a somatic variant. In some such cases, the somatic-variant calling system 106 applies a local-germline-VAF tag to a candidate variant having a VAF that deviates from VAFs of a subset of germline variants within a deviation threshold and does not apply a local-germline-VAF tag to another candidate variant having a VAF that fails to deviate from VAFs of a subset of germline variants within the deviation threshold.
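By way of non-limiting illustration, the acts 702, 704, and 706 can be sketched as follows. This sketch assumes, purely for the example, that the VAFs of the local germline variants are approximately normally distributed within the sample and that a two-sided p-value is computed from the normal cumulative distribution function; the disclosure does not limit the density function to this choice. The function names and the default threshold of 0.05 are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def germline_vaf_p_value(candidate_vaf, germline_vafs):
    """Two-sided p-value of the candidate VAF under a normal fit.

    Assumption: local germline VAFs are modeled as normally distributed.
    """
    if len(germline_vafs) < 2:
        raise ValueError("need at least two germline VAFs to estimate spread")
    sigma = stdev(germline_vafs)
    if sigma == 0:
        raise ValueError("germline VAFs have zero spread; CDF is undefined")
    dist = NormalDist(mean(germline_vafs), sigma)   # act 702: distribution
    cdf = dist.cdf(candidate_vaf)                   # act 704: probability
    return 2 * min(cdf, 1 - cdf)


def has_local_germline_vaf_tag(candidate_vaf, germline_vafs,
                               deviation_threshold=0.05):
    """Act 706: tag the candidate when the p-value is within the threshold."""
    return germline_vaf_p_value(candidate_vaf, germline_vafs) <= deviation_threshold


germline = [0.46, 0.50, 0.52, 0.48, 0.54]
tag_low_vaf = has_local_germline_vaf_tag(0.12, germline)    # far from ~0.5
tag_near_half = has_local_germline_vaf_tag(0.49, germline)  # germline-like
```

In this example, a candidate VAF of 0.12 lies many standard deviations from the local germline VAFs and receives the tag, whereas a candidate VAF of 0.49 does not.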

[0111] As mentioned, the somatic-variant calling system 106 can evaluate candidate variants using a number of methods to determine whether they are somatic. FIG. 8 illustrates an example decision flowchart by which the somatic-variant calling system 106 determines a candidate variant is a somatic variant in accordance with one or more embodiments of the present disclosure. While the decision flowchart illustrated in FIG. 8 portrays the somatic-variant calling system 106 performing evaluations in sequence, the somatic-variant calling system 106 can also perform the evaluations in parallel to improve processing efficiency. More particularly, the somatic-variant calling system 106 can simultaneously evaluate a candidate variant for one or more of the following tags: (1) a phasing tag, (2) a local-germline-VAF tag, (3) a central-tendency-VAF tag, and (4) a pathogenicity tag. Furthermore, the somatic-variant calling system 106 can perform one or more of the evaluations illustrated in FIG. 8 in any order.

[0112] As illustrated in FIG. 8, the somatic-variant calling system 106 may begin at the start 802 by generating variant calls for a target genomic region of a genomic sample. In some implementations, the somatic-variant calling system 106 uses a variant caller to analyze sequencing data to identify differences between a genomic sample and a reference genome to generate such variant calls. For instance, the somatic-variant calling system 106 may utilize a variant caller, such as Illumina DRAGEN. After generating variant calls as part of the start 802, the somatic-variant calling system 106 can proceed to either an evaluation 804 of determining whether the candidate variant segregates into one parental haplotype or an evaluation 806 of determining whether the VAF of the candidate variant deviates from the VAFs of local germline variants.

[0113] As illustrated in FIG. 8, for example, the somatic-variant calling system 106 can perform the evaluation 804 of determining whether the candidate variant segregates into one parental haplotype. More specifically, the somatic-variant calling system 106 determines whether the candidate variant segregates into a portion of phased nucleotide reads for one parental haplotype. Based on determining that the candidate variant does segregate into a portion of the phased nucleotide reads for one parental haplotype, the somatic-variant calling system 106 performs the act 808 of classifying the candidate variant as a somatic variant. In some such cases, the somatic-variant calling system 106 applies a phasing tag to such a candidate variant segregating into a portion of phased nucleotide reads according to one parental haplotype.
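By way of non-limiting illustration, the evaluation 804 can be sketched as follows. The sketch assumes that each phased nucleotide read covering the candidate position has already been assigned to haplotype 1 or haplotype 2, and it treats the candidate as segregating into a portion of one parental haplotype when the alternate allele appears on exactly one haplotype and on only some (not all) of that haplotype's reads. The function name, the input layout, and the `min_alt_reads` parameter are hypothetical; practical implementations would also need to tolerate sequencing errors.

```python
from collections import Counter


def segregates_into_one_haplotype(reads, min_alt_reads=2):
    """Check whether alt-carrying reads fall on a portion of one haplotype.

    reads: iterable of (haplotype, carries_alt) pairs, haplotype in {1, 2}.
    Assumption: a variant on every read of one haplotype looks like a
    heterozygous germline variant, so it does not satisfy this check.
    """
    total = Counter()
    alt = Counter()
    for haplotype, carries_alt in reads:
        total[haplotype] += 1
        if carries_alt:
            alt[haplotype] += 1
    haplotypes_with_alt = [h for h, n in alt.items() if n > 0]
    if len(haplotypes_with_alt) != 1:
        return False  # alt allele on neither or on both haplotypes
    h = haplotypes_with_alt[0]
    return min_alt_reads <= alt[h] < total[h]


somatic_like = [(1, True)] * 3 + [(1, False)] * 7 + [(2, False)] * 10
germline_like = [(1, True)] * 10 + [(2, False)] * 10
on_both = ([(1, True)] * 3 + [(1, False)] * 7
           + [(2, True)] * 2 + [(2, False)] * 8)
```

Here, only `somatic_like` satisfies the check, because the alternate allele is confined to a subset of the reads phased to haplotype 1.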

[0114] If the somatic-variant calling system 106 determines that the candidate variant fails the evaluation 804, in some embodiments, the somatic-variant calling system 106 performs an evaluation 806 of determining whether the VAF of the candidate variant deviates from the VAFs of local germline variants. In particular, the somatic-variant calling system 106 determines whether the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, where the subset of germline variants is located within a threshold number of nucleobases of the candidate variant. Based on determining that the candidate variant satisfies the evaluation 806, the somatic-variant calling system 106 performs the act 808 of classifying the candidate variant as a somatic variant. In some such cases, the somatic-variant calling system 106 applies a local-germline-VAF tag to such a candidate variant having a VAF that deviates from VAFs of a subset of germline variants within a deviation threshold. As mentioned, the somatic-variant calling system 106 may perform the evaluation 806 before performing the evaluation 804. In such a case, if the somatic-variant calling system 106 determines that the candidate variant does not satisfy the evaluation 806, the somatic-variant calling system 106 can perform the evaluation 804.

[0115] If the somatic-variant calling system 106 determines that the candidate variant fails the evaluation 804, in some embodiments, the somatic-variant calling system 106 performs the evaluation 806 of determining whether the VAF of the candidate variant deviates from the VAFs of local germline variants. When performing the evaluation 806, the somatic-variant calling system 106 determines whether the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, where the subset of germline variants is located within a threshold number of nucleobases of the candidate variant. As shown in FIG. 8, based on determining that the candidate variant satisfies the evaluation 806, in some cases, the somatic-variant calling system 106 performs the act 808 of classifying the candidate variant as a somatic variant. In some such cases, the somatic-variant calling system 106 applies a local-germline-VAF tag to such a candidate variant having a VAF that deviates from VAFs of a subset of germline variants within a deviation threshold.

[0116] As mentioned, the somatic-variant calling system 106 can perform the evaluations 804, 806, 810, and 812 in different orders. For example, in some cases, the somatic-variant calling system 106 performs the evaluation 806 of determining whether the VAF of the candidate variant deviates from the VAFs of local germline variants before the evaluation 804 of determining whether the candidate variant segregates into one parental haplotype, and performs the evaluation 804 only if the candidate variant fails the evaluation 806. Accordingly, the somatic-variant calling system 106 may perform the evaluation 806 before performing the evaluation 804 and, when the candidate variant satisfies the evaluation 806, classify the candidate variant as a somatic variant.

[0117] In some embodiments, the somatic-variant calling system 106 performs one or both of the evaluation 804 and the evaluation 806. For instance, in some implementations, upon determining that the candidate variant satisfies one of the evaluation 804 or the evaluation 806, the somatic-variant calling system 106 classifies the candidate variant as a somatic variant. In another example, even if the somatic-variant calling system 106 determines that the candidate variant satisfies one of the evaluation 804 or the evaluation 806, the somatic-variant calling system 106 performs the other evaluations 810 or 812. If a candidate variant satisfies both the evaluation 804 and the evaluation 806, the somatic-variant calling system 106 can increase a somatic variant confidence value (e.g., QUAL) indicating a likelihood that the determination that the candidate variant is a somatic variant is correct. If a candidate variant satisfies each of the evaluations 804, 806, 810, and 812, the somatic-variant calling system 106 can further increase such a somatic variant confidence value for the candidate variant.

[0118] If the somatic-variant calling system 106 determines that the candidate variant fails both the evaluations 804 and 806, the somatic-variant calling system 106 can proceed to perform optional evaluations 810 and 812. In some embodiments, based on performing the act 808 of classifying the candidate variant as a somatic variant based on one or both of the evaluation 804 or the evaluation 806, the somatic-variant calling system 106 does not continue to perform the evaluation 810 and the evaluation 812 and instead performs the act 814 of not classifying the candidate variant as a somatic variant. In other embodiments, the somatic-variant calling system 106 performs the evaluation 810 and the evaluation 812 even after classifying the candidate variant as a somatic variant based on the evaluation 804 and the evaluation 806, for example, to improve a somatic variant confidence value.

[0119] As further illustrated in FIG. 8, the somatic-variant calling system 106 can perform the optional evaluation 810 of determining if the candidate variant satisfies one or more threshold VAFs. In some embodiments, as part of the evaluation 810, the somatic-variant calling system 106 determines whether a VAF of the candidate variant satisfies a lower threshold VAF, such as a mean, median, or mode VAF across multiple genomic samples. As noted above, the somatic-variant calling system 106 also optionally determines whether a VAF of the candidate variant satisfies an upper threshold VAF. Based on determining that the VAF of the candidate variant satisfies one or more threshold VAFs, the somatic-variant calling system 106 proceeds to perform the evaluation 812. In some such cases, the somatic-variant calling system 106 applies a central-tendency-VAF tag to the candidate variant having a VAF that satisfies one or more of the threshold VAFs.

[0120] In some embodiments, and as illustrated in FIG. 8, if the somatic-variant calling system 106 determines that the candidate variant passes (or in some cases fails) the evaluation 810, the somatic-variant calling system 106 performs an optional evaluation 812 of determining if a pathogenicity score of the candidate variant satisfies a threshold pathogenicity score. Based on determining that the candidate variant satisfies the threshold pathogenicity score and / or the one or more threshold VAFs, the somatic-variant calling system 106 performs the act 808 of classifying the candidate variant as a somatic variant. In some such cases, the somatic-variant calling system 106 applies a pathogenicity tag to the candidate variant having a pathogenicity score that satisfies a threshold pathogenicity score. As shown in FIG. 8, even if the somatic-variant calling system 106 determines that a candidate variant fails the evaluation 804 and the evaluation 806, the somatic-variant calling system 106 can determine that the candidate variant qualifies as a somatic variant based on the candidate variant passing both the evaluation 810 and the evaluation 812.
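By way of non-limiting illustration, one embodiment of the decision flow of FIG. 8 can be sketched as follows. The sketch assumes that the outcome of each of the evaluations 804, 806, 810, and 812 has already been computed as a Boolean value, that satisfying either of the evaluations 804 or 806 suffices for the act 808, and that a candidate failing both can still be classified as somatic only by passing both of the evaluations 810 and 812. The `Evaluations` record and `classify` function are hypothetical names; other embodiments described herein combine the evaluations differently.

```python
from dataclasses import dataclass


@dataclass
class Evaluations:
    """Hypothetical container for precomputed evaluation outcomes."""
    segregates_one_haplotype: bool       # evaluation 804
    deviates_from_local_germline: bool   # evaluation 806
    satisfies_threshold_vafs: bool       # evaluation 810
    satisfies_pathogenicity: bool        # evaluation 812


def classify(ev):
    """Return (is_somatic, tags) for one candidate variant."""
    tags = []
    if ev.segregates_one_haplotype:
        tags.append("phasing")
    if ev.deviates_from_local_germline:
        tags.append("local-germline-VAF")
    if ev.satisfies_threshold_vafs:
        tags.append("central-tendency-VAF")
    if ev.satisfies_pathogenicity:
        tags.append("pathogenicity")

    # Assumption: 804 or 806 alone leads to act 808.
    primary = ev.segregates_one_haplotype or ev.deviates_from_local_germline
    # Assumption: otherwise both 810 and 812 must pass to reach act 808.
    fallback = ev.satisfies_threshold_vafs and ev.satisfies_pathogenicity
    return primary or fallback, tags


phased_only = classify(Evaluations(True, False, False, False))
fallback_only = classify(Evaluations(False, False, True, True))
rejected = classify(Evaluations(False, False, True, False))
```

Because the tags are collected independently of the classification, the number of tags applied to a candidate could serve as one input to a somatic variant confidence value, although this sketch does not model that value.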

[0121] As mentioned, the somatic-variant calling system 106 can perform the evaluations 804, 806, 810, and 812 in different orders. For example, in some cases, the somatic-variant calling system 106 performs the evaluation 812 of determining if a pathogenicity score of the candidate variant satisfies a threshold pathogenicity score before the evaluation 810 of determining if the candidate variant satisfies one or more threshold VAFs, and performs the evaluation 810 only if the candidate variant fails the evaluation 812. Accordingly, the somatic-variant calling system 106 may perform the evaluation 812 before performing the evaluation 810 and, when the candidate variant satisfies the evaluation 810 or 812, classify the candidate variant as a somatic variant.

[0122] In some embodiments, based on performing the act 808 of classifying the candidate variant as a somatic variant, the somatic-variant calling system 106 can provide, for display via a user interface of a client device, an indicator that a subject corresponding with the genomic sample has clonal hematopoiesis of indeterminate potential (CHIP). More specifically, CHIP is a condition in which blood stem cells acquire somatic variants that lead to the expansion of a genetically distinct population, or clone, of blood cells. While individuals with CHIP typically have no immediate symptoms or health problems, CHIP is associated with an increased risk of certain conditions, including hematologic cancers (like leukemia), cardiovascular disease, impaired immune system function, and infections. In some examples, the genomic sample is derived from a fluid sample such as a blood sample or a saliva sample.

[0123] In addition to providing an indicator that a subject has CHIP, the somatic-variant calling system 106 can identify one or more biochemical compounds for treatment. Because the somatic-variant calling system 106 can accurately identify even low concentrations of somatic variants, the somatic-variant calling system 106 can also identify biochemical compounds, such as prophylactic medications, to treat subjects having CHIP. For instance, the somatic-variant calling system 106 can recommend prophylactic antibiotics because individuals with CHIP may be at increased risk of infection.

[0124] In some implementations, the somatic-variant calling system 106 can output a variant call file that includes a classification of the candidate variant as a somatic variant or germline variant. In particular, the somatic-variant calling system 106 can generate a variant call format (VCF) file that includes annotations that label the candidate variant as somatic or not somatic. In some examples, the somatic-variant calling system 106 can further include annotations indicating that a candidate variant is a germline variant. This disclosure briefly describes a pipeline for generating such a VCF above with respect to FIG. 1.
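By way of non-limiting illustration, a single annotated VCF data line could be assembled as follows. The sketch uses the `SOMATIC` INFO flag, which the VCF specification reserves for somatic mutations; the `SOMATIC_TAGS` INFO key, the fixed `PASS` filter value, and the function name are hypothetical and are not taken from this disclosure or from any particular variant caller.

```python
def vcf_record(chrom, pos, ref, alt, qual, is_somatic, tags=()):
    """Format one tab-delimited VCF data line with somatic annotations.

    Assumptions: ID is left as ".", FILTER is always "PASS", and the
    hypothetical SOMATIC_TAGS key lists the tags applied to the variant.
    """
    info = []
    if is_somatic:
        info.append("SOMATIC")
    if tags:
        info.append("SOMATIC_TAGS=" + ",".join(tags))
    info_field = ";".join(info) if info else "."
    return "\t".join(
        [chrom, str(pos), ".", ref, alt, str(qual), "PASS", info_field]
    )


somatic_line = vcf_record(
    "chr2", 25240000, "C", "T", 60, True, ["phasing", "local-germline-VAF"]
)
other_line = vcf_record("chr2", 25100000, "G", "A", 60, False)
```

A complete file would additionally declare each INFO key in the VCF header; that header is omitted here for brevity.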

[0125] As mentioned, the somatic-variant calling system 106 can more accurately identify somatic variants relative to existing systems. FIG. 9 illustrates a box plot portraying how the somatic-variant calling system 106 accurately identifies individuals that carry CH somatic variants as demonstrated by the ages of the identified carriers in accordance with one or more embodiments of the present disclosure.

[0126] CH is a disease that is more commonly found in older individuals. Thus, subjects who are CH somatic variant carriers are expected to be older than subjects who are not CH somatic variant carriers. As shown in FIG. 9, as expected, individuals with identified CH somatic variants are older than individuals with no observed CH somatic variants. FIG. 9 illustrates a bar 902 representing individuals (with ages ranging from approximately 48 to 63) that the somatic-variant calling system 106 detected were not CH somatic variant carriers. The median age of individuals represented by the bar 902 is 57. The bar 904 represents individuals (with ages ranging from approximately 46 to 65 and having a median age of 58) that were identified as carrying CH somatic variants by existing sequencing systems but were labeled as “non-carriers” and, therefore, not carrying somatic variants because the somatic-variant calling system 106 later filtered out the individuals’ candidate variants using the techniques described above (e.g., FIGS. 3A-3C and 8). For example, to determine the data represented in the bar 904, the somatic-variant calling system 106 evaluated the identified candidate variants and removed CH somatic variant carriers from the population portrayed by the bar 904.

[0127] As further shown in FIG. 9, the bar 906 represents individuals (with ages ranging from approximately 54 to 66) that the somatic-variant calling system 106 identified as CH somatic variant carriers. The median age of individuals represented by the bar 906 is 61. As indicated by a comparison of the bars 902, 904, and 906, the somatic-variant calling system 106 likely identified CH variant carriers more accurately than existing sequencing systems because ages of individuals with detected CH somatic variants were higher than ages of non-carrying individuals. The age range of CH variant carrier individuals in the bar 906 is significantly higher than that of the individuals captured in the bar 902 and the bar 904. Furthermore, as demonstrated by the bar 902 and the bar 904, the somatic-variant calling system 106 accurately filtered out individuals that do not carry CH somatic variants represented by the bar 904, as indicated by the age ranges of the individuals in the bar 902 and the individuals in the bar 904 being approximately the same.

[0128] FIG. 10 illustrates a bar graph demonstrating that the somatic-variant calling system 106 accurately produces expected results in identifying CH somatic variants associated with key genes in accordance with one or more implementations of the present disclosure. FIG. 10 illustrates a bar graph portraying most frequently mutated CH genes and CH variants detected by the somatic-variant calling system 106.

[0129] The bar graph illustrated in FIG. 10 lists CH-associated genes from top to bottom in order of mutation frequency, with the most frequently mutated gene at the top. As shown in FIG. 10, the somatic-variant calling system 106 accurately detects more CH variants in the most frequently mutated CH-associated genes. For example, DNMT3A is the most mutated, then TET2, ASXL1, etc. As shown in FIG. 10, the somatic-variant calling system 106 identifies expected proportions of CH variants corresponding to frequently mutated genes. For example, the somatic-variant calling system 106 detects the greatest number of CH variants in the DNMT3A gene, the second-greatest number of CH variants in the TET2 gene, etc.

[0130] In some implementations, the somatic-variant calling system 106 can perform subtype-specific phenotype analysis to show associations with CH at the level of specific genes or amino acids. For example, FIG. 11 illustrates a bubble plot portraying increased risk of certain diseases and conditions that the somatic-variant calling system 106 associates with specific gene mutations in accordance with one or more embodiments of the present disclosure.

[0131] In particular, FIG. 11 illustrates a bubble plot showing somatic variants in CH genes and their associated increased risk of common diseases (e.g., cardiovascular, kidney, or liver disease), cancer (e.g., lung, breast, or hematological cancer), and serious infection (e.g., bacterial or viral infection) as identified by the somatic-variant calling system 106. In contrast to existing sequencing systems that often only generally associate CH with infection, the somatic-variant calling system 106 can link certain conditions (e.g., hematological cancer, bacterial sepsis, etc.) with somatic variants in specific genes (e.g., DNMT3A, ASXL1, RUNX1, IDH2, SRSF2, etc.). As shown in FIG. 11, and as expected, the somatic-variant calling system 106 identifies a link between somatic variants in CH-associated genes and hematological cancers. Additionally, and as shown in FIG. 11, the somatic-variant calling system 106 identifies a link between serious bacterial infections (e.g., bacterial sepsis and pneumonia infectious disease) and somatic variants within CH-associated genes. By contrast, FIG. 11 illustrates no such link between serious viral infections and somatic variants within CH-associated genes.

[0132] As mentioned, because the somatic-variant calling system 106 performs multiple evaluations to identify somatic variants, the somatic-variant calling system 106 can detect somatic variants, even when they have traits similar to germline variants. FIG. 12 illustrates a stacked bar chart demonstrating that the somatic-variant calling system 106 can accurately identify somatic variants of various clone sizes. FIG. 12 further illustrates that the risks of hematological cancer or Myelodysplastic Syndrome (MDS) and serious bacterial infection increase with clone size.

[0133] As described previously, germline variants often have VAFs approximating 0.5. Accordingly, many existing systems struggle to detect somatic variants of large clone size that also have VAFs near 0.5 or 1. FIG. 12 illustrates that the somatic-variant calling system 106 accurately identifies somatic variants of varying clone sizes. For instance, the somatic-variant calling system 106 identifies small clones having VAFs between 0 and 0.1, medium clones with VAFs between 0.1 and 0.4, and large clones with VAFs between 0.4 and 1. Furthermore, and as shown in FIG. 12, the risks of hematological cancer or MDS and serious bacterial infection increase with clone size.
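By way of non-limiting illustration, the clone-size categories described above can be expressed as a simple binning of the VAF. The disclosure does not state which category a VAF of exactly 0.1 or 0.4 falls into; this sketch assumes that each boundary value belongs to the larger category. The function name is hypothetical.

```python
def clone_size(vaf):
    """Map a somatic variant's VAF to a clone-size category.

    Assumption: boundaries 0.1 and 0.4 are assigned to the larger bin.
    """
    if not 0.0 < vaf <= 1.0:
        raise ValueError("VAF must be greater than 0 and at most 1")
    if vaf < 0.1:
        return "small"
    if vaf < 0.4:
        return "medium"
    return "large"


sizes = [clone_size(v) for v in (0.05, 0.1, 0.25, 0.4, 1.0)]
```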

[0134] Turning now to FIG. 13, this figure illustrates an example flowchart of a series of acts for classifying a candidate variant as a somatic variant in accordance with one or more embodiments of the present disclosure. While FIG. 13 illustrates acts according to particular embodiments, alternative embodiments may omit, add to, reorder, and / or modify any of the acts shown in FIG. 13. The acts of FIG. 13 can be performed as part of a method. Alternatively, a non-transitory computer readable storage medium can comprise instructions that, when executed by one or more processors, cause a computing device to perform the acts depicted in FIG. 13. In still further embodiments, a system comprises at least one processor and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to perform the acts of FIG. 13.

[0135] As shown in FIG. 13, the series of acts 1300 includes an act 1302 of generating variant calls from nucleotide reads, an act 1304 of determining that a candidate variant from among the variant calls satisfies one or more threshold VAFs, an act 1306 of determining, from phased nucleotide reads, the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype, an act 1308 of determining that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, and an act 1310 of classifying the candidate variant as a somatic variant. As further indicated in FIG. 13, the act 1304 of determining that a candidate variant from among the variant calls satisfies one or more threshold VAFs is optional. Further, the acts 1306 and 1308 can both be performed together or, alternatively, one of the acts 1306 or 1308 can be performed as a basis for performing the act 1310 of classifying the candidate variant as a somatic variant.

[0136] For example, the series of acts 1300 can include acts to perform any of the operations described in the following clauses:

CLAUSE 1. A computer-implemented method comprising:
generating, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample;
determining that a candidate variant from among the variant calls satisfies one or more threshold variant allele fractions (VAFs) based on a variant allele fraction (VAF) of the candidate variant among the nucleotide reads;
determining, from phased nucleotide reads for the target genomic region of the genomic sample, the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype;
determining that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, wherein the subset of germline variants are located within a threshold number of nucleobases of the candidate variant; and
classifying the candidate variant as a somatic variant based on the VAF of the candidate variant satisfying the one or more threshold VAFs, the candidate variant segregating into the portion of the phased nucleotide reads, and the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

CLAUSE 2. The computer-implemented method of clause 1, further comprising:
determining that a pathogenicity score of the candidate variant satisfies a threshold pathogenicity score; and
classifying the candidate variant as a somatic variant further based on the pathogenicity score of the candidate variant satisfying the threshold pathogenicity score.

CLAUSE 3. The computer-implemented method of clause 1 or 2, wherein determining that the candidate variant satisfies the one or more threshold VAFs comprises determining that the VAF of the candidate variant is less than a lower threshold VAF.

CLAUSE 4. The computer-implemented method of any of clauses 1-3, wherein determining that the candidate variant satisfies the one or more threshold VAFs comprises determining that the VAF of the candidate variant exceeds or equals an upper threshold VAF.

CLAUSE 5. The computer-implemented method of any of clauses 1-4, wherein determining that the candidate variant segregates into the portion of the phased nucleotide reads comprises:
identifying, from among the nucleotide reads, a set of nucleotide reads carrying one or more germline variants;
determining, based on the one or more germline variants, a first subset of nucleotide reads corresponding with a first parental haplotype and a second subset of nucleotide reads corresponding with a second parental haplotype, wherein the phased nucleotide reads comprise the first subset of nucleotide reads and the second subset of nucleotide reads; and
determining that the candidate variant segregates into the portion of the first subset of nucleotide reads or the portion of the second subset of nucleotide reads.

CLAUSE 6. The computer-implemented method of any of clauses 1-5, wherein determining that the VAF of the candidate variant deviates from the VAF for the subset of germline variants within the deviation threshold comprises:
generating, utilizing a density function, a distribution of the VAFs of the subset of germline variants;
determining a probability of observing the VAF of the candidate variant given the distribution of the VAFs of the subset of germline variants; and
determining that the probability of observing the VAF of the candidate variant is within the deviation threshold.

CLAUSE 7. The computer-implemented method of any of clauses 1-6, further comprising:
determining that an additional candidate variant from among the variant calls fails to satisfy the one or more threshold VAFs based on a VAF of the additional candidate variant among the nucleotide reads; or
determining that an additional pathogenicity score of the additional candidate variant fails to satisfy a threshold pathogenicity score;
determining that the VAF of the additional candidate variant deviates from VAFs of an additional subset of germline variants by the deviation threshold, wherein the additional subset of germline variants are located within the threshold number of nucleobases of the additional candidate variant; and
classifying the additional candidate variant as a somatic variant based on the VAF of the additional candidate variant deviating from the VAFs of the additional subset of germline variants within the deviation threshold.

CLAUSE 8. The computer-implemented method of any of clauses 1-7, further comprising:
determining that an additional candidate variant from among the variant calls fails to satisfy the one or more threshold VAFs based on a VAF of the additional candidate variant among the nucleotide reads;
determining that an additional pathogenicity score of the additional candidate variant fails to satisfy a threshold pathogenicity score; or
determining that the VAF of the additional candidate variant does not deviate from VAFs of an additional subset of germline variants by the deviation threshold, wherein the additional subset of germline variants are located within the threshold number of nucleobases of the additional candidate variant; and
determining, from the phased nucleotide reads for the target genomic region of the genomic sample, that the additional candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype; and
classifying the additional candidate variant as a somatic variant based on the additional candidate variant segregating into a portion of the phased nucleotide reads for the one parental haplotype.

CLAUSE 9. The computer-implemented method of any of clauses 1-8, further comprising identifying, based on classifying the candidate variant as a somatic variant, a biochemical compound for treatment.

CLAUSE 10. The computer-implemented method of any of clauses 1-9, further comprising providing, for display via a user interface of a client device, an indicator that a subject corresponding with the genomic sample has clonal hematopoiesis of indeterminate potential (CHIP), wherein the genomic sample comprises deoxyribonucleic acid (DNA) from a blood sample or a saliva sample.

CLAUSE 11. The computer-implemented method of any of clauses 1-10, further comprising determining the candidate variant by:
determining that the candidate variant is in a known clonal-hematopoiesis-associated region in a threshold number of samples from a mutation database;
determining that an allele frequency of the candidate variant is less than or satisfies a threshold population allele frequency for the threshold number of samples; and
selecting, based on the allele frequency of the candidate variant satisfying the threshold population allele frequency, the candidate variant for labeling derived from the one or more threshold VAFs, segregation into one parental haplotype, and the deviation threshold.

CLAUSE 12. The computer-implemented method of any of clauses 1-11, wherein classifying the candidate variant as a somatic variant comprises classifying the candidate variant as a clonal variant.

CLAUSE 13. A computer-implemented method comprising:
generating, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample;
phasing the nucleotide reads for the target genomic region of the genomic sample according to a set of parental haplotypes;
determining, from the phased nucleotide reads for the target genomic region of the genomic sample, a candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype of the set of parental haplotypes; and
classifying the candidate variant as a somatic variant based on the candidate variant segregating into the portion of the phased nucleotide reads.

CLAUSE 14. A computer-implemented method comprising:
generating, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample;
identifying a subset of germline variants located within a threshold number of nucleobases of a candidate variant of the variant calls;
determining that a VAF of the candidate variant deviates from variant allele fractions (VAFs) of the subset of germline variants within a deviation threshold; and
classifying the candidate variant as a somatic variant based on the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

[0137] Appendix A illustrates one or more embodiments of the somatic-variant calling system and is hereby incorporated by reference.

[0138] Appendix B illustrates charts and graphs portraying supplemental data corresponding with one or more embodiments of the somatic-variant calling system and is hereby incorporated by reference.

[0139] The methods described herein can be used in conjunction with a variety of nucleic acid sequencing techniques. Particularly applicable techniques are those wherein nucleic acids are attached at fixed locations in an array such that their relative positions do not change and wherein the array is repeatedly imaged. Embodiments in which images are obtained in different color channels, for example, coinciding with different labels used to distinguish one nucleobase type from another, are particularly applicable. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid (i.e., a nucleic acid polymer) can be an automated process. Preferred embodiments include sequencing-by-synthesis (SBS) techniques.

[0140] SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleotide in the presence of a polymerase in each delivery. However, in the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.

[0141] SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using γ-phosphate-labeled nucleotides, as set forth in further detail below. In methods using nucleotide monomers lacking terminators, the number of nucleotides added in each cycle is generally variable and dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used, as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible, as is the case for sequencing methods developed by Solexa (now Illumina, Inc.).

[0142] SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a byproduct of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).

[0143] Preferred embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) "Real-time DNA sequencing using detection of pyrophosphate release." Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) "Pyrosequencing sheds light on DNA sequencing." Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U. S. Pat. No. 6,210,891; U. S. Pat. No. 6,258,568 and U. S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons. The nucleic acids to be sequenced can be attached to features in an array and the array can be imaged to capture the chemiluminescent signals that are produced due to incorporation of a nucleotide at the features of the array. An image can be obtained after the array is treated with a particular nucleotide type (e.g., A, T, C or G). Images obtained after addition of each nucleotide type will differ with regard to which features in the array are detected. These differences in the image reflect the different sequence content of the features on the array. However, the relative locations of each feature will remain unchanged in the images. The images can be stored, processed and analyzed using the methods set forth herein. For example, images obtained after treatment of the array with each different nucleotide type can be handled in the same way as exemplified herein for images obtained from different detection channels for reversible terminator-based sequencing methods.
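
By way of illustration only, the following Python sketch shows how the per-flow signals for a single array feature might be turned into a read when nucleotide types are delivered one at a time, as in pyrosequencing. It assumes that each signal has already been normalized so that a value near 1.0 corresponds to one incorporated nucleotide and a value near 2.0 to two (a homopolymer run); that normalization, the function name, and the rounding rule are assumptions and are not taken from the disclosure.

```python
def decode_flowgram(flow_order, signals):
    """Convert normalized per-flow signals for one feature into a read.

    flow_order: string of nucleotide types in delivery order, e.g. "TACG..."
    signals:    one normalized light signal per flow (assumed ~1.0 per
                incorporated nucleotide)
    """
    if len(flow_order) != len(signals):
        raise ValueError("flow_order and signals must have the same length")

    read = []
    for base, signal in zip(flow_order, signals):
        # Round to the nearest whole number of incorporations; a signal
        # near zero means this nucleotide type was not incorporated.
        count = max(0, round(signal))
        read.append(base * count)
    return "".join(read)
```

The returned string is the sequence of the extended (nascent) strand, which is complementary to the template at that feature.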

[0144] In another exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label as described, for example, in WO 04 / 018497 and U. S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference. This approach is being commercialized by Solexa (now Illumina Inc.), and is also described in WO 91 / 06678 and WO 07 / 123,744, each of which is incorporated herein by reference. The availability of fluorescently labeled terminators in which both the termination can be reversed and the fluorescent label cleaved facilitates efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides.

[0145] Preferably in reversible terminator-based sequencing embodiments, the labels do not substantially inhibit extension under SBS reaction conditions. However, the detection labels can be removable, for example, by cleavage or degradation. Images can be captured following incorporation of labels into arrayed nucleic acid features. In particular embodiments, each cycle involves simultaneous delivery of four different nucleotide types to the array and each nucleotide type has a spectrally distinct label. Four images can then be obtained, each using a detection channel that is selective for one of the four different labels. Alternatively, different nucleotide types can be added sequentially, and an image of the array can be obtained between each addition step. In such embodiments, each image will show nucleic acid features that have incorporated nucleotides of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature. However, the relative position of the features will remain unchanged in the images. Images obtained from such reversible terminator-SBS methods can be stored, processed and analyzed as set forth herein. Following the image capture step, labels can be removed, and reversible terminator moieties can be removed for subsequent cycles of nucleotide addition and detection. Removal of the labels after they have been detected in a particular cycle and prior to a subsequent cycle can provide the advantage of reducing background signal and crosstalk between cycles. Examples of useful labels and removal methods are set forth below.
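
As a simplified, non-limiting illustration of the four-image embodiment described above, the following Python sketch assigns a nucleobase to each array feature for one cycle by selecting the detection channel with the strongest signal. The channel order (A, C, G, T), the `min_signal` cutoff, and the use of "N" for features with no usable signal are assumptions made for the example; practical base callers typically also correct for crosstalk and phasing, which this sketch omits.

```python
# Hypothetical mapping from detection-channel index to nucleotide type.
CHANNEL_TO_BASE = ("A", "C", "G", "T")


def call_bases_four_channel(intensities, min_signal=0.0):
    """Call one base per feature for a single sequencing cycle.

    intensities: sequence of 4-element sequences, one per feature, holding
                 the signal measured in each of the four channels
    min_signal:  assumed cutoff; the brightest channel must exceed it
    """
    calls = []
    for feature in intensities:
        if len(feature) != 4:
            raise ValueError("each feature needs exactly four channel values")
        # Index of the brightest channel for this feature.
        best = max(range(4), key=lambda i: feature[i])
        calls.append(CHANNEL_TO_BASE[best] if feature[best] > min_signal else "N")
    return "".join(calls)
```

Because the relative positions of the features do not change between images, the same feature index can be used to look up the signal in each of the four channel images.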

[0146] In particular embodiments, some or all of the nucleotide monomers can include reversible terminators. In such embodiments, reversible terminators / cleavable fluors can include fluor linked to the ribose moiety via a 3' ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3' allyl group to block extension but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and / or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U. S. Pat. No. 7,427,673, and U. S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.

[0147] Additional exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U. S. Patent Application Publication No. 2007 / 0166705, U. S. Patent Application Publication No. 2006 / 0188901, U. S. Pat. No. 7,057,026, U. S. Patent Application Publication No. 2006 / 0240439, U. S. Patent Application Publication No. 2006 / 0281109, PCT Publication No. WO 05 / 065814, U. S. Patent Application Publication No. 2005 / 0100900, PCT Publication No. WO 06 / 064199, PCT Publication No. WO 07 / 010,251, U. S. Patent Application Publication No. 2012 / 0270305 and U. S. Patent Application Publication No. 2013 / 0260372, the disclosures of which are incorporated herein by reference in their entireties.

[0148] Some embodiments can utilize detection of four different nucleotides using fewer than four different labels. For example, SBS can be performed utilizing methods and systems described in the incorporated materials of U. S. Patent Application Publication No. 2013 / 0079232. As a first example, a pair of nucleotide types can be detected at the same wavelength, but distinguished based on a difference in intensity for one member of the pair compared to the other, or based on a change to one member of the pair (e.g. via chemical modification, photochemical modification or physical modification) that causes apparent signal to appear or disappear compared to the signal detected for the other member of the pair. As a second example, three of four different nucleotide types can be detected under particular conditions while a fourth nucleotide type lacks a label that is detectable under those conditions, or is minimally detected under those conditions (e.g., minimal detection due to background fluorescence, etc.). Incorporation of the first three nucleotide types into a nucleic acid can be determined based on presence of their respective signals and incorporation of the fourth nucleotide type into the nucleic acid can be determined based on absence or minimal detection of any signal. As a third example, one nucleotide type can include label(s) that are detected in two different channels, whereas other nucleotide types are detected in no more than one of the channels. The aforementioned three exemplary configurations are not considered mutually exclusive and can be used in various combinations. An exemplary embodiment that combines all three examples is a fluorescent-based SBS method that uses a first nucleotide type that is detected in a first channel (e.g. dATP having a label that is detected in the first channel when excited by a first excitation wavelength), a second nucleotide type that is detected in a second channel (e.g. dCTP having a label that is detected in the second channel when excited by a second excitation wavelength), a third nucleotide type that is detected in both the first and the second channel (e.g. dTTP having at least one label that is detected in both channels when excited by the first and / or second excitation wavelength) and a fourth nucleotide type that lacks a label and is not, or is only minimally, detected in either channel (e.g. dGTP having no label).
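
The exemplary two-channel configuration above can be expressed as a small lookup table. The following Python sketch is illustrative only: it follows the dATP / dCTP / dTTP / dGTP assignment given in the example, while the function name and the fixed `on_threshold` used to decide whether a channel counts as "detected" are assumptions introduced here.

```python
# (detected in first channel, detected in second channel) -> nucleotide type,
# following the exemplary assignment: A in channel 1 only, C in channel 2
# only, T in both channels, and G (unlabeled) in neither.
TWO_CHANNEL_CODE = {
    (True, False): "A",
    (False, True): "C",
    (True, True): "T",
    (False, False): "G",
}


def call_base_two_channel(channel1, channel2, on_threshold=0.5):
    """Call a base for one feature from two channel intensities.

    on_threshold is a hypothetical cutoff separating "detected" from
    "not, or minimally, detected".
    """
    key = (bool(channel1 >= on_threshold), bool(channel2 >= on_threshold))
    return TWO_CHANNEL_CODE[key]
```

A table of this kind makes explicit that the absence of signal in both channels is itself informative, since it identifies the fourth nucleotide type.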

[0149] Further, as described in the incorporated materials of U. S. Patent Application Publication No. 2013 / 0079232, sequencing data can be obtained using a single channel. In such so-called one-dye sequencing approaches, the first nucleotide type is labeled but the label is removed after the first image is generated, and the second nucleotide type is labeled only after a first image is generated. The third nucleotide type retains its label in both the first and second images, and the fourth nucleotide type remains unlabeled in both images.

[0150] Some embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides. The oligonucleotides typically have different labels that are correlated with the identity of a particular nucleotide in a sequence to which the oligonucleotides hybridize. As with other SBS methods, images can be obtained following treatment of an array of nucleic acid features with the labeled sequencing reagents. Each image will show nucleic acid features that have incorporated labels of a particular type. Different features are present or absent in the different images due to the different sequence content of each feature, but the relative position of the features will remain unchanged in the images. Images obtained from ligation-based sequencing methods can be stored, processed and analyzed as set forth herein. Exemplary SBS systems and methods which can be utilized with the methods and systems described herein are described in U. S. Pat. No. 6,969,488, U. S. Pat. No. 6,172,218, and U. S. Pat. No. 6,306,597, the disclosures of which are incorporated herein by reference in their entireties.

[0151] Some embodiments can utilize nanopore sequencing (Deamer, D. W. & Akeson, M. "Nanopores and nucleic acids: prospects for ultrarapid sequencing." Trends Biotechnol. 18, 147-151 (2000); Deamer, D. and D. Branton, "Characterization of nucleic acids by nanopore analysis". Acc. Chem. Res. 35:817-825 (2002); Li, J., M. Gershow, D. Stein, E. Brandin, and J. A. Golovchenko, "DNA molecules and configurations in a solid-state nanopore microscope" Nat. Mater. 2:611-615 (2003), the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. (U. S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. "Progress toward ultrafast DNA sequencing using solid-state nanopores." Clin. Chem. 53, 1996-2001 (2007); Healy, K. "Nanopore-based single-molecule DNA analysis." Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. "A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution." J. Am. Chem. Soc. 130, 818-820 (2008), the disclosures of which are incorporated herein by reference in their entireties). Data obtained from nanopore sequencing can be stored, processed and analyzed as set forth herein. In particular, the data can be treated as an image in accordance with the exemplary treatment of optical images and other images that is set forth herein.

[0152] Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides as described, for example, in U. S. Pat. No. 7,329,492 and U. S. Pat. No. 7,211,414 (each of which is incorporated herein by reference) or nucleotide incorporations can be detected with zero-mode waveguides as described, for example, in U. S. Pat. No. 7,315,019 (which is incorporated herein by reference) and using fluorescent nucleotide analogs and engineered polymerases as described, for example, in U. S. Pat. No. 7,405,281 and U. S. Patent Application Publication No. 2008 / 0108082 (each of which is incorporated herein by reference). The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. "Zero-mode waveguides for single-molecule analysis at high concentrations." Science 299, 682-686 (2003); Lundquist, P. M. et al. "Parallel confocal detection of single molecules in real time." Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. "Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures." Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties). Images obtained from such methods can be stored, processed and analyzed as set forth herein.

[0153] Some SBS embodiments include detection of a proton released upon incorporation of a nucleotide into an extension product. For example, sequencing based on detection of released protons can use an electrical detector and associated techniques that are commercially available from Ion Torrent (Guilford, CT, a Life Technologies subsidiary) or sequencing methods and systems described in US 2009 / 0026082 A1; US 2009 / 0127589 A1; US 2010 / 0137143 A1; or US 2010 / 0282617 A1, each of which is incorporated herein by reference. Methods set forth herein for amplifying target nucleic acids using kinetic exclusion can be readily applied to substrates used for detecting protons. More specifically, methods set forth herein can be used to produce clonal populations of amplicons that are used to detect protons.

[0154] The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle, or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.

[0155] The methods set forth herein can use arrays having features at any of a variety of densities including, for example, at least about 10 features / cm², 100 features / cm², 500 features / cm², 1,000 features / cm², 5,000 features / cm², 10,000 features / cm², 50,000 features / cm², 100,000 features / cm², 1,000,000 features / cm², 5,000,000 features / cm², or higher.

[0156] An advantage of the methods set forth herein is that they provide for rapid and efficient detection of a plurality of target nucleic acids in parallel. Accordingly, the present disclosure provides integrated systems capable of preparing and detecting nucleic acids using techniques known in the art such as those exemplified above. Thus, an integrated system of the present disclosure can include fluidic components capable of delivering amplification reagents and / or sequencing reagents to one or more immobilized DNA fragments, the system comprising components such as pumps, valves, reservoirs, fluidic lines and the like. A flow cell can be configured and / or used in an integrated system for detection of target nucleic acids. Exemplary flow cells are described, for example, in US 2010 / 0111768 A1 and US Ser. No. 13 / 273,666, each of which is incorporated herein by reference. As exemplified for flow cells, one or more of the fluidic components of an integrated system can be used for an amplification method and for a detection method. Taking a nucleic acid sequencing embodiment as an example, one or more of the fluidic components of an integrated system can be used for an amplification method set forth herein and for the delivery of sequencing reagents in a sequencing method such as those exemplified above. Alternatively, an integrated system can include separate fluidic systems to carry out amplification methods and to carry out detection methods. Examples of integrated sequencing systems that are capable of creating amplified nucleic acids and also determining the sequence of the nucleic acids include, without limitation, the MiSeq™ platform (Illumina, Inc., San Diego, CA) and devices described in US Ser. No. 13 / 273,666, which is incorporated herein by reference. The sequencing system described above sequences nucleic acid polymers present in samples received by a sequencing device, as described further above.

[0157] Further, the methods and compositions disclosed herein may be useful to amplify a nucleic acid sample having low-quality nucleic acid molecules, such as degraded and / or fragmented genomic DNA from a forensic sample. In one embodiment, forensic samples can include nucleic acids obtained from a crime scene, nucleic acids obtained from a missing persons DNA database, nucleic acids obtained from a laboratory associated with a forensic investigation or include forensic samples obtained by law enforcement agencies, one or more military services or any such personnel. The nucleic acid sample may be a purified sample or a crude DNA-containing lysate, for example derived from a buccal swab, paper, fabric or other substrate that may be impregnated with saliva, blood, or other bodily fluids. As such, in some embodiments, the nucleic acid sample may comprise low amounts of, or fragmented portions of, DNA, such as genomic DNA. In some embodiments, target sequences can be present in one or more bodily fluids including but not limited to, blood, sputum, plasma, semen, urine and serum. In some embodiments, target sequences can be obtained from hair, skin, tissue samples, autopsy or remains of a victim. In some embodiments, nucleic acids including one or more target sequences can be obtained from a deceased animal or human. In some embodiments, target sequences can include nucleic acids obtained from non-human DNA such as microbial, plant or entomological DNA. In some embodiments, target sequences or amplified target sequences are directed to purposes of human identification. In some embodiments, the disclosure relates generally to methods for identifying characteristics of a forensic sample. In some embodiments, the disclosure relates generally to human identification methods using one or more target specific primers disclosed herein or one or more target specific primers designed using the primer design criteria outlined herein. In one embodiment, a forensic or human identification sample containing at least one target sequence can be amplified using any one or more of the target-specific primers disclosed herein or using the primer criteria outlined herein.

[0158] The components of the somatic-variant calling system 106 can include software, hardware, or both. For example, the components of the somatic-variant calling system 106 can include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices (e.g., the client device 114, the local device 108, or the server device(s) 110). When executed by the one or more processors, the computer-executable instructions of the somatic-variant calling system 106 can cause the computing devices to perform the methods described herein. Alternatively, the components of the somatic-variant calling system 106 can comprise hardware, such as special purpose processing devices to perform a certain function or group of functions. Additionally, or alternatively, the components of the somatic-variant calling system 106 can include a combination of computer-executable instructions and hardware.

[0159] Furthermore, the components of the somatic-variant calling system 106 performing the functions described herein with respect to the somatic-variant calling system 106 may, for example, be implemented as part of a stand-alone application, as a module of an application, as a plug-in for applications, as a library function or functions that may be called by other applications, and / or as a cloud-computing model. Thus, components of the somatic-variant calling system 106 may be implemented as part of a stand-alone application on a personal computing device or a mobile device. Additionally, or alternatively, the components of the somatic-variant calling system 106 may be implemented in any application that provides sequencing services including, but not limited to, Illumina BaseSpace, Illumina DRAGEN, or Illumina TruSight software. “Illumina,” “BaseSpace,” “DRAGEN,” and “TruSight,” are either registered trademarks or trademarks of Illumina, Inc. in the United States and / or other countries.

[0160] Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and / or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium, (e.g., a memory, etc.), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.

[0161] Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.

[0162] Non-transitory computer-readable storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives (SSDs) (e.g., based on RAM), Flash memory, phase-change memory (PCM), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

[0163] A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and / or modules and / or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmission media can include a network and / or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media.

[0164] Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non-transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a NIC), and then eventually transferred to computer system RAM and / or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.

[0165] Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and / or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

[0166] Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

[0167] Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.

[0168] A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a “cloud-computing environment” is an environment in which cloud computing is employed.

[0169] FIG. 14 illustrates a block diagram of a computing device 1400 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 1400 may implement the somatic-variant calling system 106 and the sequencing device system 104. As shown by FIG. 14, the computing device 1400 can comprise a processor 1402, a memory 1404, a storage device 1406, an I / O interface 1408, and a communication interface 1410, which may be communicatively coupled by way of a communication infrastructure 1412. In certain embodiments, the computing device 1400 can include fewer or more components than those shown in FIG. 14. The following paragraphs describe components of the computing device 1400 shown in FIG. 14 in additional detail.

[0170] In one or more embodiments, the processor 1402 includes hardware for executing instructions, such as those making up a computer program. As an example, and not by way of limitation, to execute instructions for dynamically modifying workflows, the processor 1402 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 1404, or the storage device 1406 and decode and execute them. The memory 1404 may be a volatile or nonvolatile memory used for storing data, metadata, and programs for execution by the processor(s). The storage device 1406 includes storage, such as a hard disk, flash disk drive, or other digital storage device, for storing data or instructions for performing the methods described herein.

[0171] The I / O interface 1408 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 1400. The I / O interface 1408 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I / O devices or a combination of such I / O interfaces. The I / O interface 1408 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I / O interface 1408 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and / or any other graphical content as may serve a particular implementation.

[0172] The communication interface 1410 can include hardware, software, or both. In any event, the communication interface 1410 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 1400 and one or more other computing devices or networks. As an example, and not by way of limitation, the communication interface 1410 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.

[0173] Additionally, the communication interface 1410 may facilitate communications with various types of wired or wireless networks. The communication interface 1410 may also facilitate communications using various communication protocols. The communication infrastructure 1412 may also include hardware, software, or both that couples components of the computing device 1400 to each other. For example, the communication interface 1410 may use one or more networks and / or protocols to enable a plurality of computing devices connected by a particular infrastructure to communicate with each other to perform one or more aspects of the processes described herein. To illustrate, the sequencing process can allow a plurality of devices (e.g., a client device, sequencing device, and server device(s)) to exchange information such as sequencing data and error notifications.

[0174] In the foregoing specification, the present disclosure has been described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the present disclosure(s) are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure.

[0175] The present disclosure may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. For example, the methods described herein may be performed with fewer or more steps / acts or the steps / acts may be performed in differing orders. Additionally, the steps / acts described herein may be repeated or performed in parallel with one another or in parallel with different instances of the same or similar steps / acts. The scope of the present application is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Appendix A

Somatic leukocyte clone size drives bacterial sepsis and infection

Abstract

Background

Clonal Haematopoiesis (CH) is caused by mutations driving clonal expansion in haematopoietic stem cells (HSCs) and is known to substantially increase the risk of developing haematopoietic malignancies. Other non-cancerous, clinical associations with CH have been reported, but challenges in identifying authentic somatic mutations across the general population have led to conflicting results.

Methods

Using an improved somatic variant calling approach, we identify 19,822 somatic mutations across 43 known CH genes in 40,560 out of 590,537 individuals from 3 population cohorts (6.9%).
Phenotypic associations detected via survival analysis were validated in a meta-analysis including two further cohorts (889,215 individuals).

Results

The presence of one or more of the detected CH mutations not only leads to increased risk of haematopoietic malignancies (HR = 2.15, P = 1.84 × 10^-75), but surprisingly is associated with the development of serious bacterial infections (SBI) (HR = 1.22, P = 4.66 × 10^-29), and subsequent mortality within 28 days of infection (RR = 1.84, P = 0.0012) prior to or without an accompanying cancer diagnosis and regardless of bacterial strain. Of note, recurrent driver mutations in SRSF2 (n = 400), DNMT3A (n = 1,381), and SF3B1 (n = 177) confer large absolute risks: 25% of individuals with expanded (> 40%), acquired mutations at the P95 residue in SRSF2 (n = 55) developed a SBI within 10 years (P = 1.51 × 10^-11).

Conclusions

Clonal HSC expansions impact proper immune response to bacterial infection. In particular, the presence of clonal mutations affecting SRSF2 P95 and other key residues could be markers of infection risk and infection-related mortality.

Introduction

Clonal haematopoiesis (CH) occurs when somatic mutations in haematopoietic stem cells (HSCs) provide a relative growth or survival advantage to mutated HSCs and their progeny. The chance of acquiring one or more of these mutations increases with age, leading to detectable CH in peripheral blood populations for over 10% of individuals by age 65. As clones expand and grow, so does the risk of developing a haematopoietic malignancy. In addition to greatly increasing the risk of blood cancer, several lines of evidence from epidemiological studies and animal models have linked CH with non-malignant, age-related inflammatory conditions including coronary heart disease (CHD), stroke, chronic obstructive pulmonary disease (COPD), chronic kidney disease (CKD), and chronic liver disease (CLD).
However, the mechanisms by which CH leads to immune system dysfunction remain poorly understood, and in some instances CH appears to be a consequence of the immune system’s response to pathogenic infections across decades. CH is increasingly recognized as a set of similar, but distinct, subtypes whose phenotypic expression corresponds to the underlying mutation(s) and affected gene(s). For example, DNMT3A clones are the most common but expand slowly and have mild clinical impacts, SRSF2 clones expand quickly and are preferentially linked to myelodysplastic syndrome, and JAK2 / TET2 clones are most strongly linked to CLD. Although CH has been preliminarily linked to mildly increased rates of bacterial infection, it is unclear if there are high-risk molecular subtypes driving this association.

In this study, we investigated the risk of developing a serious bacterial infection (SBI) among 889,215 individuals (including 590,537 individuals in a three-cohort discovery phase) with specific CH-driver mutations in five distinct population cohorts. We identified molecular subtypes based upon the mutated gene or hotspot residue that showed a significant and dose-dependent (extent of clonal expansion) association with infection risk and severity prior to, or independent from, a clinical diagnosis of myeloid malignancy. Notably, the observed increase in infection rate was largely independent of the causal pathogen, although individuals with specific combinations of subtypes had higher rates of infection than those with either mutation alone.
Overall, this study not only identifies a subset of individuals at a greatly increased risk for SBI, but it provides a better understanding of the factors involved in how immune system dysregulation can lead to impaired response to bacterial pathogens.

Results

Variant calling and identification of clonal haematopoiesis

Using a comprehensive variant calling and filtering approach, we identified somatic mutations in previously defined CH genes in the UK Biobank (UKBB), Genomics England (GEL), Vanderbilt University Medical Centre (VUMC), All of Us (AoU), and Massachusetts General Brigham Biobank (MGBB) cohorts (Table 1, Fig. S1). Briefly, after calling likely somatic variants with Mutect2, we further enriched for real somatic mutations by requiring a low median variant allele fraction (VAF) across carriers of novel mutations, performing read-backed phasing of known germline variants and putative somatic variants, accounting for local germline copy number, and filtering to mutations with predicted functional or pathogenic effects (Fig. S2a, Table S1). Notably, individuals with somatic variants identified by Mutect2 but removed by our filters had a similar age range to other non-carriers (Cohen’s d score = 0.19) and were younger than individuals with CH (Cohen’s d score = 0.55, Fig. 1a, Fig. S2b), consistent with a germline origin for mutations that were filtered out.

In the discovery cohorts (UKBB, GEL, VUMC), 19,822 unique mutations were detected across 43 CH genes in 40,560 individuals (6.9% of 590,537, Table 1), including 5,187 individuals with large clones that substantially contribute to mature blood cell populations (VAF > 0.4). Consistent with previous studies, DNMT3A, TET2, and ASXL1 were the most frequently mutated (Fig. 1b).
Comparing between the discovery and validation cohorts (discovery plus AoU, MGBB), the relative prevalences of molecular subtypes based upon specific genes (SRSF2, SF3B1, TET2, ASXL1, DNMT3A) or recurrently mutated residues (SRSF2 P95, SF3B1 K700, DNMT3A R882, NRAS G12, IDH2 R140) were similar (Table 1).

Associations between clonal haematopoiesis and disease

Cox proportional hazards models followed by fixed effect meta-analysis were used to test for associations between CH and rates of malignancy in the discovery cohorts (Fig. 1c, Table S2). As expected, CH was strongly associated with the risk of developing blood cancer (HRmeta = 2.23, Pmeta = 3.82 × 10^-78), and higher rates of this malignancy were observed for expanded clones (VAF > 0.2) (HRmeta = 3.41, Pmeta = 1.04 × 10^-82). In subtype analyses, we observed significant (FDR < 0.05) associations for 17 out of 43 CH genes. Notably, subtypes defined by molecular lesions in DNMT3A, TET2, ASXL1, SRSF2, and SF3B1 exhibited strong, dose-dependent (size of clonal expansion) relationships with blood cancer (all Pmeta < 3.67 × 10^-12, Fig. 1d, Table S3). Finally, we tested for associations between CH and other types of cancer, finding significantly higher rates of lung cancer (HRmeta = 1.47, Pmeta = 9.81 × 10^-16) and breast cancer (HRmeta = 1.09, Pmeta = 0.01).

In addition to malignant outcomes, we investigated whether CH was associated with benign sequelae of immune system dysfunction. First, we tested whether CH was associated with age-related inflammatory diseases, replicating previous associations with CHD (HRmeta = 1.09, Pmeta = 7.29 × 10^-12) and CKD (HRmeta = 1.10, Pmeta = 1.70 × 10^-6) and, to a lesser extent, CLD (HRmeta = 1.05, Pmeta = 0.03). Next, we tested whether CH was associated with viral or bacterial infection, finding a significant (FDR < 0.05) association between CH and SBI, which includes sepsis and bacterial pneumonia (HRmeta = 1.22, Pmeta = 1.51 × 10^-26, Fig. 1c) but not CH and viral infections (HRmeta = 1.05, Pmeta = 0.14, Fig. 1c).

Although we only considered infections occurring prior to a blood cancer diagnosis, the subtype and dose-dependent associations between CH and SBI closely resembled those between CH and blood cancer. Specifically, seven out of the eight CH subtypes associated with blood cancer were also significantly associated with SBI (Fig. 1c). Moreover, clonal expansion overall (blood cancer: HRmeta = 3.41, Pmeta = 1.04 × 10^-82; SBI: HRmeta = 1.49, Pmeta = 1.10 × 10^-38) and within subtypes defined by DNMT3A, TET2, ASXL1, SRSF2, and SF3B1 had correspondingly higher rates of both SBI and blood cancer (Fig. 1d). Notably, individuals with large splicing factor gene clones (VAF > 0.4) had some of the highest risks for developing SBI (HRmeta = 5.26, Pmeta = 1.51 × 10^-11 for SRSF2; HRmeta = 5.45, Pmeta = 3.11 × 10^-5 for SF3B1), but even clones with DNMT3A mutations, typically considered the most mild subtype, displayed greatly elevated rates of SBI once expanded (HRmeta = 2.63, Pmeta = 1.39 × 10^-8).

We next further explored the role of 5 commonly mutated residues (somatic hotspots) associated with blood cancer as exemplar clonal markers (SRSF2 P95, SF3B1 K700, DNMT3A R882, NRAS G12, IDH2 R140; Table 1). These markers allow for more straightforward comparisons across CH subtypes by reducing bias due to (1) residual germline variant contamination in large clonal expansions and (2) heterogeneity in variant effects within genes. An analysis of the cumulative incidence of haematological cancers or MDS and SBI showed an increase in incidence in carriers of these clonal mutations (Fig. 1e). While individuals with no identifiable clone had a haematological cancer or MDS incidence rate of 0.49% after 15 years, clone carriers had an incidence rate of 1.10% over the same period. Similarly, individuals with no identifiable clone had an SBI incidence rate of 5.46% and clone carriers had an incidence rate of 8.49%.
To further show that CH is a driver of SBI and not just a marker of undiagnosed MDS cases, we looked at the association of CH with SBI in individuals previously diagnosed with haematological cancer or MDS (Fig. 1f). Carriers of recurrent clonal mutations retained their association with SBI, suggesting that these mutations are robust markers of infectious disease outside of their role in haematological malignancies (Table S4).

In the discovery cohort, we observed that 0.23% (n = 1329) of individuals have medium or large clonal P95, K700, R882, G12, or R140 expansions. In a mega-analysis, 25% of individuals with a large clonal P95 expansion developed a SBI within 10 years of follow-up time (Pmega = 5.78 × 10^-19, Fig. 1g, Table S5). Large clonal R882 expansions were more common but only 9% of individuals developed a SBI in the same time-frame (Pmega = 6.88 × 10^-5, Fig. 1g, Table S5).

We replicated findings between CH subtypes and SBI in the validation cohorts. Using a cross-sectional analysis more amenable to limitations in follow-up and phenotyping in the validation cohorts, we combined all 5 cohorts into a joint meta-analysis (n = 889,215, Fig. 2A, Table S6), again finding significant associations for gene-based subtypes (ORmeta = 1.81, Pmeta < 1 × 10^-300 for SRSF2; ORmeta = 1.53, Pmeta = 4.54 × 10^-232 for SF3B1; ORmeta = 1.20, Pmeta < 1 × 10^-300 for TET2; ORmeta = 1.22, Pmeta < 1 × 10^-300 for ASXL1; ORmeta = 1.12, Pmeta < 1 × 10^-300 for DNMT3A). Finally, we investigated whether CH impacted SBI severity assessed by 28-day mortality rates following SBI in comparison to all individuals diagnosed with SBI. Restricting to individuals with a hotspot subtype, we observed that expanded clone sizes were significantly associated with 28-day mortality (RRmeta = 1.29, Pmeta = 0.0012, Fig. 2b, Table S7). Large clonal P95 expansions had the highest 28-day mortality rates (RRmeta = 1.77, Pmeta = 0.0028, Fig. 2b, Table S7).

Carriers of multiple clonal haematopoiesis subtypes

To investigate if carrying multiple CH mutations impacted SBI, we performed a mega-analysis in the discovery cohorts comparing carriers of multiple clonal mutations to carriers of single mutations in the genes previously explored. Overall, we found that 820 individuals carry more than one clonal mutation. Focusing first on the most common hotspot overlap (SRSF2 P95 with TET2 or IDH2 mutations), we observed that 18% of SRSF2 P95-only carriers had a SBI within 10 years of follow-up, whereas 36% of individuals with an additional IDH2 mutation had a SBI over the same period (Pmega = 3.51 × 10^-5, Fig. 2c, Table S8). Consistent with a stepwise accumulation of mutations during expansion, P95 clones were more expanded when individuals carried an additional TET2 or IDH2 mutation (Pmega = 8.10 × 10^-5 and 8.90 × 10^-7, respectively, Fig. 2d). Controlling for P95 clonal fraction moderated but did not fully account for the increased risk of SBI (Pmega < 1 × 10^-300). Similar potential non-additive effects were observed for TET2 with other mutations, specifically SF3B1 (Pmega = 2.81 × 10^-2, Fig. 2e, Table S8).

In addition to single nucleotide or short indel variants, somatic copy number variants (CNVs) have been shown to drive CH. Previous work has shown that CH subtypes defined by clonal CNV expansions are associated with higher rates of infection. Here, we compared rates of SBI between CH subtypes from this study, CH due to somatic CNVs (deletions only), and carriers of both subtypes. In the UKBB cohort, we observed that CH due to either short variants or CNVs was significantly associated with SBI (P = 8.55 × 10^-66 and 5.93 × 10^-82, respectively) (Fig. 2f, Table S9). However, individuals with DNMT3A R882 or SRSF2 P95 mutations had a higher risk of SBI when they were also carriers of a somatic CNV (P = 0.014 and 0.011, respectively; Fig. 2g, Table S10).

Bacterial strains underlying SBI and clonal haematopoiesis

Primary bacterial strains detected in culture as part of the SBI diagnostic process were available in the VUMC cohort. Of the 7,262 individuals diagnosed with SBI, 3,520 (48%) had a specific strain detected. In both CH carriers and non-carriers, the most common strains were Staphylococcus, E. coli, Streptococcus, and Klebsiella. The proportion of strains did not differ by CH status (P = 0.88), nor were there any apparent strain-specific differences in SBI rates across CH subtypes (Fig. 2h, Table S11), suggesting the effect of CH on SBI is largely strain independent.

Discussion

Understanding the association between specific somatic mutations in HSCs and immune system function is key to identifying high-risk patients, enabling targeted patient monitoring and potentially delivering earlier life-saving care. By highlighting individual driver mutations that greatly disturb immune system function, this study provides a more specific understanding of the relationship between CH molecular subtypes and infectious diseases. In doing so, it highlights subtypes with a much greater effect on the immune system than can be seen when observing CH as a singular disease.

Of the well-studied and often commonly mutated genes (including DNMT3A, TET2, and ASXL1) found in our discovery cohort, rare mutations in splicing factor genes SRSF2 and SF3B1 were observed with higher rates of both haematological malignancies and serious bacterial infections. Additionally, this study observed that probability of infection is directly associated with clonal expansion size. These conclusions showcase the stark differences in patient risk for SBI based not only on the existence of a haematological clone, but on that clone's mutational background and subsequent expansion rate.
This not only confirms, but expands upon previous studies highlighting the effect of clone size on subsequent risk of haematopoietic neoplasms.

Five specific amino acid changes were noted for their more extreme association with SBI. Among those mutations, SRSF2 P95 and SF3B1 K700 had the most robust association, with 25% and 18% of large clone carriers respectively developing a SBI within 10 years. Based upon their predicted growth rate, it is possible that clones carrying these mutations, particularly SRSF2 P95H, will become dominant in the blood within 11 years. Both mutations in SRSF2 and SF3B1 have been studied as change-of-function mutations that likely alter RNA binding activity and affect the differentiation of stem and progenitor cells in the blood. Consistent with results in this study, it is likely that the greater the proportion of blood taken over by these transcriptionally dysfunctional cells, the greater the effect on disease.

Clonal expansions carrying mutations at DNMT3A R882 were shown to affect infectious disease risk to a lesser extent than their splicing factor counterparts, with only 9% of somatic mutation carriers developing a SBI within 10 years. However, these hotspot mutations occur frequently in a general population (DNMT3A R882 occurs at ~3.5x the frequency of SRSF2 P95 mutations, Table 1). Therefore, mutations affecting DNMT3A R882 may be linked to a greater number of absolute SBI cases in a general population than their splicing factor counterparts SRSF2 P95 and SF3B1 K700 and may be equally relevant when assessing patient risk.

Co-occurrent SRSF2 / TET2 mutations have been previously described in myeloid neoplasms. This study identified similar co-occurrences associated with SBI, including SRSF2 P95 / TET2, SRSF2 P95 / IDH2, and TET2 / SF3B1. Each of these co-occurrences increased the probability of SBI occurrences when compared to either TET2 or SRSF2 P95 mutations alone.
Interestingly, these results persisted in individuals who have never been diagnosed with cancer or myelodysplastic syndrome.

This study highlights molecular subtypes of CH that show a significant and dose-dependent effect on infection risk and severity. Large clones carrying SRSF2 P95 or SF3B1 K700 mutations had the greatest effect on immune function and as such should be treated as strong clinical markers for SBI. However, the occurrence rate of SBI was elevated in carriers of any of the five identified hotspot mutations, including individuals with no identified haematological cancer. This suggests that the immune disruption caused by these clonal mutations exists outside of their role in haematological malignancies. Overall, this study suggests clinical markers to identify individuals with a high risk of SBI and therefore improves infection risk stratification and patient care.

Figure 1: CH is associated with development of a serious bacterial infection (SBI) in a 590,537 individual discovery cohort. A. Individuals with retained CH variants are older than individuals with no observed CH variant. Individuals with suggested CH variants that were later filtered out have comparable ages to those with no observed CH variants. B. Most frequently mutated CH genes and somatic variant annotations. C. Somatic mutations in CH genes increase risk of cancer, infection, and other diseases. D. Risk of haematological cancer and SBI increases with clone size. Small clones: 0 < VAF <= 0.1; medium clones: 0.1 < VAF <= 0.4; large clones: 0.4 < VAF <= 1. E. Cumulative incidence of haematological cancer or MDS and SBI is greater in variant carriers in comparison to noncarriers. SBI in carriers occurs in individuals with no record of haematological cancer or MDS. F. CH variants are associated with SBI in individuals with a prior haematological cancer or MDS diagnosis as well as all individuals. G. Recurrent mutations at specific amino acids reduce the likelihood of SBI free survival as clone size increases. For example, 25% of individuals with large clones carrying mutations at P95 in SRSF2 develop SBI within 10 years.

Figure 2: A. Risk of haematological cancer or MDS and serious bacterial infection (SBI) increases with clone size in validation cohort and discovery cohort (889,215 individuals). Small clones: 0 < VAF <= 0.1; medium clones: 0.1 < VAF <= 0.4; large clones: 0.4 < VAF <= 1. B. Recurrent mutations at any of DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12, or at individual residues increases the likelihood of mortality within 28 days of a SBI diagnosis. C. Clonal mutations in TET2 and IDH2 found in combination with SRSF2 P95 have a greater impact on SBI-free survival than SRSF2 P95 alone. D. Clones carrying secondary mutations alongside SRSF2 P95 are often larger in size as represented by variant allele fraction. E. Clonal mutations in SF3B1 and those affecting SRSF2 P95 found in combination with a TET2 mutation have a greater impact on SBI-free survival than TET2 or SF3B1 mutations alone. F. In the UKBB cohort, risk of SBI is higher at all clone sizes in carriers of mutations at DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12 (SNP) than in carriers of a clonal copy number variant (CNV). G. In the UKBB cohort, clones carrying both a CNV and a mutation at positions DNMT3A R882, SF3B1 K700, or SRSF2 P95 have a reduced likelihood of SBI-free survival than those carrying either mutation type alone. H. In the VUMC cohort, clonal mutations at DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12 are most strongly associated with bacterial sepsis and pneumonia caused by Staphylococcus, Streptococcus, and Klebsiella strains. Gram negative (dark green) and positive (light green) strains are distributed evenly with varying levels of significance.

Characteristic | UKBB | GEL | VUMC | AoU | MGBB
Sample size, n | 448,083 | 50,243 | 92,211 | 245,333 | 53,345
Gender, % female | 54.3% | 55.3% | 58.5% | 60.5% | 55.6%
Age, median (years) | 58 | 49 | 60 | 56 | 54
Follow-up time, median months (range) | 169 (0-205) | 73 (0-465) | 133 (0-309) | - | -
Blood cancer, n (%) | 2,260 (0.50%) | 351 (0.70%) | 898 (1.00%) | 3,306 (1.35%) | 3,137 (5.88%)
Serious bacterial infection, n (%) | 24,779 (5.53%) | 3,538 (7.04%) | 7,262 (7.88%) | 17,915 (7.30%) | 12,635 (23.69%)
Clonal haematopoiesis, n (%) | 33,469 (7.47%) | 2,444 (4.86%) | 4,647 (5.04%) | - | -
Large expanded clone, n (%) | 3,239 (0.72%) | 704 (1.40%) | 1,244 (1.35%) | - | -
DNMT3A, n (%) | 13,462 (3.00%) | 587 (1.16%) | 1,174 (1.27%) | 5,273 (2.15%) | 2,201 (4.13%)
TET2, n (%) | 5,044 (1.13%) | 370 (0.74%) | 741 (0.80%) | 2,814 (1.15%) | 848 (1.59%)
ASXL1, n (%) | 2,375 (0.53%) | 91 (0.18%) | 264 (0.29%) | 1,486 (0.61%) | 367 (0.69%)
SRSF2, n (%) | 432 (0.09%) | 49 (0.10%) | 106 (0.11%) | 384 (0.16%) | 115 (0.22%)
SF3B1, n (%) | 439 (0.09%) | 52 (0.10%) | 215 (0.23%) | 429 (0.17%) | 58 (0.11%)
DNMT3A R882, n (%) | 1,121 (0.25%) | 76 (0.15%) | 184 (0.20%) | 223 (0.09%) | 266 (0.50%)
SRSF2 P95, n (%) | 276 (0.06%) | 40 (0.08%) | 84 (0.09%) | 245 (0.10%) | 130 (0.24%)
SF3B1 K700, n (%) | 77 (0.02%) | 16 (0.03%) | 84 (0.09%) | 116 (0.05%) | 50 (0.09%)
NRAS G12, n (%) | 23 (0.005%) | 2 (0.004%) | 3 (0.003%) | 44 (0.02%) | 12 (0.02%)
IDH2 R140, n (%) | 103 (0.02%) | 17 (0.03%) | 37 (0.04%) | 5 (0.002%) | 29 (0.05%)

Table 1. Demographics of study cohorts.

Methods

Study population

Access to whole exome sequencing and phenotypic data from the UK Biobank (UKBB) resource was acquired (application ID: 33751). Informed consent was provided for all participants.
Phenotypes were provided at initial assessment and were updated through routinely available national health datasets. From the 500,000 registered UKBB participants, 448,083 participants were selected for analysis based on the availability of whole exome sequencing data and matching phenotypic information. Participants were aged between 37 and 73 years old at time of registration (median age: 58 years old) and 54% were females. Whole genome sequencing and matching phenotypic data was acquired for 50,243 participants from Genomics England (Research Registry ID: RR726). The average age of participants at time of blood extraction was 49 years with 55% females.

Whole genome sequencing data generated by Nashville Biosciences using Illumina technology was accessed via the ICA platform. Of the enrolled participants, 92,211 were selected for inclusion in further analysis on the basis of phenotypic data availability. The average age of participants at time of blood extraction was 60 years with 59% females. Summary statistics from linear model association tests between genotype categories and disease were provided for 53,345 individuals from Mass General Brigham Biobank (MGBB) (average age at time of blood extraction: 54 years, 56% females), and 245,333 individuals from the All of Us dataset (average age at time of blood extraction: 56 years, 61% females).

CH variant calling and filtering

UKBB cohort whole exome sequencing CRAM files aligned to reference genome hg38 were made available through the DNAnexus portal. A DNAnexus runnable applet was built to generate variant call files (VCFs) containing potential clonal mutations. Hg38 aligned BAM files from Genomics England whole genome sequencing data were made available on the Genomics England research environment and Nashville Biosciences whole genome sequencing data was provided through the ICA platform.
For all datasets, the Genome Analysis Toolkit’s (v4.3.0.0) Mutect2 variant caller was run on each CRAM / BAM file, using both a germline resource VCF and panel of normals to compensate for the lack of comparative normal tissue. Intervals covering the exons of 43 previously defined CH genes were supplied with an additional padding of 10 base pairs. To allow for the detection of small clones, minimum variant allele fraction (VAF) filters were lowered to 0. The FilterMutectCalls tool was run on the resulting VCF, filtering for a minimum of 2 reads covering the alternative allele.

Indels were left-aligned and normalised and multiallelic sites were split into multiple rows using bcftools norm (v1.9). Processed VCF files were merged into batches for speed of processing. Variants were annotated with COSMIC IDs and COSMIC allele counts then annotated with Variant Effect Predictor (VEP) (v91.3-0). Batches were merged per dataset to allow for further variant quality filtering.

Quality filters were applied to ensure variants had a median GERMQ score (a quality score generated by Mutect2) of greater than 10 and a median POPAF score (a score based on the variant's presence in the supplied germline resource VCF) of greater than 3. Variants had to be labelled as “PASS” for the Mutect2 internal filters in at least one occurrence. For genotype filters, missingness was assumed if the site was covered by fewer than 7 reads, or if the allelic depth was less than 4, or less than 2 for variants that had previously been described at least once in the COSMIC database. Using population allele frequency (AF) annotations from gnomAD, high or moderate impact variants (as labelled by VEP) were filtered to select for those with an AF less than 0.0005. Variants in a list of known CH associated regions or variants seen in 10 or more COSMIC cases were filtered to select for those with an AF less than 0.001.
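As a rough illustration of the genotype and allele-frequency filters just described, the thresholds can be written as two small predicates. This is a minimal sketch under stated assumptions, not the pipeline used in the study: the function and parameter names are hypothetical, and letting the 0.001 cut-off take precedence for known CH regions or recurrent COSMIC variants is an assumption about how the two rules combine.

```python
def genotype_is_missing(depth, alt_depth, in_cosmic):
    """Treat a genotype as missing when read support is too low.

    depth: total reads covering the site.
    alt_depth: reads supporting the alternative allele.
    in_cosmic: True if the variant is described at least once in COSMIC.
    """
    min_alt_depth = 2 if in_cosmic else 4  # relaxed for known COSMIC variants
    return depth < 7 or alt_depth < min_alt_depth


def passes_population_af(af, impact, in_known_ch_region=False, cosmic_cases=0):
    """Apply the population allele-frequency cut-offs described in the text."""
    if in_known_ch_region or cosmic_cases >= 10:
        return af < 0.001  # assumed to take precedence (hypothetical ordering)
    if impact in {"HIGH", "MODERATE"}:  # VEP impact labels
        return af < 0.0005
    return False  # assumption: other variants are not retained
```

For example, `genotype_is_missing(10, 3, in_cosmic=False)` treats the genotype as missing, while the same read counts for a COSMIC-listed variant are kept.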
Additional filtering tags were applied to the resulting variants as described below.

CNA data for UKBB individuals was downloaded from UKBB return 3094 and EIDs were converted to align with the application-specific EIDs used in this study.

Median VAF tag

Variants are grouped by their unique variant ID and the range of VAFs that have been observed is assessed. A true clonal variant is likely to exist on some but not all haplotype reads and is likely to be heterozygous. Therefore, true clonal mutations are likely to have a VAF of below 0.35 for heterozygous variants, whereas heterozygous germline variants exist with a VAF of around 0.5. Variants are labelled with the ‘PASS’ tag if the median VAF across all observations is less than 0.35.

Phasing tag

Following the above principle, that a true clonal mutation is likely to exist on some but not all haplotype reads, we used phasing to tag likely clonal mutations. On the UK Biobank DNAnexus portal, Genomics England research environment, and ICA platform, we pulled all reads covering potential mutations from the original CRAM/BAM files. We then assessed whether those reads also covered a common, heterozygous, germline SNP (SNPs occurring more than 5 times in the GnomAD germline database). Using that germline SNP, we divided the reads into haplotypes. Variants were labelled with the ‘PASS’ tag if they existed in only one of the haplotypes and if, within that haplotype, they did not exist on all reads. This tag could be assigned only for the small number of variants that coexisted on reads with a common germline SNP; all other variants were labelled as ‘UNCALLABLE’ for this tag.

Neighbourhood germline VAF tag

Applying a hard variant allele fraction cut-off to individual variants can be problematic where allelic imbalance during the PCR amplification process leads to one allele being sequenced more than another.
To mitigate this, we assessed the VAFs of all common, heterozygous, germline SNPs (SNPs occurring more than 5 times in the GnomAD germline database) within a 1 megabase region of our variants of interest. We applied a cumulative density function to our germline variants to determine the likelihood that a potential clonal variant’s VAF deviates significantly from the germline VAFs in the region. Variants with a significant P value for this test (false discovery rate corrected p value of less than 0.01) were labelled with the ‘PASS’ tag.

PrimateAI-3D tag

The PrimateAI-3D variant annotation tool was used to assess variant pathogenicity. Variants were labelled with the ‘PASS’ tag if their PrimateAI-3D score was above the median score per gene.

A final list of clonal variants was created from variants that were labelled as ‘PASS’ by the AF based tag and Median VAF tag, in addition to being labelled ‘PASS’ or ‘UNCALLABLE’ by the phasing tag, neighbourhood germline VAF tag, and PrimateAI-3D tag. Variants that passed the above filter in any tested individual were included in the final analysis.

Statistical analysis

Association between clonal mutations in previously described CH genes and phenotypes was tested using a Cox proportional hazard model, correcting for age and sex. Tested ICD10 codes were grouped into related phenotypes (Table S12), where time of disease was taken as the earliest date of diagnosis for ICD10 codes in the phenotype group. Analysis start time was taken as the date of blood sample collection. End time was either the diagnosis date of phenotype, diagnosis date of blood cancer, recorded date of death, or last date of follow-up if none of the previous events occurred.
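The follow-up window defined above (start at blood sample collection, end at the first qualifying event) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name and arguments are hypothetical, the earliest of the listed events is assumed to end follow-up, and a blood cancer diagnosis or death occurring before the phenotype is assumed to censor the individual rather than count as an event. Individuals diagnosed before sample collection are not handled here.

```python
from datetime import date
from typing import Optional, Tuple


def follow_up(sample_date: date,
              phenotype_dx: Optional[date],
              blood_cancer_dx: Optional[date],
              death: Optional[date],
              last_follow_up: date) -> Tuple[float, bool]:
    """Return (duration in years, event observed) for one individual."""
    # Collect whichever of the terminating events were recorded.
    recorded = [d for d in (phenotype_dx, blood_cancer_dx, death) if d is not None]
    # Assumption: the earliest recorded event ends follow-up; if none
    # occurred, follow-up ends at the last date of follow-up.
    end = min(recorded) if recorded else last_follow_up
    # The event of interest is the phenotype diagnosis itself.
    event = phenotype_dx is not None and end == phenotype_dx
    duration_years = (end - sample_date).days / 365.25
    return duration_years, event
```

The resulting duration and event indicator are the two columns a Cox proportional hazard fit typically expects alongside the covariates (here, age and sex).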
Results of this analysis in UKB, GEL, and NashBio datasets were compiled into a meta-analysis using a fixed effects model from the metafor R library (v4.6-0). Significant genotype-phenotype relationships were identified as those with a false discovery rate corrected p value below 0.05 and a hazard ratio above 1.2.

Genes with an increasing relationship with both haematological cancer or MDS and SBI as clone size increases were further explored. Within these genes, residues driving strong associations were identified by subsetting carriers of mutations at each commonly mutated amino acid and comparing their association with disease to that of non-carriers. Residue based hazard ratios and p values were then compared to those derived from the whole gene to identify any residues which have a stronger effect on disease than the whole gene, if applicable.

Variant calls affecting genes of interest (DNMT3A, ASXL1, TET2, SRSF2, SF3B1, and IDH2), or specific residues of interest (DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12), from UKBB, Genomics England, and Nashville Biosciences data were anonymised, including jittering ages at disease diagnosis and blood extraction (by a standard deviation of 0.5) and binning variant allele fractions into ‘small’, ‘medium’, and ‘large’ clone sizes (small: 0 < VAF <= 0.2, medium: 0.2 < VAF <= 0.4, large: 0.4 < VAF < 1). Data from the three studies were compiled into a large mega-analysis of 590,537 samples where survival analysis was performed using a Cox proportional hazard model as described above.

To test for a relationship between clone size and disease risk, individuals were defined as having a small, medium, or large sized clone based on VAF (small: 0 < VAF <= 0.2, medium: 0.2 < VAF <= 0.4, large: 0.4 < VAF < 1).
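The clone-size bins stated above translate directly into a small helper. The following Python sketch is illustrative; the function name and the choice to return `None` for VAFs outside the open interval (0, 1) are assumptions, while the bin edges are taken from the text.

```python
from typing import Optional


def clone_size(vaf: float) -> Optional[str]:
    """Bin a variant allele fraction into a clone-size category."""
    # VAFs of exactly 0 or 1 (or outside that range) fall in no bin.
    if not 0 < vaf < 1:
        return None
    if vaf <= 0.2:
        return "small"    # 0 < VAF <= 0.2
    if vaf <= 0.4:
        return "medium"   # 0.2 < VAF <= 0.4
    return "large"        # 0.4 < VAF < 1
```

For example, `clone_size(0.15)` falls in the small bin and `clone_size(0.4)` in the medium bin, because the upper edge of each bin is inclusive.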
Cox proportional hazard models were run per HUGO gene symbol, for small, medium, and large clones separately.

Cumulative incidence of haematological cancer or MDS and SBI was assessed using R library casebase (v0.10.6), fitting smooth hazards models on carriers of mutations at DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12 and non-carriers separately.

The association of mutated residues with 28-day mortality with SBI was assessed by performing a Cox proportional hazard model for medium and large sized clones. Individuals with a recorded date of death within 28 days of recorded SBI diagnosis were compared to individuals with SBI and either no recorded date of death or date of death more than 28 days after SBI diagnosis. Relative risk was estimated from the derived hazard ratios by accounting for the baseline probability of the outcome in the tested population ($P_p$) with the following formula:

$$\text{Relative Risk} = \frac{\text{Hazard ratio}}{(1 - P_p) + (P_p \times \text{Hazard ratio})}$$

Carriers of more than one clonal mutation were assessed for their association to disease in comparison to individuals with no identifiable mutation and those with a single mutation via a Cox proportional hazard model. For UKBB dataset only (due to data availability), carriers of CNVs in addition to mutations affecting residues of interest were compared to individuals with no identifiable mutation and those with a single mutation via a Cox proportional hazard model.

Bacterial strains driving the CH-SBI relationship were assessed in the NashBio dataset only (due to data availability). Where available, individuals were labelled with an identified bacterial strain where the date of diagnosis for that strain was within 30 days of SBI diagnosis.
Carriers of clonal mutations at DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12 were compared to non-carriers for an association to SBI driven by a specific bacterial strain (including Pseudomonas, Enterococcus, Escherichia coli, Mycobacteria, Streptococcus, Staphylococcus, or Klebsiella).

Validation in additional datasets

To validate the association between somatic mutations in ASXL1, DNMT3A, TET2, SF3B1, SRSF2, IDH2, and NRAS and bacterial sepsis or pneumonia infections, results were replicated in two additional datasets. For data from All of Us and MGBB cohorts, Samtools Mpileup was used to call variants at 19,822 positions in the above genes that passed filters in UKBB, GEL and Nashville Biosciences data as per the previously described specifications. This validation was performed in 53,345 samples from MGBB, and 245,333 samples from the All of Us dataset (Table 1).

Due to limitations in data transfer and data availability, summary statistics from a linear model testing an association between the presence of any validated variant in ASXL1, DNMT3A, TET2, SF3B1, and SRSF2 and SBI in UKBiobank, Genomics England, Nashville Biosciences, MassGen Biobank, and All of Us were aggregated and combined into a meta-analysis using a fixed effects model from the metafor R library (v4.6-0). In each dataset, variants were grouped into genes and tested for their association with haematological cancers or MDS and SBI in a linear model compared to individuals with no detected clone, correcting for sex and age at DNA extraction. Variants were further grouped by clone size using VAF as a marker (small: 0 < VAF <= 0.2, medium: 0.2 < VAF <= 0.4, large: 0.4 < VAF < 1) and each group was tested for association with disease.
To test for an increasing association with disease correlating with an increase in clone size, the categorical clone size variable was converted to a continuous variable (no clone: 0, small: 1, medium: 2, large: 3) and tested in a linear model correcting for sex and age at DNA extraction.

References

1. Pich O, Reyes-Salazar I, Gonzalez-Perez A, Lopez-Bigas N. Discovering the drivers of clonal hematopoiesis. Nat Commun [Internet] 2022;13(1). Available from: http://dx.doi.org/10.1038/s41467-022-31878-0
2. Jaiswal S, Ebert BL. Clonal hematopoiesis in human aging and disease. Science [Internet] 2019;366(6465). Available from: http://dx.doi.org/10.1126/science.aan4673
3. Bowman RL, Busque L, Levine RL. Clonal Hematopoiesis and Evolution to Hematopoietic Malignancies. Cell Stem Cell 2018;22(2):157-70.
4. Watson CJ, Papula AL, Poon GYP, et al. The evolutionary dynamics and fitness landscape of clonal hematopoiesis. Science 2020;367(6485):1449-54.
5. Liggett LA, Sankaran VG. Unraveling Hematopoiesis through the Lens of Genomics. Cell 2020;182(6):1384-400.
6. Kessler MD, Damask A, O’Keeffe S, et al. Common and rare variant associations with clonal haematopoiesis phenotypes. Nature 2022;612(7939):301-9.
7. Kar SP, Quiros PM, Gu M, et al. Genome-wide analyses of 200,453 individuals yield new insights into the causes and consequences of clonal hematopoiesis. Nat Genet 2022;54(8):1155-66.
8. Wang BA, Mehta HM, Penumutchu SR, et al. Alternatively spliced CSF3R isoforms in SRSF2 P95H mutated myeloid neoplasms. Leukemia 2022;36(10):2499-508.
9. Kon A, Yamazaki S, Nannya Y, et al. Physiological Srsf2 P95H expression causes impaired hematopoietic stem cell functions and aberrant RNA splicing in mice. Blood 2018;131(6):621-35.
10. Smeets MF, Tan SY, Xu JJ, et al. Srsf2 P95H initiates myeloid bias and myelodysplastic/myeloproliferative syndrome from hemopoietic stem cells. Blood 2018;132(6):608-21.
11. Fabre MA, de Almeida JG, Fiorillo E, et al. The longitudinal dynamics and natural history of clonal haematopoiesis. Nature 2022;606(7913):335-42.
12. Wang L, Lawrence MS, Wan Y, et al. SF3B1 and Other Novel Cancer Genes in Chronic Lymphocytic Leukemia. N Engl J Med 2011;365(26):2497-506.
13. Chotirat S, Thongnoppakhun W, Promsuwicha O, Boonthimat C, Auewarakul CU. Molecular alterations of isocitrate dehydrogenase 1 and 2 (IDH1 and IDH2) metabolic genes and additional genetic mutations in newly diagnosed acute myeloid leukemia patients. J Hematol Oncol 2012;5(1):5.
14. Ashraf S, Noguera NI, Di Giandomenico J, Zaza S, Hasan SK, Lo-Coco F. Rapid detection of IDH2 (R140Q and R172K) mutations in acute myeloid leukemia. Ann Hematol 2013;92(10):1319-23.
15. Genovese G, Kahler AK, Handsaker RE, et al. Clonal hematopoiesis and blood-cancer risk inferred from blood DNA sequence. N Engl J Med 2014;371(26):2477-87.
16. Jaiswal S, Fontanillas P, Flannick J, et al. Age-Related Clonal Hematopoiesis Associated with Adverse Outcomes. N Engl J Med 2014;371(26):2488-98.
17. Abelson S, Collord G, Ng SWK, et al. Prediction of acute myeloid leukaemia risk in healthy individuals. Nature 2018;559(7714):400-4.
18. Desai P, Mencia-Trinchant N, Savenkov O, et al. Somatic mutations precede acute myeloid leukemia years before diagnosis. Nat Med 2018;24(7):1015-23.
19. Jaiswal S, Natarajan P, Silver AJ, et al. Clonal Hematopoiesis and Risk of Atherosclerotic Cardiovascular Disease. N Engl J Med 2017;377(2):111-21.
20. Bhattacharya R, Zekavat SM, Haessler J, et al. Clonal Hematopoiesis is Associated With Higher Risk of Stroke. Stroke 2022;53(3):788-97.
21. Miller PG, Qiao D, Rojas-Quintero J, et al. Association of clonal hematopoiesis with chronic obstructive pulmonary disease. Blood 2022;139(3):357-68.
22. Vlasschaert C, Robinson-Cohen C, Chen J, et al. Clonal hematopoiesis of indeterminate potential is associated with acute kidney injury. Nat Med [Internet] 2024; Available from: http://dx.doi.org/10.1038/s41591-024-02854-6
23. Wong WJ, Emdin C, Bick AG, et al. Clonal haematopoiesis and risk of chronic liver disease. Nature 2023;616(7958):747-54.
24. Weeks LD, Ebert BL. Causes and consequences of clonal hematopoiesis. Blood 2023;142(26):2235-46.
25. Arends CM, Weiss M, Christen F, et al. Clonal hematopoiesis in patients with antineutrophil cytoplasmic antibody-associated vasculitis. Haematologica 2020;105(6):e264-7.
26. Bernstein N, Spencer Chapman M, Nyamondo K, et al. Analysis of somatic mutations in whole blood from 200,618 individuals identifies pervasive positive selection and novel drivers of clonal hematopoiesis. Nat Genet 2024;56(6):1147-55.
27. Zekavat SM, Lin S-H, Bick AG, et al. Hematopoietic mosaic chromosomal alterations increase the risk for diverse types of infection. Nat Med 2021;27(6):1012-24.
28. Vlasschaert C, Akwo E, Robinson-Cohen C, et al. Infection risk associated with clonal hematopoiesis of indeterminate potential is partly mediated by hematologic cancer transformation in the UK Biobank. Leukemia [Internet] 2023; Available from: http://dx.doi.org/10.1038/s41375-023-02023-7
29. Sudlow C, Gallacher J, Allen N, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Med [Internet] 2015;12(3). Available from: http://dx.doi.org/10.1371/journal.pmed.1001779
30. Turnbull C, Scott RH, Thomas E, et al. The 100000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ 2018;361:k1687.
31. Boutin NT, Schecter SB, Perez EF, et al. The evolution of a large Biobank at Mass General Brigham. J Pers Med 2022;12(8):1323.
32. Ramirez AH, Sulieman L, Schlueter DJ, et al. The All of Us Research Program: Data quality, utility, and diversity. Patterns (N Y) 2022;3(8):100570.
33. Van der Auwera GA, O’Connor BD. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. 1st Edition. O’Reilly Media; 2020.
34. Gao H, Hamp T, Ede J, et al. The landscape of tolerated genetic variation in humans and primates. Science [Internet] 2023;380(6648). Available from: http://dx.doi.org/10.1126/science.abn8197
35. Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell 2011;144(5):646-74.
36. Kim E, Ilagan JO, Liang Y, et al. SRSF2 Mutations Contribute to Myelodysplasia by Mutant-Specific Effects on Exon Recognition. Cancer Cell 2015;27(5):617-30.
37. Bapat A, Keita N, Martelly W, et al. Myeloid Disease Mutations of Splicing Factor SRSF2 Cause G2-M Arrest and Skewed Differentiation of Human Hematopoietic Stem and Progenitor Cells. Stem Cells 2018;36(11):1663-75.
38. Cockey SG, Zhang H, Hussaini MO, et al. A Large Cohort Study of 412 Patients with SRSF2/TET2 Co-Mutated Myeloid Neoplasms: The Molecular Landscape and Clinical Outcomes. Blood 2023;142(Supplement 1):1882-1882.
39. Li H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 2011;27(21):2987-93.
40. Sondka Z, Dhir NB, Carvalho-Silva D, et al. COSMIC: a curated database of somatic variants and clinical data for cancer. Nucleic Acids Res 2024;52(D1):D1210-7.
41. McLaren W, Gil L, Hunt SE, et al. The Ensembl Variant Effect Predictor. Genome Biol 2016;17(1):1-14.
42. Chen S, Francioli LC, Goodrich JK, et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 2024;625(7993):92-100.
43. Patrinos GP, Lazaro C, Lerner-Ellis J, Spurdle A. Clinical DNA Variant Interpretation. 1st ed. Academic Press; 2021.
44. Viechtbauer W. Conducting meta-analyses in R with the metafor package. J Stat Softw 2010;36(3):1-48.
45. Li H, Handsaker B, Wysoker A, et al. The Sequence Alignment/Map (SAM) Format and SAMtools 1000 Genome Project Data Processing Subgroup. Bioinformatics 2009;25(16):1-2.

Appendix B

[Figure 1 image. Recoverable panel labels: Not Observed, Filtered Out, Retained; CH Variant Non-Carrier, Carrier (6.9%); Haematological cancer or MDS; ASXL1, DNMT3A, TET2, SF3B1, SRSF2.]

Figure 1: CH is associated with development of a serious bacterial infection (SBI) in a 590,537 individual exploration cohort. A. Individuals with retained CH variants are older than individuals with no observed CH variant. Individuals with suggested CH variants that were later filtered out have comparable ages to those with no observed CH variants. B. Most frequently mutated CH genes and somatic variant annotations. C. Somatic mutations in CH genes increase risk of cancer, infection, and other diseases. D. Risk of haematological cancer and SBI increases with clone size. Small clones: 0 < VAF <= 0.1, medium clones: 0.1 < VAF <= 0.4, large clones: 0.4 < VAF <= 1. E. Cumulative incidence of haematological cancer or MDS and SBI is greater in variant carriers in comparison to non-carriers. SBI in carriers occurs in individuals with no record of haematological cancer or MDS. F. CH variants are associated with SBI in individuals with a prior haematological cancer or MDS diagnosis as well as all individuals. G. Recurrent mutations at specific amino acids reduce the likelihood of SBI free survival as clone size increases. For example, 25% of individuals with large clones carrying mutations at P95 in SRSF2 develop SBI within 10 years.

[Figure 2 image. Recoverable panel labels: Discovery and validation cohort; 28-day mortality with serious bacterial infection; SRSF2 P95 mutation; UKBB; VUMC; x-axis: Time from genotyping (years).]

Figure 2: A. Risk of haematological cancer or MDS and serious bacterial infection (SBI) increases with clone size in validation cohort and discovery cohort (889,215 individuals). Small clones: 0 < VAF <= 0.1, medium clones: 0.1 < VAF <= 0.4, large clones: 0.4 < VAF <= 1. B. Recurrent mutations at any of DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12, or at individual residues, increase the likelihood of mortality within 28 days of an SBI diagnosis. C. Clonal mutations in TET2 and IDH2 found in combination with SRSF2 P95 have a greater impact on SBI-free survival than SRSF2 P95 alone. D. Clones carrying secondary mutations alongside SRSF2 P95 are often larger in size as represented by variant allele fraction. E. Clonal mutations in SF3B1 and those affecting SRSF2 P95 found in combination with a TET2 mutation have a greater impact on SBI-free survival than TET2 or SF3B1 mutations alone. F. In the UKBB cohort, risk of SBI is higher at all clone sizes in carriers of mutations at DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12 (SNP) than in carriers of a clonal copy number variant (CNV). G. In the UKBB cohort, clones carrying both a CNV and a mutation at positions DNMT3A R882, SF3B1 K700, or SRSF2 P95 have a lower likelihood of SBI-free survival than those carrying either mutation type alone. H. In the VUMC cohort, clonal mutations at DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, or NRAS G12 are most strongly associated with bacterial sepsis and pneumonia caused by Staphylococcus, Streptococcus, and Klebsiella strains.
Gram negative (dark green) and positive (light green) strains are distributed evenly with varying levels of significance.

[Supplementary Figure 1 flowchart. Recoverable text:
Discovery cohorts: 448,083 filtered participants (average age: 58 years, 54% females); 50,243 filtered participants (average age: 49 years, 55% females); 92,211 filtered participants (average age: 60 years, 59% females).
Filters: Median GERMQ > 10; Median POPAF > 3; AD > 7 OR AD > 4 for COSMIC variants; GnomAD AF < 0.0005 OR < 0.001 for COSMIC variants.
19,822 unique variants detected across 43 CH genes.
Meta-analysis: Identification of relationship between CH and serious bacterial infections. Bacterial sepsis/pneumonia associated variants detected: DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, NRAS G12. 0.36% mutation rate.
Mega-analysis: Disease association increases with clone size.
Validation cohorts: 53,345 filtered participants; 245,333 filtered participants. Samtools Mpileup over 19,822 filtered variants. Bacterial sepsis/pneumonia associated variants detected: DNMT3A R882, SRSF2 P95, SF3B1 K700, IDH2 R140, NRAS G12. 0.37% mutation rate.
Meta-analysis: Validation of relationship between clone size and serious bacterial infections.]

Supplementary Figure 1: An overview of the analysis pipeline included in this study, showing the makeup of cohorts involved in both discovery and validation phases.

[Supplementary Figure 2 image. Recoverable panel labels: Genomics England; UK BioBank; Non-carrier; CH variant carrier (4.86%); Common diseases; Cancer; Severe infection.]

Supplementary Figure 2: A. Individuals with retained CH variants are older than individuals with no observed CH variant in individual cohorts. B. Somatic mutations in CH genes increase risk of cancer, infection, and other diseases in individual cohorts. C. Somatic mutations in CH genes increase risk of cancer, infection, and other diseases in meta-analysis of logistic regression tests in the discovery cohort.

Claims

CLAIMS

We claim:

1. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: generate, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; determine that a candidate variant from among the variant calls satisfies one or more threshold variant allele fractions (VAFs) based on a variant allele fraction (VAF) of the candidate variant among the nucleotide reads; determine, from phased nucleotide reads for the target genomic region of the genomic sample, the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype; determine that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, wherein the subset of germline variants are located within a threshold number of nucleobases of the candidate variant; and classify the candidate variant as a somatic variant based on the VAF of the candidate variant satisfying the one or more threshold VAFs, the candidate variant segregating into the portion of the phased nucleotide reads, and the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine that a pathogenicity score of the candidate variant satisfies a threshold pathogenicity score; and classify the candidate variant as a somatic variant further based on the pathogenicity score of the candidate variant satisfying the threshold pathogenicity score.

3. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine that the candidate variant satisfies the one or more threshold VAFs by determining that the VAF of the candidate variant is less than or equal to a lower threshold VAF.

4. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine that the candidate variant satisfies the one or more threshold VAFs by determining that the VAF of the candidate variant exceeds or equals an upper threshold VAF.

5. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine that the candidate variant segregates into the portion of the phased nucleotide reads by: identifying, from among the nucleotide reads, a set of nucleotide reads carrying one or more germline variants; determining, based on the one or more germline variants, a first subset of nucleotide reads corresponding with a first parental haplotype and a second subset of nucleotide reads corresponding with a second parental haplotype, wherein the phased nucleotide reads comprise the first subset of nucleotide reads and the second subset of nucleotide reads; and determining that the candidate variant segregates into the portion of the first subset of nucleotide reads or the portion of the second subset of nucleotide reads.

6. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine that the VAF of the candidate variant deviates from the VAF for the subset of germline variants within the deviation threshold by: generating, utilizing a density function, a distribution of the VAFs of the subset of germline variants; determining a probability of observing the VAF of the candidate variant given the distribution of the VAFs of the subset of germline variants; and determining that the probability of observing the VAF of the candidate variant is within the deviation threshold.

7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine that an additional candidate variant from among the variant calls fails to satisfy the one or more threshold VAFs based on a VAF of the additional candidate variant among the nucleotide reads; or determine that an additional pathogenicity score of the additional candidate variant fails to satisfy a threshold pathogenicity score; determine that the VAF of the additional candidate variant deviates from VAFs of an additional subset of germline variants by the deviation threshold, wherein the additional subset of germline variants are located within the threshold number of nucleobases of the additional candidate variant; and classify the additional candidate variant as a somatic variant based on the VAF of the additional candidate variant deviating from the VAFs of the additional subset of germline variants within the deviation threshold.

8. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: determine that an additional candidate variant from among the variant calls fails to satisfy the one or more threshold VAFs based on a VAF of the additional candidate variant among the nucleotide reads; determine that an additional pathogenicity score of the additional candidate variant fails to satisfy a threshold pathogenicity score; or determine that the VAF of the additional candidate variant does not deviate from VAFs of an additional subset of germline variants by the deviation threshold, wherein the additional subset of germline variants are located within the threshold number of nucleobases of the additional candidate variant; and determine, from the phased nucleotide reads for the target genomic region of the genomic sample, that the additional candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype; and classify the additional candidate variant as a somatic variant based on the additional candidate variant segregating into a portion of the phased nucleotide reads for the one parental haplotype.

9. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to identify, based on classifying the candidate variant as a somatic variant, a biochemical compound for treatment.

10. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to provide, for display via a user interface of a client device, an indicator that a subject corresponding with the genomic sample has clonal hematopoiesis of indeterminate potential (CHIP), wherein the genomic sample comprises deoxyribonucleic acid (DNA) from a blood sample or a saliva sample.

11. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to determine the candidate variant by: determining that the candidate variant is in a known clonal-hematopoiesis-associated region in a threshold number of samples from a mutation database; determining that an allele frequency of the candidate variant is less than or satisfies a threshold population allele frequency for the threshold number of samples; and selecting, based on the allele frequency of the candidate variant satisfying the threshold population allele frequency, the candidate variant for labeling derived from the one or more threshold VAFs, segregation into one parental haplotype, and the deviation threshold.

12. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to classify the candidate variant as a somatic variant by classifying the candidate variant as a clonal variant.

13. A computer-implemented method comprising: generating, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; determining that a candidate variant from among the variant calls satisfies one or more threshold variant allele fractions (VAFs) based on a variant allele fraction (VAF) of the candidate variant among the nucleotide reads; determining, from phased nucleotide reads for the target genomic region of the genomic sample, the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype; determining that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, wherein the subset of germline variants are located within a threshold number of nucleobases of the candidate variant; and classifying the candidate variant as a somatic variant based on the VAF of the candidate variant satisfying the one or more threshold VAFs, the candidate variant segregating into the portion of the phased nucleotide reads, and the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

14. The computer-implemented method of claim 13, further comprising: determining that a pathogenicity score of the candidate variant satisfies a threshold pathogenicity score; and classifying the candidate variant as a somatic variant further based on the pathogenicity score of the candidate variant satisfying the threshold pathogenicity score.

15. The computer-implemented method of claim 13, wherein determining that the candidate variant satisfies the one or more threshold VAFs comprises determining that the VAF of the candidate variant is less than or equal to a lower threshold VAF.

16. The computer-implemented method of claim 13, wherein determining that the candidate variant satisfies the one or more threshold VAFs comprises determining that the VAF of the candidate variant exceeds or equals an upper threshold VAF.

17. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: generate, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; determine that a candidate variant from among the variant calls satisfies one or more threshold variant allele fractions (VAFs) based on a variant allele fraction (VAF) of the candidate variant among the nucleotide reads; determine, from phased nucleotide reads for the target genomic region of the genomic sample, the candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype; determine that the VAF of the candidate variant deviates from VAFs of a subset of germline variants within a deviation threshold, wherein the subset of germline variants are located within a threshold number of nucleobases of the candidate variant; and classify the candidate variant as a somatic variant based on the VAF of the candidate variant satisfying the one or more threshold VAFs, the candidate variant segregating into the portion of the phased nucleotide reads, and the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

18. The non-transitory computer-readable medium of claim 17, further comprising instructions that, when executed by the at least one processor, cause the computing device to: determine that an additional candidate variant from among the variant calls fails to satisfy the one or more threshold VAFs based on a VAF of the additional candidate variant among the nucleotide reads; or determine that an additional pathogenicity score of the additional candidate variant fails to satisfy a threshold pathogenicity score; determine that the VAF of the additional candidate variant deviates from VAFs of an additional subset of germline variants by the deviation threshold, wherein the additional subset of germline variants are located within the threshold number of nucleobases of the additional candidate variant; and classify the additional candidate variant as a somatic variant based on the VAF of the additional candidate variant deviating from the VAFs of the additional subset of germline variants within the deviation threshold.

19. The non-transitory computer-readable medium of claim 17, further comprising instructions that, when executed by the at least one processor, cause the computing device to: determine that an additional candidate variant from among the variant calls fails to satisfy the one or more threshold VAFs based on a VAF of the additional candidate variant among the nucleotide reads; determine that an additional pathogenicity score of the additional candidate variant fails to satisfy a threshold pathogenicity score; or determine that the VAF of the additional candidate variant does not deviate from VAFs of an additional subset of germline variants by the deviation threshold, wherein the additional subset of germline variants are located within the threshold number of nucleobases of the additional candidate variant; and determine, from the phased nucleotide reads for the target genomic region of the genomic sample, that the additional candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype; and classify the additional candidate variant as a somatic variant based on the additional candidate variant segregating into a portion of the phased nucleotide reads for the one parental haplotype.

20. The non-transitory computer-readable medium of claim 17, further comprising instructions that, when executed by the at least one processor, cause the computing device to provide, for display via a user interface of a client device, an indicator that a subject corresponding with the genomic sample has clonal hematopoiesis of indeterminate potential (CHIP), wherein the genomic sample comprises deoxyribonucleic acid (DNA) from a blood sample or a saliva sample.

21. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: generate, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; phase the nucleotide reads for the target genomic region of the genomic sample according to a set of parental haplotypes; determine, from the phased nucleotide reads for the target genomic region of the genomic sample, a candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype of the set of parental haplotypes; and classify the candidate variant as a somatic variant based on the candidate variant segregating into the portion of the phased nucleotide reads.

22. A computer-implemented method comprising: generating, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; phasing the nucleotide reads for the target genomic region of the genomic sample according to a set of parental haplotypes; determining, from the phased nucleotide reads for the target genomic region of the genomic sample, a candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype of the set of parental haplotypes; and classifying the candidate variant as a somatic variant based on the candidate variant segregating into the portion of the phased nucleotide reads.

23. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: generate, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; phase the nucleotide reads for the target genomic region of the genomic sample according to a set of parental haplotypes; determine, from the phased nucleotide reads for the target genomic region of the genomic sample, a candidate variant segregates into a portion of the phased nucleotide reads for one parental haplotype of the set of parental haplotypes; and classify the candidate variant as a somatic variant based on the candidate variant segregating into the portion of the phased nucleotide reads.

24. A system comprising: at least one processor; and a non-transitory computer-readable medium storing instructions that, when executed by the at least one processor, cause the system to: generate, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; identify a subset of germline variants located within a threshold number of nucleobases of a candidate variant of the variant calls; determine that a VAF of the candidate variant deviates from variant allele fractions (VAFs) of the subset of germline variants within a deviation threshold; and classify the candidate variant as a somatic variant based on the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

25. A computer-implemented method comprising: generating, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; identifying a subset of germline variants located within a threshold number of nucleobases of a candidate variant of the variant calls; determining that a VAF of the candidate variant deviates from variant allele fractions (VAFs) of the subset of germline variants within a deviation threshold; and classifying the candidate variant as a somatic variant based on the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.

26. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: generate, based on nucleotide reads from a sequencing device and for a target genomic region of a genomic sample, variant calls for the target genomic region of the genomic sample; identify a subset of germline variants located within a threshold number of nucleobases of a candidate variant of the variant calls; determine that a VAF of the candidate variant deviates from variant allele fractions (VAFs) of the subset of germline variants within a deviation threshold; and classify the candidate variant as a somatic variant based on the VAF of the candidate variant deviating from the VAF of the subset of germline variants by the deviation threshold.
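
For illustration only, the three determinations recited in claims 15 to 17 and 21 to 26 (threshold VAFs, segregation into a portion of the phased reads for one parental haplotype, and deviation from the VAFs of nearby germline variants) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The `Variant` record, all function names, the use of a median to summarize local germline VAFs, the treatment of an empty germline subset, and the numeric values in the usage lines are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from statistics import median
from typing import Sequence


@dataclass(frozen=True)
class Variant:
    """Hypothetical record for a variant call; field names are assumptions."""
    position: int        # reference coordinate of the variant
    vaf: float           # variant allele fraction among all reads
    hap1_alt: int = 0    # variant-supporting reads phased to haplotype 1
    hap2_alt: int = 0    # variant-supporting reads phased to haplotype 2
    hap1_total: int = 0  # all reads phased to haplotype 1 at this site
    hap2_total: int = 0  # all reads phased to haplotype 2 at this site


def satisfies_vaf_thresholds(vaf: float, lower: float, upper: float) -> bool:
    # Assumed reading of claims 15 and 16: the VAF is at or below a lower
    # threshold, or at or above an upper threshold.
    return vaf <= lower or vaf >= upper


def segregates_into_portion(v: Variant) -> bool:
    # Assumed reading: variant-supporting reads fall on exactly one parental
    # haplotype, and only some (not all) reads of that haplotype carry it.
    if v.hap1_alt > 0 and v.hap2_alt == 0:
        return v.hap1_alt < v.hap1_total
    if v.hap2_alt > 0 and v.hap1_alt == 0:
        return v.hap2_alt < v.hap2_total
    return False


def deviates_from_local_germline(candidate: Variant,
                                 germline: Sequence[Variant],
                                 window: int,
                                 deviation_threshold: float) -> bool:
    # Subset of germline variants within `window` nucleobases of the candidate.
    nearby = [g.vaf for g in germline
              if abs(g.position - candidate.position) <= window]
    if not nearby:
        return False  # assumption: no local germline variants, no deviation call
    # Assumption: the median summarizes the local germline VAFs.
    return abs(candidate.vaf - median(nearby)) >= deviation_threshold


def classify_somatic(candidate: Variant, germline: Sequence[Variant], *,
                     lower: float, upper: float,
                     window: int, deviation_threshold: float) -> bool:
    # Combined test in the style of claim 17: all three determinations hold.
    return (satisfies_vaf_thresholds(candidate.vaf, lower, upper)
            and segregates_into_portion(candidate)
            and deviates_from_local_germline(candidate, germline,
                                             window, deviation_threshold))


# Usage with made-up numbers.
germline = [Variant(900, 0.50), Variant(1100, 0.48), Variant(50_000, 0.90)]
low_vaf = Variant(1000, 0.08, hap1_alt=8, hap1_total=50, hap2_total=50)
het_like = Variant(1000, 0.50, hap1_alt=50, hap1_total=50, hap2_total=50)
params = dict(lower=0.3, upper=0.7, window=1000, deviation_threshold=0.2)
print(classify_somatic(low_vaf, germline, **params))
print(classify_somatic(het_like, germline, **params))
```

Keeping each determination in its own function mirrors the claim structure: claims 21 to 23 rely on the phasing test alone, claims 24 to 26 rely on the local germline deviation test alone, and claim 17 combines all three.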