Methods for cfDNA methylation data modeling using transformer neural network architecture
A transformer neural network architecture improves cancer detection in liquid biopsies by analyzing methylation patterns in nucleic acids, enhancing sensitivity and accuracy in tumor classification.
Patent Information
- Authority / Receiving Office: WO · WO
- Patent Type: Applications
- Current Assignee / Owner: GUARDANT HEALTH INC
- Filing Date: 2025-12-17
- Publication Date: 2026-06-25
AI Technical Summary
Existing cancer detection methods using liquid biopsies are hindered by the low concentration and heterogeneity of nucleic acids in body fluids, making it difficult to accurately classify samples for tumor-derived DNA with high sensitivity.
A computational framework utilizing a transformer neural network architecture is employed to analyze methylation patterns in nucleic acid molecules, training a machine learning model to predict the presence of tumors by generating tokens for genomic regions and performing classification processes based on sequencing data.
The approach enhances the sensitivity and accuracy of cancer detection by classifying tumor-derived DNA, enabling non-invasive early detection and monitoring of treatment efficacy through machine learning models.
Description
Attorney Ref. No.: GH0230WO
METHODS FOR CFDNA METHYLATION DATA MODELING USING TRANSFORMER NEURAL NETWORK ARCHITECTURE
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The present application claims priority to, and incorporates by reference in its entirety for all purposes, patent application number 63/735,137, filed December 17, 2025.
BACKGROUND
[0002] Cancer is a major cause of disease worldwide. Each year, tens of millions of people are diagnosed with cancer around the world, and more than half eventually die from it. In many countries, cancer ranks as the second most common cause of death, after cardiovascular diseases. Early detection is associated with improved outcomes for many cancers.
[0003] Cancer can be caused by the accumulation of genetic variations within an individual's normal cells, at least some of which result in improperly regulated cell division. Such variations commonly include copy number variations (CNVs), single nucleotide variations (SNVs), gene fusions, insertions and/or deletions (indels), and epigenetic variations, including 5-methylation of cytosine (5-methylcytosine) and association of DNA with chromatin and transcription factors.
[0004] Cancers are often detected by biopsies of tumors followed by analysis of cells, markers or DNA extracted from cells. But more recently it has been proposed that cancers can also be detected from cell-free nucleic acids in body fluids, such as blood or urine. Such tests have the advantage that they are noninvasive and can be performed without identifying suspected cancer cells in biopsy. However, such tests are complicated by the fact that the amount of nucleic acids in body fluids is very low and what nucleic acids are present are heterogeneous in form (e.g., RNA and DNA, single-stranded and double-stranded, and various states of post-replication modification and association with proteins, such as histones).
[0005] Thus, there is a need for improved systems and methods for cancer detection using liquid biopsy assays. Therefore, it is an object of the disclosure to provide computer-implemented systems and methods that have improved capability to classify a sample as containing tumor-derived DNA with heightened sensitivity.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate certain implementations, and together with the written description, serve to explain certain principles of the methods, computer readable media, and systems disclosed herein. The description provided herein is better understood when read in conjunction with the accompanying drawings which are included by way of example and not by way of limitation. It will be understood that like reference numerals identify like components throughout the drawings, unless the context indicates otherwise. It will also be understood that some or all of the figures may be schematic representations for purposes of illustration and do not necessarily depict the actual relative sizes or locations of the elements shown.
[0007] Figure 1 is a diagrammatic representation of an example computational architecture that implements a generative machine learning model to detect the presence of a biological condition in subjects, according to one or more example implementations.
[0008] Figure 2 is a diagrammatic representation of an example framework to produce tokens for a generative machine learning model based on classification region quantitative measures, according to one or more example implementations.
[0009] Figure 3 is a diagrammatic representation of an example framework to implement a transformer machine learning architecture and a classification model to detect the presence of a biological condition in subjects, according to one or more example implementations.
[0010] Figure 4 is a diagrammatic representation of a framework to train a transformer-based machine learning architecture and a classification model to detect the presence of a tumor in a subject, in accordance with one or more example implementations.
[0011] Figure 5 illustrates a framework to determine classification regions to be used to identify sequence representations to be analyzed to determine an indication of a biological condition for subjects, according to one or more example implementations.
[0012] Figure 6 is a diagrammatic representation of a process to implement a generative machine learning architecture to detect the presence of a tumor in subjects, in accordance with one or more example implementations.
[0013] Figure 7 is a block diagram illustrating components of a machine, in the form of a computer system, that may read and execute instructions from one or more machine-readable media to perform any one or more methodologies described herein, in accordance with one or more example implementations.
[0014] Figure 8 is a block diagram illustrating a representative software architecture that may be used in conjunction with one or more hardware architectures described herein, in accordance with one or more example implementations.
SUMMARY
[0015] In one aspect, a method includes obtaining, by a computing system including one or more computing devices each including processing resources and memory, training data, the training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying a methylation criterion corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules and the amount of nucleic acid molecules being derived from a plurality of first training samples, performing, by the computing system and based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions, the tokens indicating a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions, determining, by the computing system, output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model, and performing, by the computing system and based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to one or more biological conditions being present in one or more subjects, the second training process being performed using sequence representations derived from second training samples.
[0016] The method may also include obtaining, by the computing system, sequencing data derived from a number of samples, the sequencing data corresponding to nucleic acid molecules included in the number of samples, and determining, by the computing system and based on the sequencing data, quantitative measures for a number of genomic regions, individual quantitative measures corresponding to a number of the nucleic acid molecules corresponding to an individual genomic region of the number of genomic regions.
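The per-region quantitative measure described above can be illustrated with a minimal sketch. It assumes each sequenced molecule has already been reduced to a hypothetical record (chromosome, start, end, number of methylated CpGs) and that the methylation criterion is a simple minimum count of methylated CpGs; the source does not fix either choice, and the names `Molecule`, `Region`, and `min_methylated_cpgs` are illustrative only.

```python
from typing import NamedTuple


class Molecule(NamedTuple):
    chrom: str
    start: int            # 0-based, inclusive
    end: int              # exclusive
    methylated_cpgs: int  # methylated CpGs observed on this molecule


class Region(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def overlaps(molecule: Molecule, region: Region) -> bool:
    """True when the molecule and the region share at least one base."""
    return (
        molecule.chrom == region.chrom
        and molecule.start < region.end
        and region.start < molecule.end
    )


def region_counts(molecules, regions, min_methylated_cpgs=3):
    """Count molecules per region that meet the assumed methylation criterion."""
    counts = {region.name: 0 for region in regions}
    for molecule in molecules:
        if molecule.methylated_cpgs < min_methylated_cpgs:
            continue  # fails the hypothetical criterion
        for region in regions:
            if overlaps(molecule, region):
                counts[region.name] += 1
    return counts
```

A molecule spanning two regions is counted once in each; whether that matches the intended counting rule is not stated in the source.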
[0017] The method may also include determining, by the computing system and based on the sequencing data, mutant allele fractions for the number of genomic regions, and determining, by the computing system, an additional quantitative measure for individual genomic regions of the number of genomic regions, the additional quantitative measure indicating a level of correlation between the individual quantitative measure and the mutant allele fraction for the individual genomic region.
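One way to picture the "level of correlation" is a per-region Pearson coefficient computed across samples. This is a sketch under assumptions: the source does not name the correlation statistic, and the inputs `measures_by_region` and `maf_by_region` (one value per sample, in the same sample order) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; report 0
    return cov / math.sqrt(var_x * var_y)


def region_maf_correlation(measures_by_region, maf_by_region):
    """Correlate each region's quantitative measure with its mutant allele fraction."""
    return {
        region: pearson(measures, maf_by_region[region])
        for region, measures in measures_by_region.items()
    }
```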
[0018] The method may also include determining, by the computing system, a first subset of the number of regions having quantitative measures that are at least a first threshold value, determining, by the computing system, a second subset of the number of regions having additional quantitative measures that are at least a second threshold value, and determining, by the computing system, the plurality of regions related to the training data by combining the first subset of the number of regions and the second subset of the number of regions.
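The two-threshold selection above can be sketched as a set union. Reading "combining" as a union is an assumption, as are the function and parameter names.

```python
def select_regions(measures, correlations, measure_threshold, correlation_threshold):
    """Return regions passing either threshold (union of the two subsets)."""
    # First subset: quantitative measure is at least the first threshold.
    first_subset = {r for r, value in measures.items() if value >= measure_threshold}
    # Second subset: correlation measure is at least the second threshold.
    second_subset = {r for r, value in correlations.items() if value >= correlation_threshold}
    return sorted(first_subset | second_subset)
```

For example, `select_regions({"r1": 12.0, "r2": 3.0}, {"r1": 0.2, "r2": 0.9}, 10, 0.8)` keeps both regions: `r1` through its measure and `r2` through its correlation.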
[0019] The method may also include where determining the quantitative measures includes for individual genomic regions of the plurality of genomic regions, determining, by the computing system, a first number of the amount of nucleic acid molecules that correspond to the individual genomic regions, for individual control genomic regions, determining, by the computing system, a second number of the amount of nucleic acid molecules that correspond to the individual control genomic regions, where the individual control genomic regions include genomic regions having a minimum number of methylated cytosine-guanine dinucleotides in subjects in which a tumor is not present, and performing, by the computing system, a transformation of the first number of the amount of nucleic acid molecules with respect to the second number of the amount of nucleic acid molecules.
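The source does not specify the transformation applied to the region count with respect to the control-region count. As one hedged illustration, the sketch below uses a log2 ratio against the summed control counts, with a pseudocount to avoid taking the logarithm of zero; the choice of log ratio, the pseudocount, and all names are assumptions.

```python
import math


def normalize_counts(region_counts, control_counts, pseudocount=1.0):
    """Log2 ratio of each region count to the total control-region count."""
    control_total = sum(control_counts.values())
    return {
        region: math.log2((count + pseudocount) / (control_total + pseudocount))
        for region, count in region_counts.items()
    }
```

Normalizing against regions that are consistently methylated in tumor-free subjects is intended to make measures comparable across samples with different numbers of recovered molecules.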
[0020] The method may also include for individual regions of the plurality of regions, determining, by the computing system, a range of values of the quantitative measures that correspond to the individual regions, and determining, by the computing system, a subset of the values of the quantitative measures that correspond to individual partitions of a number of partitions corresponding to the individual regions.
[0021] The method may also include where the number of partitions are distributed such that individual partitions of the number of partitions correspond to a same number of the amount of nucleic acid molecules included in the training data.
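Partitions that each hold the same share of the training data correspond to equal-frequency (quantile) binning. The following is a minimal sketch, assuming the partitions for one region are defined by cut points taken from that region's sorted training values; the function names are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_edges(training_values, n_partitions):
    """Return n_partitions - 1 cut points splitting the sorted values into
    partitions that hold, as nearly as possible, the same number of values."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    ordered = sorted(training_values)
    if not ordered:
        raise ValueError("training_values must not be empty")
    return [ordered[(k * len(ordered)) // n_partitions] for k in range(1, n_partitions)]


def partition_index(value, edges):
    """Partition i holds values with edges[i - 1] <= value < edges[i]."""
    return bisect_right(edges, value)
```

With many tied values (for example, many zero counts), some cut points coincide and occupancy becomes unequal; how the source handles ties is not stated.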
[0022] The method may also include for individual first training samples of the plurality of first training samples: for individual genomic regions of the plurality of genomic regions, determining, by the computing system, a partition of the number of partitions based on a number of the amount of nucleic acid molecules derived from the individual first training sample that correspond to the individual genomic region, and determining a number of tokens, individual tokens of the number of tokens corresponding to the partition of the number of partitions for the individual genomic region, where the training data includes the number of tokens for the plurality of genomic regions for the plurality of first training samples.
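Token generation for one sample can then be sketched as a lookup of each region's measure in that region's partition cut points. The integer token layout used here (`region_index * n_partitions + partition`) is an assumption chosen so that every region and partition pair gets a distinct id; the source only states that a token corresponds to the partition for the region.

```python
from bisect import bisect_right


def tokenize_sample(sample_measures, edges_by_region, region_order, n_partitions):
    """Map one sample's per-region measures to one integer token per region."""
    tokens = []
    for region_index, region in enumerate(region_order):
        edges = edges_by_region[region]
        if len(edges) != n_partitions - 1:
            raise ValueError(f"region {region!r} needs {n_partitions - 1} cut points")
        # Regions with no observed molecules are treated as a measure of 0.
        partition = bisect_right(edges, sample_measures.get(region, 0))
        tokens.append(region_index * n_partitions + partition)
    return tokens
```

Keeping `region_order` fixed across samples means position in the token sequence always refers to the same genomic region.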
[0023] The method may also include where the second training process is performed with respect to labeled training data that includes first sequence representations derived from a first portion of the second training samples and second sequence representations derived from a second portion of the second training samples, the first portion of the second training samples being derived from first subjects in which a tumor is not detected and the second portion of the second training samples being derived from second subjects in which one or more cancer types or subtypes have been detected.
[0024] The method may also include where the one or more classifications include a first classification indicating that a tumor is present in a subject and a second classification indicating that a tumor is not present in a subject.
[0025] The method may also include where the one or more classifications include a first classification indicating a first cancer type and a second classification indicating a second cancer type.
[0026] The method may also include where the one or more classifications correspond to one or more tumor fraction values.
[0027] The method may also include where the one or more classifications correspond to a level of homology directed repair with respect to a test subject.
[0028] The method may also include obtaining, by the computing system, first test sequencing data derived from one or more first test samples obtained from a test subject at a first time, determining, by the computing system, a first classification for the test subject based on implementing the second machine learning model with respect to the first test sequencing data, obtaining, by the computing system, second test sequencing data derived from one or more second test samples obtained from the test subject at a second time, and determining, by the computing system, a second classification for the test subject based on implementing the second machine learning model with respect to the second test sequencing data.
[0029] The method may also include determining an amount of progression of cancer or an amount of regression of cancer based on an amount of difference between the first classification and the second classification.
[0030] The method may also include where the first time is before administering one or more treatments to the test subject and the second time is after administering the one or more treatments to the test subject.
[0031] The method may also include determining a level of effectiveness of the one or more treatments based on an amount of difference between the first classification and the second classification.
[0032] The method may also include determining an indication of minimal residual disease based on an amount of difference between the first classification and the second classification.
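The longitudinal comparison in the preceding paragraphs can be sketched as a signed difference between two model outputs. This assumes each classification is a numeric score (for instance, an estimated tumor fraction) and introduces a hypothetical `tolerance` below which a change is treated as no change; neither detail comes from the source, and the sketch is not a clinical decision rule.

```python
def compare_timepoints(first_score, second_score, tolerance=0.0):
    """Return the signed change and a coarse trend label between two time points."""
    delta = second_score - first_score
    if delta > tolerance:
        trend = "progression"
    elif delta < -tolerance:
        trend = "regression"
    else:
        trend = "stable"
    return delta, trend
```

When the first sample is taken before treatment and the second after, a negative change of this kind is the sort of signal the treatment-effectiveness aspect would draw on.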
[0033] The method may also include where the one or more classifications correspond to a presence of one or more genomic mutations present in nucleic acid molecules derived from samples obtained from subjects.
[0034] The method may also include where the second machine learning model implements one or more statistical classification models or one or more machine learning classification models that are different from the number of transformer blocks of the first machine learning model.
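As one example of a statistical classifier that differs from the transformer blocks, the sketch below fits a logistic regression on fixed-length feature vectors. It assumes the output activations of the first trained model have already been pooled into one vector per sample and are held fixed; the pooling step, the choice of logistic regression, and the training hyperparameters are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_head(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict_proba(weights, bias, x):
    """Probability that the sample belongs to the positive (tumor present) class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

Because the feature vectors are fixed, only the small classifier is trained on the labeled second training samples.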
[0035] The method may also include where the first number of neurons corresponds to a number of tokens produced based on sequencing data obtained from one or more subjects and the second number of neurons corresponds to the one or more classifications.
[0036] The method may also include where: the number of transformer blocks of the first machine learning model are arranged in a sequence having a first transformer block and a last transformer block with output activations from an individual transformer block being provided as input activations to a next transformer block in the sequence, the number of transformer blocks include a first number of neurons, the second machine learning model includes an additional transformer block having input activations that correspond to output activations of a next to last transformer block of the sequence, and the additional transformer block includes a second number of neurons that is different from the first number of neurons.
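The wiring described above (blocks chained in sequence, with an additional block of a different width fed from the next-to-last block) can be sketched without a deep learning library. The layers below are plain dense layers with ReLU standing in for transformer blocks; they are not transformer blocks, and `make_dense`, `forward_with_activations`, and `classify` are hypothetical names used only to show the data flow.

```python
def make_dense(weights, bias):
    """Return a layer computing relu(W x + b); a toy stand-in for one block."""
    def layer(x):
        return [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)
        ]
    return layer


def forward_with_activations(blocks, x):
    """Run the blocks in sequence and keep every block's output activations."""
    activations = []
    for block in blocks:
        x = block(x)  # output of one block is the input of the next
        activations.append(x)
    return activations


def classify(blocks, head, x):
    """Feed the next-to-last block's activations to the additional head."""
    activations = forward_with_activations(blocks, x)
    if len(activations) < 2:
        raise ValueError("need at least two blocks")
    return head(activations[-2])
```

In this sketch the head's output width (its number of rows) plays the role of the second number of neurons, one per classification, while the original last block's width is left unchanged and simply bypassed.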
[0037] The method may also include after performing the second training process obtaining, by the computing system, test sequencing data derived from a test sample obtained from a test subject, determining, by the computing system and based on the test sequencing data, quantitative measures for the plurality of genomic regions, individual quantitative measures indicating a number of nucleic acid molecules included in the test sample that (i) correspond to an individual genomic region of the plurality of genomic regions and (ii) have a number of methylated CpGs that correspond to the threshold amount of CpGs, determining, by the computing system and based on the quantitative measures, tokens for the test sample, individual tokens indicating a partition of a plurality of partitions for an individual genomic region of the plurality of genomic regions, providing, by the computing system, the tokens as input to the second machine learning model, and determining, by the computing system and based on implementing the second machine learning model with respect to the tokens, a classification of the one or more classifications for the test subject.
[0038] In one or more aspects, a computing apparatus includes one or more processors and memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising obtaining training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying a methylation criterion corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules and the amount of nucleic acid molecules being derived from a plurality of first training samples; performing, based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions, the tokens indicating a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions; determining output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model; and performing, based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to one or more biological conditions being present in one or more subjects, the second training process being performed using sequence representations derived from second training samples.
[0039] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising obtaining sequencing data derived from a number of samples, the sequencing data corresponding to nucleic acid molecules included in the number of samples, and determining, based on the sequencing data, quantitative measures for a number of genomic regions, individual quantitative measures corresponding to a number of the nucleic acid molecules corresponding to an individual genomic region of the number of genomic regions.
[0040] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising determining, based on the sequencing data, mutant allele fractions for the number of genomic regions, and determining an additional quantitative measure for individual genomic regions of the number of genomic regions, the additional quantitative measure indicating a level of correlation between the individual quantitative measure and the mutant allele fraction for the individual genomic region.
[0041] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising determining a first subset of the number of regions having quantitative measures that are at least a first threshold value, determining a second subset of the number of regions having additional quantitative measures that are at least a second threshold value, and determining the plurality of regions related to the training data by combining the first subset of the number of regions and the second subset of the number of regions.
[0042] The computing apparatus may also include where determining the quantitative measures includes for individual genomic regions of the plurality of genomic regions, determining a first number of the amount of nucleic acid molecules that correspond to the individual genomic regions, for individual control genomic regions, determining a second number of the amount of nucleic acid molecules that correspond to the individual control genomic regions, where the individual control genomic regions include genomic regions having a minimum number of methylated cytosine-guanine dinucleotides in subjects in which a tumor is not present, and performing a transformation of the first number of the amount of nucleic acid molecules with respect to the second number of the amount of nucleic acid molecules.
[0043] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising for individual regions of the plurality of regions, determining a range of values of the quantitative measures that correspond to the individual regions, and determining a subset of the values of the quantitative measures that correspond to individual partitions of a number of partitions corresponding to the individual regions.
[0044] The computing apparatus may also include where the number of partitions are distributed such that individual partitions of the number of partitions correspond to a same number of the amount of nucleic acid molecules included in the training data.
[0045] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising for individual first training samples of the plurality of first training samples: for individual genomic regions of the plurality of genomic regions, determining a partition of the number of partitions based on a number of the amount of nucleic acid molecules derived from the individual first training sample that correspond to the individual genomic region, and determining a number of tokens, individual tokens of the number of tokens corresponding to the partition of the number of partitions for the individual genomic region, where the training data includes the number of tokens for the plurality of genomic regions for the plurality of first training samples.
[0046] The computing apparatus may also include where the second training process is performed with respect to labeled training data that includes first sequence representations derived from a first portion of the second training samples and second sequence representations derived from a second portion of the second training samples, the first portion of the second training samples being derived from first subjects in which a tumor is not detected and the second portion of the second training samples being derived from second subjects in which one or more cancer types or subtypes have been detected.
[0047] The computing apparatus may also include where the one or more classifications include a first classification indicating that a tumor is present in a subject and a second classification indicating that a tumor is not present in a subject.
[0048] The computing apparatus may also include where the one or more classifications include a first classification indicating a first cancer type and a second classification indicating a second cancer type.
[0049] The computing apparatus may also include where the one or more classifications correspond to one or more tumor fraction values.
[0050] The computing apparatus may also include where the one or more classifications correspond to a level of homology directed repair with respect to a test subject.
[0051] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising obtaining first test sequencing data derived from one or more first test samples obtained from a test subject at a first time, determining a first classification for the test subject based on implementing the second machine learning model with respect to the first test sequencing data, obtaining second test sequencing data derived from one or more second test samples obtained from the test subject at a second time, and determining a second classification for the test subject based on implementing the second machine learning model with respect to the second test sequencing data.
[0052] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising determining an amount of progression of cancer or an amount of regression of cancer based on an amount of difference between the first classification and the second classification.
[0053] The computing apparatus may also include where the first time is before administering one or more treatments to the test subject and the second time is after administering the one or more treatments to the test subject.
[0054] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising determining a level of effectiveness of the one or more treatments based on an amount of difference between the first classification and the second classification.
[0055] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising determining an indication of minimal residual disease based on an amount of difference between the first classification and the second classification.
[0056] The computing apparatus may also include where the one or more classifications correspond to a presence of one or more genomic mutations present in nucleic acid molecules derived from samples obtained from subjects.
[0057] The computing apparatus may also include where the second machine learning model implements one or more statistical classification models or one or more machine learning classification models that are different from the number of transformer blocks of the first machine learning model.
[0058] The computing apparatus may also include where: the number of transformer blocks of the first machine learning model are arranged in a sequence having a first transformer block and a last transformer block with output activations from an individual transformer block being provided as input activations to a next transformer block in the sequence, the number of transformer blocks include a first number of neurons, the second machine learning model includes an additional transformer block having input activations that correspond to output activations of a next to last transformer block of the sequence, and the additional transformer block includes a second number of neurons that is different from the first number of neurons.
[0059] The computing apparatus may also include where the first number of neurons corresponds to a number of tokens produced based on sequencing data obtained from one or more subjects and the second number of neurons corresponds to the one or more classifications.
[0060] The computing apparatus may also include additional computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform additional operations comprising after performing the second training process obtaining test sequencing data derived from a test sample obtained from a test subject, determining, based on the test sequencing data, quantitative measures for the plurality of genomic regions, individual quantitative measures indicating a number of nucleic acid molecules included in the test sample that (i) correspond to an individual genomic region of the plurality of genomic regions and (ii) have a number of methylated CpGs that correspond to the threshold amount of CpGs, determining, based on the quantitative measures, tokens for the test sample, individual tokens indicating a partition of a plurality of partitions for an individual genomic region of the plurality of genomic regions, providing the tokens as input to the second machine learning model, and determining, based on implementing the second machine learning model with respect to the tokens, a classification of the one or more classifications for the test subject. Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
[0061] In one or more aspects, one or more non-transitory computer-readable storage media include computer-readable instructions that, when executed by a computer, cause the computer to perform operations comprising obtaining training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying a methylation criterion corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules and the amount of nucleic acid molecules being derived from a plurality of first training samples; performing, based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions, the tokens indicating a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions; determining output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model; and performing, based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to one or more biological conditions being present in one or more subjects, the second training process being performed using sequence representations derived from second training samples.
[0062] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising obtaining sequencing data derived from a number of samples, the sequencing data corresponding to nucleic acid molecules included in the number of samples, and determining, based on the sequencing data, quantitative measures for a number of genomic regions, individual quantitative measures corresponding to a number of the nucleic acid molecules corresponding to an individual genomic region of the number of genomic regions.
[0063] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising determining, based on the sequencing data, mutant allele fractions for the number of genomic regions, and determining an additional quantitative measure for individual genomic regions of the number of genomic regions, the additional quantitative measure indicating a level of correlation between the individual quantitative measure and the mutant allele fraction for the individual genomic region.
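As an illustrative sketch only, the level of correlation described above could be computed per genomic region across samples. The disclosure does not name a specific statistic; a Pearson coefficient is assumed here, and the function name and example values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant series is treated as uncorrelated.
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical data: molecule counts for one region in four samples and the
# mutant allele fraction (MAF) measured in the same four samples.
region_counts = [2, 4, 6, 8]
sample_mafs = [0.01, 0.02, 0.03, 0.04]
correlation = pearson(region_counts, sample_mafs)
```

A region whose counts rise and fall with the mutant allele fraction yields a coefficient near 1 under this assumption.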
[0064] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising determining a first subset of the number of regions having quantitative measures that are at least a first threshold value, determining a second subset of the number of regions having additional quantitative measures that are at least a second threshold value, and determining the plurality of regions related to the training data by combining the first subset of the number of regions and the second subset of the number of regions.
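The region selection described above can be sketched as follows, assuming that "combining" the two subsets means taking their set union (the disclosure does not specify the combination rule). The function name, region identifiers, and threshold values are hypothetical.

```python
def select_regions(measures, additional_measures, first_threshold, second_threshold):
    """Return the sorted union of regions that pass either threshold.

    measures:            region -> quantitative measure (e.g., molecule count)
    additional_measures: region -> additional quantitative measure
                         (e.g., correlation with mutant allele fraction)
    """
    first_subset = {r for r, v in measures.items() if v >= first_threshold}
    second_subset = {r for r, v in additional_measures.items() if v >= second_threshold}
    # Assumption: the two subsets are combined by set union.
    return sorted(first_subset | second_subset)


# Hypothetical example values.
selected = select_regions(
    {"r1": 50, "r2": 5, "r3": 3},
    {"r1": 0.1, "r2": 0.9, "r3": 0.2},
    first_threshold=10,
    second_threshold=0.8,
)
```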
[0065] The computer-readable storage medium may also include where determining the quantitative measures includes for individual genomic regions of the plurality of genomic regions, determining a first number of the amount of nucleic acid molecules that correspond to the individual genomic regions, for individual control genomic regions, determining a second number of the amount of nucleic acid molecules that correspond to the individual control genomic regions, where the individual control genomic regions include genomic regions having a minimum number of methylated cytosine-guanine dinucleotides in subjects in which a tumor is not present, and performing a transformation of the first number of the amount of nucleic acid molecules with respect to the second number of the amount of nucleic acid molecules.
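One possible form of the transformation described above is a log ratio of the region count to the mean control-region count. This is a sketch under stated assumptions: the disclosure does not specify the transformation, and the log2 ratio, the pseudocount, and the function name are all hypothetical choices.

```python
from math import log2


def normalize_count(region_count, control_counts, pseudocount=1.0):
    """Log2 ratio of a region's molecule count to the mean control-region count.

    The pseudocount (an assumption) avoids division by zero and log of zero.
    """
    if not control_counts:
        raise ValueError("at least one control region count is required")
    control_mean = sum(control_counts) / len(control_counts)
    return log2((region_count + pseudocount) / (control_mean + pseudocount))


# Hypothetical example: 7 molecules in the region, 3 in each control region.
normalized = normalize_count(7, [3, 3])
```

Normalizing against control regions in this way makes counts comparable across samples with different total molecule yields.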
[0066] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising for individual regions of the plurality of regions, determining a range of values of the quantitative measures that correspond to the individual regions, and determining a subset of the values of the quantitative measures that correspond to individual partitions of a number of partitions corresponding to the individual regions.
[0067] The computer-readable storage medium may also include where the number of partitions are distributed such that individual partitions of the number of partitions correspond to a same number of the amount of nucleic acid molecules included in the training data.
[0068] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising for individual first training samples of the plurality of first training samples: for individual genomic regions of the plurality of genomic regions, determining a partition of the number of partitions based on a number of the amount of nucleic acid molecules derived from the individual first training sample that correspond to the individual genomic region, and determining a number of tokens, individual tokens of the number of tokens corresponding to the partition of the number of partitions for the individual genomic region, where the training data includes the number of tokens for the plurality of genomic regions for the plurality of first training samples.
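The partitioning and token generation described in the preceding paragraphs can be sketched as equal-frequency (quantile) binning per region, with the token for a sample and region being the index of the partition that contains the sample's count. This is a minimal sketch, not the disclosed implementation; the function names, the edge-selection rule, and the example values are assumptions.

```python
from bisect import bisect_right


def quantile_edges(values, n_partitions):
    """Interior edges so each partition holds about the same number of values."""
    if n_partitions < 2 or not values:
        raise ValueError("need at least 2 partitions and at least 1 value")
    ordered = sorted(values)
    return [ordered[(k * len(ordered)) // n_partitions] for k in range(1, n_partitions)]


def tokenize(value, edges):
    """Token = index of the partition containing value (0 .. len(edges))."""
    return bisect_right(edges, value)


def tokens_for_sample(sample_counts, edges_by_region):
    """One token per region, in sorted region order."""
    return [tokenize(sample_counts[r], edges_by_region[r]) for r in sorted(edges_by_region)]


# Hypothetical training counts per region across training samples.
training_counts = {"r1": [1, 2, 3, 4, 5, 6, 7, 8], "r2": [10, 20, 30, 40]}
edges_by_region = {r: quantile_edges(v, 4) for r, v in training_counts.items()}

# Tokens for a hypothetical new sample.
sample_tokens = tokens_for_sample({"r1": 6, "r2": 25}, edges_by_region)
```

Because edges are taken from the training data, values outside the observed range fall into the first or last partition rather than producing an out-of-vocabulary token.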
[0069] The computer-readable storage medium may also include where the second training process is performed with respect to labeled training data that includes first sequence representations derived from a first portion of the second training samples and second sequence representations derived from a second portion of the second training samples, the first portion of the second training samples being derived from first subjects in which a tumor is not detected and the second portion of the second training samples being derived from second subjects in which one or more cancer types or subtypes have been detected.
[0070] The computer-readable storage medium may also include where the one or more classifications include a first classification indicating that a tumor is present in a subject and a second classification indicating that a tumor is not present in a subject.
[0071] The computer-readable storage medium may also include where the one or more classifications include a first classification indicating a first cancer type and a second classification indicating a second cancer type. The computer-readable storage medium may also include where the one or more classifications correspond to one or more tumor fraction values.
[0072] The computer-readable storage medium may also include where the one or more classifications correspond to a level of homology directed repair with respect to a test subject.
[0073] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising obtaining first test sequencing data derived from one or more first test samples obtained from a test subject at a first time, determining a first classification for the test subject based on implementing the second machine learning model with respect to the first test sequencing data, obtaining second test sequencing data derived from one or more second test samples obtained from the test subject at a second time, and determining a second classification for the test subject based on implementing the second machine learning model with respect to the second test sequencing data.
[0074] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising determining an amount of progression of cancer or an amount of regression of cancer based on an amount of difference between the first classification and the second classification.
[0075] The computer-readable storage medium may also include where the first time is before administering one or more treatments to the test subject and the second time is after administering the one or more treatments to the test subject.
[0076] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising determining a level of effectiveness of the one or more treatments based on an amount of difference between the first classification and the second classification.
[0077] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising determining an indication of minimum residual disease based on an amount of difference between the first classification and the second classification.
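The comparison of a first and a second classification over time can be sketched as follows, assuming each classification is expressed as a numeric score such as a predicted tumor fraction. The tolerance value, the labels, and the function name are hypothetical and are not taken from the disclosure.

```python
def score_change(first_score, second_score, tolerance=0.05):
    """Return (difference, trend) between two time points.

    Assumption: a higher score indicates more tumor-derived signal, and
    differences within the tolerance are treated as no change.
    """
    delta = second_score - first_score
    if delta > tolerance:
        trend = "progression"
    elif delta < -tolerance:
        trend = "regression"
    else:
        trend = "stable"
    return delta, trend


# Hypothetical scores before and after treatment.
delta, trend = score_change(0.30, 0.10)
```

Under these assumptions, a negative difference after treatment could serve as one indicator of treatment effectiveness.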
[0078] The computer-readable storage medium may also include where the one or more classifications correspond to a presence of one or more genomic mutations present in nucleic acid molecules derived from samples obtained from subjects.
[0079] The computer-readable storage medium may also include where the second machine learning model implements one or more statistical classification models or one or more machine learning classification models that are different from the number of transformer blocks of the first machine learning model.
[0080] The computer-readable storage medium may also include where: the number of transformer blocks of the first machine learning model are arranged in a sequence having a first transformer block and a last transformer block with output activations from an individual transformer block being provided as input activations to a next transformer block in the sequence, the number of transformer blocks include a first number of neurons, the second machine learning model includes an additional transformer block having input activations that correspond to output activations of a next to last transformer block of the sequence, and the additional transformer block includes a second number of neurons that is different from the first number of neurons.
[0081] The computer-readable storage medium may also include where the first number of neurons corresponds to a number of tokens produced based on sequencing data obtained from one or more subjects and the second number of neurons corresponds to the one or more classifications.
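The data flow described in the two preceding paragraphs, in which output activations of the next-to-last block in a sequence feed an additional block sized to the classifications, can be sketched with stand-in functions. This toy example only illustrates the wiring; the stand-in blocks are not transformer blocks, and every name, weight, and value is hypothetical.

```python
def run_blocks(blocks, inputs):
    """Run blocks in sequence and return the output activations of every block."""
    activations = []
    current = inputs
    for block in blocks:
        current = block(current)  # output of one block is input to the next
        activations.append(current)
    return activations


def classification_head(features, weights, bias):
    """Hypothetical linear head producing one score per classification."""
    return [sum(w * f for w, f in zip(row, features)) + b for row, b in zip(weights, bias)]


# Stand-in blocks; a real model would use attention and feed-forward layers.
blocks = [
    lambda xs: [x + 1.0 for x in xs],
    lambda xs: [x * 2.0 for x in xs],
    lambda xs: [x - 3.0 for x in xs],
]

activations = run_blocks(blocks, [0.0, 1.0])
penultimate = activations[-2]  # output activations of the next-to-last block
scores = classification_head(penultimate, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Here the last pretrained block, whose width matches the token vocabulary, is bypassed, and the head's width matches the number of classifications.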
[0082] The computer-readable storage medium may also include additional computer-readable instructions that when executed by a computer, cause the computer to perform additional operations comprising after performing the second training process obtaining test sequencing data derived from a test sample obtained from a test subject, determining, based on the test sequencing data, quantitative measures for the plurality of genomic regions, individual quantitative measures indicating a number of nucleic acid molecules included in the test sample that (i) correspond to an individual genomic region of the plurality of genomic regions and (ii) have a number of methylated CpGs that correspond to the threshold amount of CpGs, determining, based on the quantitative measures, tokens for the test sample, individual tokens indicating a partition of a plurality of partitions for an individual genomic region of the plurality of genomic regions, providing the tokens as input to the second machine learning model, and determining, based on implementing the second machine learning model with respect to the tokens, a classification of the one or more classifications for the test subject.
[0083] Other technical features may be readily apparent to one skilled in the art from the following figures, descriptions, and claims.
DEFINITIONS
[0084] In order for the present disclosure to be more readily understood, certain terms are first defined below. Additional definitions for the following terms and other terms may be set forth through the specification. If a definition of a term set forth below is inconsistent with a definition in an application or patent that is incorporated by reference, the definition set forth in this application should be used to understand the meaning of the term.
[0085] As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise. Thus, for example, a reference to “a method” includes one or more methods, and / or steps of the type described herein and / or which will become apparent to those persons of ordinary skill in the art upon reading this disclosure and so forth.
[0086] It is also to be understood that the terminology used herein is for the purpose of describing particular implementations only, and is not intended to be limiting. Further, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. In describing and claiming the methods, computer readable media, and systems, the following terminology, and grammatical variants thereof, will be used in accordance with the definitions set forth below.
[0087] About: As used herein, “about” or “approximately” as applied to one or more values or elements of interest, refers to a value or element that is similar to a stated reference value or element. In certain implementations, the term “about” or “approximately” refers to a range of values or elements that falls within 25%, 20%, 19%, 18%, 17%, 16%, 15%, 14%, 13%, 12%, 11%, 10%, 9%, 8%, 7%, 6%, 5%, 4%, 3%, 2%, 1%, or less in either direction (greater than or less than) of the stated reference value or element unless otherwise stated or otherwise evident from the context (except where such number would exceed 100% of a possible value or element).
[0088] Administer: As used herein, “administer” or “administering” a therapeutic agent (e.g., an immunological therapeutic agent) to a subject means to give, apply or bring the composition into contact with the subject. Administration can be accomplished by any of a number of routes, including, for example, topical, oral, subcutaneous, intramuscular, intraperitoneal, intravenous, intrathecal and intradermal.
[0089] Adapter. As used herein, “adapter” refers to a short nucleic acid (e.g., less than about 500 nucleotides, less than about 100 nucleotides, or less than about 50 nucleotides in length) that can be at least partially double-stranded and used to link to either or both ends of a given sample nucleic acid molecule. Adapters can include nucleic acid primer binding sites to permit amplification of a nucleic acid molecule flanked by adapters at both ends, and / or a sequencing primer binding site, including primer binding sites for sequencing applications, such as various next-generation sequencing (NGS) applications. Adapters can also include binding sites for capture probes, such as an oligonucleotide attached to a flow cell support or the like. Adapters can also include a nucleic acid tag as described herein. Nucleic acid tags can be positioned relative to amplification primer and sequencing primer binding sites, such that a nucleic acid tag is included in amplicons and sequence reads of a given nucleic acid molecule. The same or different adapters can be linked to the respective ends of a nucleic acid molecule. In some implementations, the same adapter is linked to the respective ends of the nucleic acid molecule except that the nucleic acid tag differs. In some implementations, the adapter is a Y-shaped adapter in which one end is blunt ended or tailed as described herein, for joining to a nucleic acid molecule, which is also blunt ended or tailed with one or more complementary nucleotides. In still other example implementations, an adapter is a bell-shaped adapter that includes a blunt or tailed end for joining to a nucleic acid molecule to be analyzed. Other examples of adapters include T-tailed and C-tailed adapters.
[0090] Alignment. As used herein, “alignment” or “align” refers to determining whether at least two sequence representations have at least a threshold amount of homology. In one or more examples, the threshold amount of homology can be at least about 90%, at least about 91%, at least about 92%, at least about 93%, at least about 94%, at least about 95%, at least about 96%, at least about 97%, at least about 98%, at least about 99%, at least about 99.5%, or at least about 99.9%. In situations where two sequence representations have at least the threshold amount of homology, the two sequence representations can be referred to as being “aligned.”
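As a simplified illustration of the threshold test in this definition, position-wise percent identity between two equal-length sequences is used below as a stand-in for homology. Practical aligners also handle gaps and unequal lengths; the function names and the 90% default are assumptions for this sketch.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences match."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def is_aligned(seq_a, seq_b, threshold_percent=90.0):
    """True when the identity meets or exceeds the threshold."""
    return percent_identity(seq_a, seq_b) >= threshold_percent


# Hypothetical sequences differing at one of ten positions.
identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")
```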
[0091] Amplify. As used herein, “amplify” or “amplification” in the context of nucleic acids refers to the production of multiple copies of a polynucleotide, or a portion of the polynucleotide, starting from a small amount of the polynucleotide (e.g., a single polynucleotide molecule), where the amplification products or amplicons are generally detectable. Amplification of polynucleotides encompasses a variety of chemical and enzymatic processes.
[0092] Barcode: As used herein, “barcode” or “molecular barcode” in the context of nucleic acids refers to a nucleic acid molecule comprising a sequence that can serve as a molecular identifier. For example, individual “barcode” sequences can be added to each DNA fragment during next-generation sequencing (NGS) library preparation so that each read can be identified and sorted before the final data analysis.
[0093] Cancer Type: As used herein, “cancer type” refers to a type or subtype of cancer defined, e.g., by histopathology. Cancer type can be defined by any conventional criterion, such as on the basis of occurrence in a given tissue (e.g., blood cancers, central nervous system (CNS), brain cancers, lung cancers (small cell and non-small cell), skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, breast cancers, prostate cancers, ovarian cancers, lung cancers, intestinal cancers, soft tissue cancers, neuroendocrine cancers, gastroesophageal cancers, head and neck cancers, gynecological cancers, colorectal cancers, urothelial cancers, solid state cancers, heterogeneous cancers, homogenous cancers), unknown primary origin and the like, and / or of the same cell lineage (e.g., carcinoma, sarcoma, lymphoma, cholangiocarcinoma, leukemia, mesothelioma, melanoma, or glioblastoma) and / or cancers exhibiting cancer markers, such as Her2, CA15-3, CA19-9, CA-125, CEA, AFP, PSA, HCG, hormone receptor and NMP-22. Cancers can also be classified by stage (e.g., stage 1, 2, 3, or 4) and whether of primary or secondary origin.
[0094] Carrier Signal: As used herein, “carrier signal” refers to any intangible medium that is capable of storing, encoding, or carrying transitory or non-transitory instructions 702 for execution by the machine 700, and includes digital or analog communications signals or other intangible medium to facilitate communication of such instructions 702. Instructions 702 may be transmitted or received over the network 734 using a transitory or non-transitory transmission medium via a network interface device and using any one of a number of well-known transfer protocols.
[0095] Cell-Free Nucleic Acid: As used herein, “cell-free nucleic acid” refers to nucleic acids not contained within or otherwise bound to a cell or, in some implementations, nucleic acids remaining in a sample following the removal of intact cells. Cell-free nucleic acids can include, for example, all non-encapsulated nucleic acids sourced from a bodily fluid (e.g., blood, plasma, serum, urine, cerebrospinal fluid (CSF), etc.) from a subject. Cell-free nucleic acids include DNA (cfDNA), RNA (cfRNA), and hybrids thereof, including genomic DNA, mitochondrial DNA, circulating DNA, siRNA, miRNA, circulating RNA (cRNA), tRNA, rRNA, small nucleolar RNA (snoRNA), Piwi-interacting RNA (piRNA), long non-coding RNA (long ncRNA), and / or fragments of any of these. Cell-free nucleic acids can be double-stranded, single-stranded, or a hybrid thereof. A cell-free nucleic acid can be released into bodily fluid through secretion or cell death processes, e.g., cellular necrosis, apoptosis, or the like. Some cell-free nucleic acids are released into bodily fluid from cancer cells, e.g., circulating tumor DNA (ctDNA). Others are released from healthy cells. CtDNA can be non-encapsulated tumor-derived fragmented DNA. A cell-free nucleic acid can have one or more epigenetic modifications, for example, a cell-free nucleic acid can be acetylated, 5-methylated, ubiquitylated, phosphorylated, sumoylated, ribosylated, and / or citrullinated.
[0096] Cellular Nucleic Acids: As used herein, “cellular nucleic acids” means nucleic acids that are disposed within one or more cells at least at the point a sample is taken or collected from a subject, even if those nucleic acids are subsequently removed as part of a given analytical process.
[0097] Classification Region: As used herein, “classification region” refers to a genomic region that may show sequence-independent changes in neoplastic cells (e.g., tumor cells and cancer cells) or that may show sequence-independent changes in cfDNA from subjects having cancer relative to cfDNA from subjects in which cancer is not present. Examples of sequence-independent changes include, but are not limited to, changes in methylation rate (increases or decreases), nucleosome distribution, CTCF binding, transcription start sites, and regulatory protein binding regions. In one or more examples, sequence-independent changes in a classification region can indicate the presence of a single form of cancer in a subject. In one or more additional examples, sequence-independent changes in a classification region can correspond to the presence of multiple forms of cancer in a subject. The classification region can be enriched by one or more probes. In addition, the classification region can be defined by a pair of primer binding sites. Further, the classification region can be defined by a predetermined beginning genomic locus and a predetermined ending genomic locus. The classification region can include from about 25 nucleotides to about 1500 nucleotides, from about 50 nucleotides to about 1000 nucleotides, from about 75 nucleotides to about 500 nucleotides, from about 25 nucleotides to about 250 nucleotides, from about 50 nucleotides to about 200 nucleotides, or from about 75 nucleotides to about 150 nucleotides. For instance, a classification region can be a differentially methylated region.
“Differentially methylated region” or “DMR” refers to a region of a genome comprised of nucleic acids having a detectably different degree of methylation in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type; or having a detectably different degree of methylation in at least one cell or tissue type obtained from a subject having a disease or disorder relative to the degree of methylation in the same region of DNA in the same cell or tissue type obtained from a healthy subject. In some embodiments, a differentially methylated region has a detectably higher degree of methylation (e.g., a hypermethylated region / hypermethylated target region) in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type that contribute to cfDNA in healthy individuals, or from the same cell or tissue type from a healthy subject. In some embodiments, a differentially methylated region has a detectably lower degree of methylation (e.g., a hypomethylated region / hypomethylated target region) in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type, such as other immune cell types and / or cell types that contribute to cfDNA in healthy individuals, or from the same cell or tissue type from a healthy subject. In some embodiments, the classification regions comprise hypermethylated target regions and / or hypomethylated target regions.
[0098] Communications Network: As used herein, “communications network” refers to one or more portions of a network 114, 1034 that may be an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), the Internet, a portion of the Internet, a portion of the Public Switched Telephone Network (PSTN), a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, a network 114, 1034 or a portion of a network may include a wireless or cellular network and the coupling may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or other type of cellular or wireless coupling. In this example, the coupling may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1xRTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standard, others defined by various standard setting organizations, other long range protocols, or other data transfer technology.
[0099] Confidence Interval. As used herein, “confidence interval” means a range of values so defined that there is a specified probability that the value of a given parameter lies within that range of values.
[0100] Control Sample: As used herein, “control sample” or “reference sample” refers to a sample obtained from individuals without known copy number variation.
[0101] Coverage: As used herein, “coverage” or “coverage metrics” refer to the number of nucleic acid molecules or sequencing reads that correspond to a particular genomic region of a reference sequence.
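Coverage as defined above can be sketched as a count of molecules whose mapped interval overlaps a genomic region. The sketch assumes half-open coordinates (start inclusive, end exclusive) and a simple tuple representation; both are illustrative choices, and the coordinates are hypothetical.

```python
def coverage(molecules, region):
    """Count molecules overlapping a region.

    molecules: iterable of (chromosome, start, end) tuples
    region:    (chromosome, start, end) tuple
    Coordinates are assumed half-open: start inclusive, end exclusive.
    """
    chrom, start, end = region
    return sum(
        1
        for m_chrom, m_start, m_end in molecules
        if m_chrom == chrom and m_start < end and m_end > start
    )


# Hypothetical mapped molecules and target region.
molecules = [
    ("chr1", 100, 250),
    ("chr1", 240, 400),
    ("chr1", 500, 650),
    ("chr2", 100, 250),
]
region_coverage = coverage(molecules, ("chr1", 200, 300))
```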
[0102] Deoxyribonucleic Acid or Ribonucleic Acid: As used herein, “deoxyribonucleic acid” or “DNA” refers to a natural or modified nucleotide which has a hydrogen group at the 2'-position of the sugar moiety. DNA can include a chain of nucleotides comprising four types of nucleotide bases: adenine (A), thymine (T), cytosine (C), and guanine (G). As used herein, “ribonucleic acid” or “RNA” refers to a natural or modified nucleotide which has a hydroxyl group at the 2'-position of the sugar moiety. RNA can include a chain of nucleotides comprising four types of nucleotides: A, uracil (U), G, and C. As used herein, the term “nucleotide” refers to a natural nucleotide or a modified nucleotide. Certain pairs of nucleotides specifically bind to one another in a complementary fashion (called complementary base pairing). In DNA, adenine (A) pairs with thymine (T) and cytosine (C) pairs with guanine (G). In RNA, adenine (A) pairs with uracil (U) and cytosine (C) pairs with guanine (G). When a first nucleic acid strand binds to a second nucleic acid strand made up of nucleotides that are complementary to those in the first strand, the two strands bind to form a double strand. As used herein, “nucleic acid sequencing data”, “nucleic acid sequencing information”, “sequence information”, “sequence representation”, “nucleic acid sequence”, “nucleotide sequence”, “genomic sequence”, “genetic sequence”, “fragment sequence”, “sequencing read”, or “nucleic acid sequencing read” denotes any information or data that is indicative of the order and identity of the nucleotide bases (e.g., adenine, guanine, cytosine, and thymine or uracil) in a molecule (e.g., a whole genome, whole transcriptome, exome, oligonucleotide, polynucleotide, or fragment) of a nucleic acid such as DNA or RNA. 
It should be understood that the present teachings contemplate sequence information obtained using all available varieties of techniques, platforms or technologies, including, but not limited to: capillary electrophoresis, microarrays, ligation-based systems, polymerase-based systems, hybridization-based systems, direct or indirect nucleotide identification systems, pyrosequencing, ion- or pH-based detection systems, and electronic signature-based systems.
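The complementary base pairing rules stated in the definition above (A with T or U, C with G) can be illustrated with a short sketch that derives the complementary strand of a sequence. The function name and the `rna` flag are hypothetical.

```python
# Pairing tables taken from the base pairing rules stated in the definition.
DNA_PAIRS = str.maketrans("ACGT", "TGCA")
RNA_PAIRS = str.maketrans("ACGU", "UGCA")


def reverse_complement(sequence, rna=False):
    """Return the complementary strand, read in the conventional 5' to 3' direction."""
    table = RNA_PAIRS if rna else DNA_PAIRS
    return sequence.upper().translate(table)[::-1]


dna_strand = reverse_complement("ATGC")
rna_strand = reverse_complement("AUGC", rna=True)
```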
[0103] Differentially Methylated Region: As used herein, “differentially methylated region” refers to a region of DNA having a detectably different degree of methylation in at least one cell or tissue type relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type; or having a detectably different degree of methylation in at least one cell or tissue type obtained from a subject having a disease or disorder relative to the degree of methylation in the same region of DNA in the same cell or tissue type obtained from a healthy subject. In some embodiments, a differentially methylated region has a detectably higher degree of methylation (e.g., a hypermethylated region) in at least one cell or tissue type, such as at least one immune cell type, relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type, such as other immune cell types and / or cell types that contribute to cfDNA in healthy individuals, or from the same cell or tissue type from a healthy subject. In some embodiments, a differentially methylated region has a detectably lower degree of methylation (e.g., a hypomethylated region) in at least one cell or tissue type, such as at least one immune cell type, relative to the degree of methylation in the same region of DNA from at least one other cell or tissue type, such as other immune cell types and / or cell types that contribute to cfDNA in healthy individuals, or from the same cell or tissue type from a healthy subject.
[0104] Driver Mutation: As used herein, “driver mutation” means a mutation that drives cancer progression.
[0105] Epigenetic Target Regions: As used herein, “epigenetic target regions” refers to target regions that may show sequence-independent differences in different cell or tissue types (e.g., different types of immune cells) or in neoplastic cells (e.g., tumor cells and cancer cells) relative to normal cells; or that may show sequence-independent differences (i.e., in which there is no change to the nucleotide sequence, e.g., differences in methylation, nucleosome distribution, or other epigenetic features) in DNA, such as cfDNA, from different cell types or from subjects having cancer relative to DNA, such as cfDNA, from healthy subjects, or in cfDNA originating from different cell or tissue types that ordinarily do not substantially contribute to cfDNA (e.g., immune, lung, colon, etc.) relative to background cfDNA (e.g., cfDNA that originated from hematopoietic cells). Examples of sequence-independent changes include, but are not limited to, changes in methylation (increases or decreases), nucleosome distribution, cfDNA fragmentation patterns, CCCTC-binding factor (“CTCF”) binding, transcription start sites (e.g., with respect to any one or more of binding of RNA polymerase components, binding of regulatory proteins, fragmentation characteristics, and nucleosomal distribution), and regulatory protein binding regions. Epigenetic target region sets thus include, but are not limited to, hypermethylation target region sets, hypomethylation target region sets, and fragmentation variable target region sets, such as CTCF binding sites and transcription start sites.
For present purposes, loci susceptible to neoplasia-, tumor-, or cancer-associated focal amplifications and / or gene fusions may also be included in an epigenetic target region set because detection of a change in copy number by sequencing or a fused sequence that maps to more than one locus in a reference genome tends to be more similar to detection of exemplary epigenetic changes discussed above than detection of nucleotide substitutions, insertions, or deletions, e.g., in that the focal amplifications and / or gene fusions can be detected at a relatively shallow depth of sequencing because their detection does not depend on the accuracy of base calls at one or a few individual positions. An epigenetic target region set is a set of epigenetic target regions.
[0106] Hypermethylation: As used herein, “hypermethylation” refers to an increased level or degree of methylation of nucleic acid molecule(s) relative to the other nucleic acid molecules within a population (e.g., sample) of nucleic acid molecules from the same genomic locus. In some embodiments, hypermethylated DNA can include DNA molecules comprising at least 1 methylated cytosine, at least 2 methylated cytosines, at least 3 methylated cytosines, at least 5 methylated cytosines, or at least 10 methylated cytosines.
[0107] Hypomethylation: As used herein, “hypomethylation” refers to a decreased level or degree of methylation of nucleic acid molecule(s) relative to the other nucleic acid molecules within a population (e.g., sample) of nucleic acid molecules from the same genomic locus. In some embodiments, hypomethylated DNA includes unmethylated DNA molecules. In some embodiments, hypomethylated DNA can include DNA molecules comprising 0 methylated cytosine, at most 1 methylated cytosine, at most 2 methylated cytosines, at most 3 methylated cytosines, at most 4 methylated cytosines, or at most 5 methylated cytosines.
[0108] Immunotherapy: As used herein, “immunotherapy” refers to treatment with one or more agents that act to stimulate the immune system so as to kill or at least to inhibit growth of cancer cells, and preferably to reduce further growth of the cancer, reduce the size of the cancer and / or eliminate the cancer. Some such agents bind to a target present on cancer cells; some bind to a target present on immune cells and not on cancer cells; some bind to a target present on both cancer cells and immune cells. Such agents include, but are not limited to, checkpoint inhibitors and / or antibodies. Checkpoint inhibitors are inhibitors of pathways of the immune system that maintain self-tolerance and modulate the duration and amplitude of physiological immune responses in peripheral tissues to minimize collateral tissue damage (see, e.g., Pardoll, Nature Reviews Cancer 12, 252-264 (2012)). Example agents include antibodies against any of PD-1, PD-2, PD-L1, PD-L2, CTLA-4, OX40, B7.1, B7He, LAG3, CD137, KIR, CCR5, CD27, or CD40. Other example agents include proinflammatory cytokines, such as IL-1β, IL-6, and TNF-α. Other example agents are T-cells activated against a tumor, such as T-cells activated by expressing a chimeric antigen targeting a tumor antigen recognized by the T-cell.
[0109] Indel: As used herein, “indel” refers to a mutation that involves the insertion or deletion of nucleotides in the genome of a subject.
[0110] Limit of Detection (LoD). As used herein, “limit of detection” means the smallest amount of a substance (e.g., a nucleic acid) in a sample that can be measured by a given assay or analytical approach.
[0111] Machine-Readable Medium: As used herein, “machine-readable medium” refers to a component, device, or other tangible media able to store instructions 702 and data temporarily or permanently and may include, but is not limited to, random-access memory (RAM), read-only memory (ROM), buffer memory, flash memory, optical media, magnetic media, cache memory, other types of storage (e.g., electrically erasable programmable read-only memory (EEPROM)) and / or any suitable combination thereof. The term “machine-readable medium” may be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions 702. The term “machine-readable medium” shall also be taken to include any medium, or combination of multiple media, that is capable of storing instructions 702 (e.g., code) for execution by a machine 700, such that the instructions 702, when executed by one or more processors 704 of the machine 700, cause the machine 700 to perform any one or more of the methodologies described herein. Accordingly, a “machine-readable medium” refers to a single storage apparatus or device, as well as “cloud-based” storage systems or storage networks that include multiple storage apparatus or devices. The term “machine-readable medium” excludes signals per se.
[0112] Maximum MAF. As used herein, “maximum MAF” or “max MAF” refers to the maximum MAF (mutant allele fraction) of all somatic variants in a sample.
[0113] Methylation: As used herein, “methylation” or “DNA methylation” refers to addition of a methyl group to a nucleotide base in a nucleic acid molecule. In some embodiments, methylation refers to addition of a methyl group to a cytosine at a CpG site (cytosine-phosphate-guanine site, i.e., a cytosine followed by a guanine in a 5’ -> 3’ direction of the nucleic acid sequence). In some embodiments, DNA methylation refers to addition of a methyl group to adenine, such as in N6-methyladenine. In some embodiments, DNA methylation is 5-methylation (modification of the 5th carbon of the 6-carbon ring of cytosine). In some embodiments, 5-methylation refers to addition of a methyl group to the 5C position of the cytosine to create 5-methylcytosine (5mC). In some embodiments, methylation comprises a derivative of 5mC. Derivatives of 5mC include, but are not limited to, 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC), and 5-carboxylcytosine (5-caC). In some embodiments, DNA methylation is 3C methylation (modification of the 3rd carbon of the 6-carbon ring of cytosine). In some embodiments, 3C methylation comprises addition of a methyl group to the 3C position of the cytosine to generate 3-methylcytosine (3mC). Methylation can also occur at non-CpG sites; for example, methylation can occur at a CpA, CpT, or CpC site. DNA methylation can change the activity of a methylated DNA region. For example, when DNA in a promoter region is methylated, transcription of the gene may be repressed. DNA methylation is critical for normal development, and abnormality in methylation may disrupt epigenetic regulation. The disruption, e.g., repression, in epigenetic regulation may cause diseases, such as cancer. Promoter methylation in DNA may be indicative of cancer.
[0114] Methylation-Dependent Nuclease: As used herein, “methylation-dependent nuclease” refers to a nuclease that preferentially cuts methylated DNA relative to unmethylated DNA. For example, a methylation-dependent nuclease may cut at or near a recognition sequence such as a restriction site in a manner dependent on methylation of at least one of the nucleobases in the recognition sequence, such as a cytosine. In some embodiments, the nucleolytic activity of the methylation-dependent nuclease is at least 10, 20, 50, or 100-fold higher on a methylated recognition site relative to an unmethylated control in a standard nucleolysis assay. Methylation-dependent nucleases include methylation-dependent restriction enzymes.
[0115] Methylation-Dependent Restriction Enzyme: As used herein, “methylation-dependent restriction enzyme” or “MDRE” refers to a restriction enzyme that is dependent on methylation of the DNA (e.g., cytosine methylation), i.e., the presence or absence of a methyl group in a nucleotide base alters the rate at which the enzyme cleaves the target DNA. In some embodiments, the methylation-dependent restriction enzymes do not cleave the DNA if a particular nucleotide base is unmethylated at the recognition sequence. For example, MspJI is a methylation-dependent restriction enzyme with a recognition sequence “mCNNR(N9)” and it does not cleave DNA in the absence of the methylated cytosine (mC) in the recognition sequence.
[0116] Methylation-Sensitive Nuclease: As used herein, “methylation-sensitive nuclease” refers to a nuclease that preferentially cuts unmethylated DNA relative to methylated DNA. For example, a methylation-sensitive nuclease may cut at or near a recognition sequence such as a restriction site in a manner dependent on lack of methylation of at least one of the nucleobases in the recognition sequence, such as a cytosine. In some embodiments, the nucleolytic activity of the methylation-sensitive nuclease is at least 10, 20, 50, or 100-fold higher on an unmethylated recognition site relative to a methylated control in a standard nucleolysis assay. Methylation-sensitive nucleases include methylation-sensitive restriction enzymes.
[0117] Methylation Sensitive Restriction Enzyme: As used herein, “methylation sensitive restriction enzyme” or “MSRE” refers to a restriction enzyme that is sensitive to the methylation status of the DNA (e.g., cytosine methylation), i.e., the presence or absence of a methyl group in a nucleotide base alters the rate at which the enzyme cleaves the target DNA. In some embodiments, the methylation sensitive restriction enzymes do not cleave the DNA if a particular nucleotide base is methylated at the recognition sequence. For example, HpaII is a methylation sensitive restriction enzyme with a recognition sequence “CCGG” and it does not cleave DNA if the second cytosine in the recognition sequence is methylated.
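The HpaII example above can be illustrated with a minimal sketch. The function name `hpaii_cut_sites`, the 0-based coordinate convention, and the representation of methylation as a set of methylated cytosine positions are assumptions made for illustration only; the sketch models just the blocking rule stated in the definition and not enzyme kinetics or strand-specific behavior.

```python
import re


def hpaii_cut_sites(seq, methylated_positions):
    """Return 0-based start indices of CCGG sites that would be cleaved.

    Assumption (illustrative): a CCGG site is blocked only when its second
    cytosine, at index start + 1, is listed as methylated.
    """
    seq = seq.upper()
    cuttable = []
    # A zero-width lookahead reports every occurrence of the recognition site.
    for match in re.finditer(r"(?=CCGG)", seq):
        start = match.start()
        if (start + 1) not in methylated_positions:
            cuttable.append(start)
    return cuttable


# Hypothetical fragment with CCGG sites starting at indices 2 and 9.
fragment = "ATCCGGTTACCGGA"
unmethylated_sites = hpaii_cut_sites(fragment, set())
partly_methylated_sites = hpaii_cut_sites(fragment, {10})
```

With no methylation both sites are reported as cleavable; marking position 10 (the second cytosine of the second site) as methylated leaves only the first site.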
[0118] Methylation rate: As used herein, “methylation rate” refers to the probability, likelihood, or percentage that a given base (for example: cytosine residue in a CpG) is methylated on a DNA molecule at a particular genomic region analyzed in the sample. In some embodiments, the methylation rate may be applied to a defined region that comprises one or more potentially methylated bases. In some embodiments, the methylation rate refers to the percentage of CpG residues methylated in a DNA molecule. In some embodiments, the methylation rate refers to the percentage of CpG residues methylated in molecules aligned to particular genomic position or genomic region. Methylation rate can be measured by a variety of methods including, but not limited to, either using bisulfite sequencing (any single base resolution like TAPS, EM-SEQ, etc.) or using partitioning (DNA molecule resolution). Methylation rate can be measured in different ways. One estimation can be by counting how many DNA fragments end up in each methylation dependent partition or by counting the number of converted CpGs per fragment in the case of bisulfite sequencing or any other base-level resolution sequencing methods. In addition, in the case of methylation dependent partitioning, the rate calculation can be normalized using a set of predefined regions with known methylation state (i.e., positive control regions and / or negative control regions) or spiked-in synthetic DNA with known methylation state, deriving rate-parametrized partition distributions and estimating the rate using a maximum likelihood approach. In one or more examples, the methylation rate can be determined by determining an abundance of sequencing reads that correspond to a portion of a genomic region. The portion of the genomic region can include a number of genomic locations of the genomic region for which at least a threshold number of sequencing reads overlap.
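As a non-limiting illustration of the counting-based estimates described above, the following sketch computes a per-molecule rate, a pooled rate for molecules aligned to a region, and the share of fragments in each partition. The function names, the representation of each molecule as a list of per-CpG boolean calls, and the partition labels are assumptions for illustration; the control-region normalization and maximum likelihood estimation mentioned above are not implemented here.

```python
def molecule_methylation_rate(cpg_calls):
    """Fraction of CpG sites called methylated on a single molecule."""
    if not cpg_calls:
        raise ValueError("molecule covers no CpG sites")
    return sum(cpg_calls) / len(cpg_calls)


def region_methylation_rate(molecules):
    """Fraction of methylated CpG calls pooled over molecules in a region."""
    total_calls = sum(len(calls) for calls in molecules)
    if total_calls == 0:
        raise ValueError("no CpG calls in region")
    methylated_calls = sum(sum(calls) for calls in molecules)
    return methylated_calls / total_calls


def partition_fractions(fragment_counts):
    """Share of fragments observed in each methylation-dependent partition."""
    total = sum(fragment_counts.values())
    if total == 0:
        raise ValueError("no fragments counted")
    return {name: count / total for name, count in fragment_counts.items()}


# Hypothetical inputs: True marks a CpG called methylated.
one_molecule = [True, True, False, True]
region = [[True, False], [True, True], [False, False]]
counts = {"hyper": 20, "intermediate": 30, "hypo": 50}

per_molecule = molecule_methylation_rate(one_molecule)
pooled = region_methylation_rate(region)
shares = partition_fractions(counts)
```

The pooled estimate weights each CpG call equally, so molecules covering more CpG sites contribute more; averaging per-molecule rates instead would be an equally reasonable alternative under the definition.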
[0119] Methylation Status: As used herein, “methylation status” or “methylation state” can refer to the presence or absence of a methyl group on a DNA base (e.g., cytosine) at a particular genomic position in a nucleic acid molecule. It can also refer to the degree of methylation in a nucleic acid sequence (e.g., highly methylated, low methylated, intermediately methylated or unmethylated nucleic acid molecules). The methylation status can also refer to the number of nucleotides methylated in a particular nucleic acid molecule.
[0120] Modified Nucleotide Specific Binding Reagent: As used herein, “modified nucleotide specific binding reagent” refers to a binding reagent that is specific for, or targets, modified nucleotides. For example, a modified nucleotide can be a nucleotide that has been methylated; thus, the binding reagent can be specific for a methylated nucleotide. Examples of binding reagents include, but are not limited to, a methyl binding domain (MBD) of a methylation binding protein (“MBP”) or variants thereof, an antibody (and antibody variants, e.g., single chain antibodies), aptamers, or combinations thereof. Thus, as disclosed throughout, the use of MBD can be exchanged for any other modified nucleotide specific binding reagent, provided the modified nucleotide specific binding reagent has the desired specificity and affinity for the specific modified base of interest in the selected implementation.
[0121] Mutant Allele Fraction. As used herein, “mutant allele fraction”, “mutation dose,” or “MAF” refers to the fraction of nucleic acid molecules harboring an allelic alteration or mutation at a given genomic position in a given sample. MAF is generally expressed as a fraction or a percentage. For example, an MAF can be less than about 0.5, 0.1, 0.05, or 0.01 (i.e., less than about 50%, 10%, 5%, or 1%) of all somatic variants or alleles present at a given locus.
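The MAF definition, together with the maximum MAF defined in paragraph [0112], can be sketched as follows. The function names, the variant labels, and the molecule counts are hypothetical and serve only to show the arithmetic.

```python
def mutant_allele_fraction(mutant_molecules, total_molecules):
    """Fraction of molecules at a genomic position that carry the alteration."""
    if total_molecules <= 0:
        raise ValueError("total_molecules must be positive")
    if not 0 <= mutant_molecules <= total_molecules:
        raise ValueError("mutant_molecules must lie between 0 and total_molecules")
    return mutant_molecules / total_molecules


def max_maf(somatic_variant_counts):
    """Largest MAF over all somatic variants observed in a sample.

    somatic_variant_counts maps a variant label to a
    (mutant_molecules, total_molecules) pair.
    """
    if not somatic_variant_counts:
        raise ValueError("no somatic variants provided")
    return max(
        mutant_allele_fraction(mutant, total)
        for mutant, total in somatic_variant_counts.values()
    )


# Hypothetical counts for three somatic variants in one sample.
sample_variants = {
    "variant_1": (5, 1000),   # MAF 0.005 (0.5%)
    "variant_2": (30, 2000),  # MAF 0.015 (1.5%)
    "variant_3": (2, 800),    # MAF 0.0025 (0.25%)
}
sample_max_maf = max_maf(sample_variants)
```

In this example the max MAF is 0.015, which in some instances may also serve as the tumor fraction estimate for the sample.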
[0122] Mutation. As used herein, “mutation” refers to a variation from a known reference sequence and includes mutations such as, for example, single nucleotide variants (SNVs), copy number variants or variations (CNVs) / aberrations, insertions or deletions (indels), gene fusions, transversions, translocations, frame shifts, duplications, repeat expansions, and epigenetic variants. A mutation can be a germline or somatic mutation. In some examples, a reference sequence for purposes of comparison is a wildtype genomic sequence of the species of the subject providing a test sample, typically the human genome.
[0123] Mutation Caller. As used herein, “mutation caller” means an algorithm (embodied in software or otherwise computer implemented) that is used to identify mutations in test sample data (e.g., sequence information obtained from a subject).
[0124] Mutation Count: As used herein, “mutation count” or “mutational count” refers to the number of somatic mutations in a whole genome or exome or targeted regions of a nucleic acid sample.
[0125] Negative Control Region: As used herein, “negative control region” refers to a genomic region that is expected to be unmethylated or hypomethylated in essentially all samples, regardless of whether the DNA is derived from a cancer cell or a normal cell.
[0126] Neoplasm: As used herein, the terms “neoplasm” and “tumor” are used interchangeably. They refer to abnormal growth of cells in a subject. A neoplasm or tumor can be benign, potentially malignant, or malignant. A malignant tumor is referred to as a cancer or a cancerous tumor.
[0127] Next Generation Sequencing: As used herein, “next generation sequencing” or “NGS” refers to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches, for example, with the ability to generate hundreds of thousands of relatively small sequencing reads at a time. Some examples of next generation sequencing techniques include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization.
[0128] Nucleic Acid Tag: As used herein, “nucleic acid tag” refers to a short nucleic acid (e.g., less than about 500 nucleotides, about 100 nucleotides, about 50 nucleotides, or about 10 nucleotides in length), used to distinguish nucleic acids from different samples (e.g., representing a sample index), or different nucleic acid molecules in the same sample (e.g., representing a molecular barcode), of different types, or which have undergone different processing. The nucleic acid tag comprises a predetermined, fixed, non-random, random or semi-random oligonucleotide sequence. Such nucleic acid tags may be used to label different nucleic acid molecules or different nucleic acid samples or sub-samples. Nucleic acid tags can be single-stranded, double-stranded, or at least partially double-stranded. Nucleic acid tags optionally have the same length or varied lengths. Nucleic acid tags can also include double-stranded molecules having one or more blunt-ends, include 5’ or 3’ single-stranded regions (e.g., an overhang), and / or include one or more other single-stranded regions at other locations within a given molecule. Nucleic acid tags can be attached to one end or to both ends of the other nucleic acids (e.g., sample nucleic acids to be amplified and / or sequenced). Nucleic acid tags can be decoded to reveal information such as the sample of origin, form, or processing of a given nucleic acid. For example, nucleic acid tags can also be used to enable pooling and / or parallel processing of multiple samples comprising nucleic acids bearing different molecular barcodes and / or sample indexes in which the nucleic acids are subsequently being deconvolved by detecting (e.g., reading) the nucleic acid tags. Nucleic acid tags can also be referred to as identifiers (e.g., molecular identifier, sample identifier).
Additionally, or alternatively, nucleic acid tags can be used as molecular identifiers (e.g., to distinguish between different molecules or amplicons of different parent molecules in the same sample or sub-sample). This includes, for example, uniquely tagging different nucleic acid molecules in a given sample, or non-uniquely tagging such molecules. In the case of non-unique tagging applications, a limited number of tags (i.e., molecular barcodes) may be used to tag each nucleic acid molecule such that different molecules can be distinguished based on their endogenous sequence information (for example, start and / or stop positions where they map to a selected reference sequence, a sub-sequence of one or both ends of a sequence, and / or length of a sequence) in combination with at least one molecular barcode. A sufficient number of different molecular barcodes are used such that there is a low probability (e.g., less than about a 10%, less than about a 5%, less than about a 1%, or less than about a 0.1% chance) that any two molecules may have the same endogenous sequence information (e.g., start and / or stop positions, subsequences of one or both ends of a sequence, and / or lengths) and also have the same molecular barcode.
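The low-probability condition for non-unique tagging can be examined with a short calculation. The sketch below assumes, purely for illustration, that each molecule receives one barcode drawn uniformly and independently from a pool of equally likely barcodes; real barcode assignment may deviate from this model, and the function names are hypothetical.

```python
def pairwise_collision_probability(n_barcodes):
    """Chance that two molecules with identical endogenous sequence
    information also receive the same barcode (uniform, independent draws)."""
    if n_barcodes <= 0:
        raise ValueError("n_barcodes must be positive")
    return 1.0 / n_barcodes


def any_collision_probability(k_molecules, n_barcodes):
    """Chance that at least two of k molecules sharing the same endogenous
    sequence information carry the same barcode (birthday-problem form)."""
    if n_barcodes <= 0:
        raise ValueError("n_barcodes must be positive")
    if k_molecules > n_barcodes:
        return 1.0  # more molecules than barcodes forces a repeat
    p_all_distinct = 1.0
    for i in range(k_molecules):
        p_all_distinct *= (n_barcodes - i) / n_barcodes
    return 1.0 - p_all_distinct


# Hypothetical pool sizes.
p_pair = pairwise_collision_probability(100)
p_three = any_collision_probability(3, 100)
```

Under these assumptions a pool of 100 barcodes gives a 1% chance that two molecules with identical start and stop positions are indistinguishable, and the chance grows as more molecules share the same endogenous sequence information.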
[0129] Partitioning: As used herein, “partitioning” refers to physically separating or fractionating a mixture of nucleic acid molecules in a sample based on a characteristic of the nucleic acid molecules. The partitioning can be physical partitioning of molecules. Partitioning can involve separating the nucleic acid molecules into groups or sets based on the level of an epigenetic feature (e.g., methylation). For example, the nucleic acid molecules can be partitioned based on the level of methylation of the nucleic acid molecules. In some embodiments, the methods and systems used for partitioning may be found in PCT Patent Application No. PCT / US2017 / 068329, which is hereby incorporated by reference in its entirety.
[0130] Partitioned set: As used herein, “partitioned set” or “partition” refers to a set of nucleic acid molecules partitioned into a set or group based on the differential binding affinity of the nucleic acid molecules or proteins associated with the nucleic acid molecules to a binding agent. A partitioned set may also be referred to as a subsample. The binding agent binds preferentially to the nucleic acid molecules comprising nucleotides with epigenetic modification. For example, if the epigenetic modification is methylation, the binding agent can be a methyl binding domain (MBD) protein. In some embodiments, a partitioned set can comprise nucleic acid molecules belonging to a particular level or degree of an epigenetic feature (e.g., methylation). For example, the nucleic acid molecules can be partitioned into three sets - one set for highly methylated nucleic acid molecules (first subsample, hyper partition, hyper partitioned set or hypermethylated partitioned set), a second set for low methylated nucleic acid molecules (second subsample, hypo partition, hypo partitioned set or hypomethylated partitioned set), and a third set for intermediate methylated nucleic acid molecules (third subsample, intermediate partitioned set, intermediately methylated partitioned set, residual partition, or residual partitioned set). In another example, the nucleic acid molecules can be partitioned based on the number of methylated nucleotides - one partitioned set can have nucleic acid molecules with nine methylated nucleotides, and another partitioned set can have unmethylated nucleic acid molecules (zero methylated nucleotides).
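Although partitioning as defined above is a physical separation based on binding affinity, the three-way grouping can be mirrored computationally when per-molecule methylated-nucleotide counts are available. The following sketch is an in silico analogy only; the function names and the cut-offs (at most 1 methylated cytosine for the hypo set, at least 5 for the hyper set) are assumptions chosen from among the example values listed in the hypomethylation and hypermethylation definitions.

```python
from collections import Counter


def assign_partition(n_methylated, hypo_max=1, hyper_min=5):
    """Label a molecule by its number of methylated cytosines.

    hypo_max and hyper_min are illustrative cut-offs, not values
    prescribed for any particular assay.
    """
    if n_methylated < 0:
        raise ValueError("n_methylated must be non-negative")
    if hypo_max >= hyper_min:
        raise ValueError("hypo_max must be smaller than hyper_min")
    if n_methylated <= hypo_max:
        return "hypo"
    if n_methylated >= hyper_min:
        return "hyper"
    return "intermediate"


def partition_counts(methylated_counts, hypo_max=1, hyper_min=5):
    """Tally how many molecules fall into each partition label."""
    return Counter(
        assign_partition(n, hypo_max, hyper_min) for n in methylated_counts
    )


# Hypothetical per-molecule counts of methylated cytosines.
tally = partition_counts([0, 1, 3, 5, 9])
```

For the hypothetical counts shown, two molecules fall in the hypo set, one in the intermediate set, and two in the hyper set.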
[0131] Polynucleotide. As used herein, “polynucleotide”, “nucleic acid”, “nucleic acid molecule”, “polynucleotide molecule”, or “oligonucleotide” refers to a linear polymer of nucleosides (including deoxyribonucleosides, ribonucleosides, or analogs thereof) joined by internucleosidic linkages. A polynucleotide can comprise at least three nucleosides. Oligonucleotides often range in size from a few monomeric units, e.g., 3-4, to hundreds of monomeric units. Whenever a polynucleotide is represented by a sequence of letters, such as “ATGCCTG,” it will be understood that the nucleotides are in 5’ -> 3’ order from left to right and that in the case of DNA, “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes deoxythymidine, unless otherwise noted. The letters A, C, G, and T may be used to refer to the bases themselves, to nucleosides, or to nucleotides comprising the bases, as is standard in the art.
[0132] Positive Control Region: As used herein, “positive control region” refers to a genomic region that is expected to be methylated or hypermethylated in essentially all samples, regardless of whether the DNA is derived from a cancer cell or a normal cell.
[0133] Probe: As used herein, “probe” refers to a polynucleotide comprising a functionality. The functionality can be a detectable label (fluorescent), a binding moiety (biotin), or a solid support (a magnetically attractable particle or a chip). Probes can include single-stranded DNA / RNA polynucleotides or double-stranded DNA polynucleotides that hybridize to target nucleic acid sequences (e.g., SureSelect® probes, Agilent Technologies). Sequence capture using probes generally depends, in part, on the number of consecutive nucleotides in at least a portion of the target nucleic acid sequence that is complementary (or nearly complementary) to the sequence of the probe. In some examples, probes can correspond to driver mutations.
[0134] Processing: As used herein, the terms “processing”, “calculating”, and “comparing” can be used interchangeably. In certain applications, the terms refer to determining a difference, e.g., a difference in number or sequence. For example, gene expression, copy number variation (CNV), indel, and / or single nucleotide variant (SNV) values or sequences can be processed.
[0135] Processor. As used herein, “processor” refers to any circuit or virtual circuit (a physical circuit emulated by logic executing on an actual processor) that manipulates data values according to control signals (e.g., “commands,” “op codes,” “machine code,” etc.) and which produces corresponding output signals that are applied to operate a machine. A processor may, for example, be a CPU, a RISC processor, a CISC processor, a GPU, a DSP, an ASIC, a RFIC or any combination thereof. A processor may further be a multi-core processor having two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously.
[0136] Promoter Region: As used herein, “promoter region” refers to a DNA sequence recognized by the natural machinery of the cell, or introduced synthetic machinery, required to initiate the specific transcription of a gene.
[0137] Quantitative Measures: As used herein, “quantitative measures” refers to an absolute or relative measure. A quantitative measure can be, without limitation, a number, a statistical measurement (e.g., frequency, mean, median, standard deviation, or quantile), or a degree or a relative quantity (e.g., high, medium, and low). A quantitative measure can be a ratio of two quantitative measures. A quantitative measure can be a linear combination of quantitative measures. A quantitative measure may be a normalized measure.
[0138] Reference Sequence: As used herein, “reference sequence” refers to a known sequence used for purposes of comparison with experimentally determined sequences. For example, a known sequence can be an entire genome, a chromosome, or any segment thereof. A reference sequence can include at least about 20, at least about 50, at least about 100, at least about 200, at least about 250, at least about 300, at least about 350, at least about 400, at least about 450, at least about 500, at least about 1000, or more nucleotides. A reference sequence can align with a single contiguous sequence of a genome or chromosome or can include noncontiguous segments that align with different regions of a genome or chromosome. Example reference sequences include human genome reference sequences, such as hg19 and hg38.
[0139] Sample: As used herein, “sample” means anything capable of being analyzed by the methods and / or systems disclosed herein.
[0140] Sensitivity: As used herein, “sensitivity” means the probability of detecting the presence of a single nucleotide variant, an insertion, and a deletion at a given MAF and coverage and the probability of detecting the presence of a copy number variant at a given tumor fraction and coverage.
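Because sensitivity is defined as a probability at a given MAF (or tumor fraction) and coverage, it is commonly estimated empirically from samples with known variants. The sketch below is one such estimate under stated assumptions: the function names, the stratification keys, and the counts are hypothetical, and the observed detection proportion is used as a simple point estimate without confidence intervals.

```python
def empirical_sensitivity(detected, truly_present):
    """Observed proportion of known-present variants that were detected."""
    if truly_present <= 0:
        raise ValueError("truly_present must be positive")
    if not 0 <= detected <= truly_present:
        raise ValueError("detected must lie between 0 and truly_present")
    return detected / truly_present


def sensitivity_by_stratum(results):
    """Estimate sensitivity separately for each (MAF, coverage) stratum.

    results maps a (maf, coverage) pair to (detected, truly_present).
    """
    return {
        stratum: empirical_sensitivity(detected, present)
        for stratum, (detected, present) in results.items()
    }


# Hypothetical validation counts keyed by (MAF, coverage).
validation = {
    (0.005, 5000): (47, 50),
    (0.001, 5000): (30, 50),
}
estimates = sensitivity_by_stratum(validation)
```

Keeping the estimate stratified reflects the definition, in which sensitivity is stated for a given MAF and coverage rather than as a single pooled figure.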
[0141] Sequencing: As used herein, “sequencing” refers to any of a number of technologies used to determine the sequence (e.g., the identity and order of monomer units) of a biomolecule, e.g., a nucleic acid such as DNA or RNA. Example sequencing methods include, but are not limited to, targeted sequencing, single molecule real-time sequencing, exon or exome sequencing, intron sequencing, electron microscopy-based sequencing, panel sequencing, transistor-mediated sequencing, direct sequencing, random shotgun sequencing, Sanger dideoxy termination sequencing, whole-genome sequencing, sequencing by hybridization, pyrosequencing, capillary electrophoresis, duplex sequencing, cycle sequencing, single-base extension sequencing, solid-phase sequencing, high-throughput sequencing, massively parallel signature sequencing, emulsion PCR, co-amplification at lower denaturation temperature-PCR (COLD-PCR), multiplex PCR, sequencing by reversible dye terminator, paired-end sequencing, near-term sequencing, exonuclease sequencing, sequencing by ligation, short-read sequencing, single-molecule sequencing, sequencing-by-synthesis, real-time sequencing, reverse-terminator sequencing, nanopore sequencing, 454 sequencing, Solexa Genome Analyzer sequencing, SOLiD™ sequencing, MS-PET sequencing, and a combination thereof. In some implementations, sequencing can be performed by a gene analyzer such as, for example, gene analyzers commercially available from Illumina, Inc., Pacific Biosciences, Inc., or Applied Biosystems / Thermo Fisher Scientific, among many others.
[0142] Single Nucleotide Variant: As used herein, “single nucleotide variant” or “SNV” means a mutation or variation in a single nucleotide that occurs at a specific position in the genome.
[0143] Somatic Mutation: As used herein, “somatic mutation” means a mutation in the genome that occurs after conception. Somatic mutations can occur in any cell of the body except germ cells and accordingly, are not passed on to progeny.
[0144] Specifically binds: As used herein, “specifically binds” in the context of a probe or other oligonucleotide and a target sequence means that under appropriate hybridization conditions, the oligonucleotide or probe hybridizes to its target sequence, or replicates thereof, to form a stable probe:target hybrid, while at the same time formation of stable probe:non-target hybrids is minimized. Thus, a probe hybridizes to a target sequence or replicate thereof to a sufficiently greater extent than to a non-target sequence, to enable capture or detection of the target sequence. Appropriate hybridization conditions are well-known in the art, may be predicted based on sequence composition, or can be determined by using routine testing methods (see, e.g., Sambrook et al., Molecular Cloning, A Laboratory Manual, 2nd ed. (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1989) at §§ 1.90-1.91, 7.37-7.57, 9.47-9.51 and 11.47-11.57, particularly §§ 9.50-9.51, 11.12-11.13, 11.45-11.47 and 11.55-11.57, incorporated by reference herein).
[0145] Subject: As used herein, “subject” refers to an animal, such as a mammalian species (e.g., human) or avian (e.g., bird) species, or other organism, such as a plant. More specifically, a subject can be a vertebrate, e.g., a mammal such as a mouse, a primate, a simian or a human. Animals include farm animals (e.g., production cattle, dairy cattle, poultry, horses, pigs, and the like), sport animals, and companion animals (e.g., pets or support animals). A subject can be a healthy individual, an individual that has or is suspected of having a disease or a predisposition to the disease, or an individual that is in need of therapy or suspected of needing therapy. The terms “individual” or “patient” are intended to be interchangeable with “subject.”
[0146] For example, a subject can be an individual who has been diagnosed with a cancer, is going to receive a cancer therapy, and / or has received at least one cancer therapy. The subject can be in remission of a cancer. As another example, the subject can be an individual who is diagnosed with an autoimmune disease. As another example, the subject can be a female individual who is pregnant or who is planning on getting pregnant, who may have been diagnosed with or suspected of having a disease, e.g., a cancer or an autoimmune disease.
[0147] Target Region: As used herein, “target region” refers to a genomic locus targeted for identification and / or capture, for example, by using probes (e.g., through sequence complementarity). A “target region set” or “set of target regions” refers to a plurality of genomic loci targeted for identification and / or capture, for example, by using a set of probes (e.g., through sequence complementarity).
[0148] Threshold: As used herein, “threshold” refers to a predetermined value used to characterize experimentally determined values of the same parameter for different samples depending on their relation to the threshold.
[0149] Tumor Fraction: As used herein, “tumor fraction” refers to the estimate of the fraction of nucleic acid molecules derived from a tumor in a given sample. For example, the tumor fraction of a sample can be a measure derived from the max MAF of the sample or pattern of sequencing coverage of the sample or length of the cfDNA fragments in the sample or any other selected feature of the sample. In some instances, the tumor fraction of a sample is equal to the max MAF of the sample.
[0150] Variant: As used herein, a “variant” can be referred to as an allele. A variant is usually presented at a frequency of 50% (0.5) or 100% (1), depending on whether the allele is heterozygous or homozygous. For example, germline variants are inherited and usually have a frequency of 0.5 or 1. Somatic variants, however, are acquired variants and usually have a frequency of < 0.5. Major and minor alleles of a genetic locus refer to nucleic acids harboring the locus in which the locus is occupied by a nucleotide of a reference sequence, and a variant nucleotide different than the reference sequence respectively. Measurements at a locus can take the form of allelic fractions (AFs), which measure the frequency with which an allele is observed in a sample.
DETAILED DESCRIPTION
[0151] Cancer is usually caused by the accumulation of mutations within genes of an individual’s cells, at least some of which result in improperly regulated cell division. Such mutations can include single nucleotide variations (SNVs), gene fusions, insertions, transversions, translocations, and inversions. These mutations can also include copy number variations that correspond to an increase or a decrease in the number of copies of a gene within a tumor genome relative to an individual’s noncancerous cells. An extent of mutations present in cell-free nucleic acids and an amount of mutated cell-free nucleic acids of a sample can be used as biomarkers to determine tumor progression, predict patient outcome, and refine treatment choices. In various examples, the extent of mutations present in cell-free nucleic acids can be indicated by tumor cells copy number and tumor fraction for a given sample.
[0152] Additionally, cancer can be indicated by non-sequence modifications, such as methylation. Examples of methylation changes in cancer include local gains of DNA methylation in the CpG islands at the TSS of genes involved in normal growth control, DNA repair, cell cycle regulation, and / or cell differentiation. This increased amount of methylation can be associated with an aberrant loss of transcriptional capacity of involved genes and occurs at least as frequently as point mutations and deletions as a cause of altered gene expression.
[0153] Thus, DNA methylation profiling can be used to detect aberrant methylation in DNA of a sample. The DNA can correspond to certain genomic regions (“differentially methylated regions” or “DMRs”) that are normally hypermethylated or hypomethylated in a given sample type (e.g., cfDNA from the bloodstream) but which may show an abnormal degree of methylation that correlates to a neoplasm or cancer, e.g., because of unusually increased contributions of tissues to the type of sample (e.g., due to increased shedding of DNA in or around the neoplasm or cancer) and / or from extents of methylation of the genome that are altered during development or that are perturbed by disease, for example, cancer or any cancer-associated disease.
[0154] Tumors can be detected in subjects based on a number of different types of data. For example, tumors can be detected by analyzing genetic data derived from subjects. To illustrate, tumors can be detected by analyzing genomic or somatic mutations indicated by the nucleic acid molecules obtained from the subjects. Additionally, tumors can be detected by analyzing epigenetic data derived from subjects. In one or more examples, tumors can be detected by analyzing methylation data derived from nucleic acid molecules obtained from subjects. In one or more illustrative examples, tumors can be detected by analyzing methylation states of cytosines included in cytosine-guanine dinucleotides located in nucleic acid molecules derived from subjects. In many situations, the amount of nucleic acid molecules derived from subjects in which the specified genetic or epigenetic data being analyzed is present is sparse. As a result, the accuracy of computational models used to analyze the genetic or epigenetic data can be somewhat low. Accordingly, the health and / or treatment of subjects can be impacted detrimentally because of either false positive or false negative results based on inaccurate computational models that operate by analyzing detectable signals related to the biological samples that are relatively small in relation to the total amount of data derived from the biological samples.
[0155] Described herein are methods, techniques, and systems directed to implementing generative machine learning architectures to produce an indication related to a tumor being present or absent with respect to samples derived from subjects. In one or more examples, a transformer-based machine learning architecture can be implemented to produce characterization data for samples. The characterization data can then be analyzed by a classification computational model to generate a tumor indication for the samples. In various examples, the sample characterization data produced by the transformer-based machine learning architecture can be based on at least one of sequencing data or methylation state data. The methylation state data can correspond to methylation states of cytosines in cytosine-guanine dinucleotides located in specified genomic regions. In one or more additional examples, methylation state data and sequencing data can be preprocessed to produce tokens that are used as input to the transformer-based machine learning architecture.
[0156] The methods and systems described herein are directed to accurately identifying individuals in which a tumor is present. Additionally, the methods and systems described herein can be implemented to detect one or more types of cancer present in subjects. In one or more examples, a generative machine learning architecture can be implemented to analyze sequencing and methylation data obtained from subjects. In various examples, the generative machine learning architecture can produce training data for a classification computational model. The training data for the classification computational model that is produced by the generative machine learning architecture can include activations determined by one or more layers of the generative machine learning architecture for individual samples. That is, the individual training data samples are represented by the activations produced by layers of the generative machine learning architecture. In contrast, existing computational processes are typically trained by characterizing training samples according to a quantitative measure, metric, score, or indication of a biological condition (e.g., presence or absence of a tumor). By characterizing training samples for a classification computational model based on the activations of layers of the generative machine learning architecture rather than typical quantitative measures for characterizing samples, the implementations herein provide results that are more accurate than existing techniques, methods, and systems.
[0157] Figure 1 is a diagrammatic representation of an example computational architecture 100 that implements a generative machine learning model to detect the presence of one or more biological conditions in subjects, according to one or more example implementations. In one or more examples, the biological condition under consideration can be a type of cancer. Non-limiting examples of such cancers include biliary tract cancer, bladder cancer, transitional cell carcinoma, urothelial carcinoma, brain cancer, gliomas, astrocytomas, breast carcinoma, metaplastic carcinoma, cervical cancer, cervical squamous cell carcinoma, rectal cancer, colorectal carcinoma, colon cancer, hereditary nonpolyposis colorectal cancer, colorectal adenocarcinomas, gastrointestinal stromal tumors (GISTs), endometrial carcinoma, endometrial stromal sarcomas, esophageal cancer, esophageal squamous cell carcinoma, esophageal adenocarcinoma, ocular melanoma, uveal melanoma, gallbladder carcinomas, gallbladder adenocarcinoma, renal cell carcinoma, clear cell renal cell carcinoma, transitional cell carcinoma, urothelial carcinomas, Wilms tumor, leukemia, acute lymphocytic leukemia (ALL), acute myeloid leukemia (AML), chronic lymphocytic leukemia (CLL), chronic myeloid leukemia (CML), chronic myelomonocytic leukemia (CMML), liver cancer, liver carcinoma, hepatoma, hepatocellular carcinoma, cholangiocarcinoma, hepatoblastoma, lung cancer, non-small cell lung cancer (NSCLC), small cell lung cancer (SCLC), mesothelioma, B-cell lymphomas, non-Hodgkin lymphoma, diffuse large B-cell lymphoma, mantle cell lymphoma, T cell lymphomas, non-Hodgkin lymphoma, precursor T-lymphoblastic lymphoma / leukemia, peripheral T cell lymphomas, multiple myeloma, nasopharyngeal carcinoma (NPC), neuroblastoma, oropharyngeal cancer, oral cavity squamous cell carcinomas, osteosarcoma, ovarian carcinoma, pancreatic cancer, pancreatic ductal adenocarcinoma, pseudopapillary neoplasms, acinar cell carcinomas, prostate cancer, prostate adenocarcinoma, skin cancer, melanoma, malignant melanoma, cutaneous melanoma, small intestine carcinomas, stomach cancer, gastric carcinoma, gastrointestinal stromal tumor (GIST), uterine cancer, or uterine sarcoma.
[0158] In one or more additional examples, the one or more biological conditions being detected by the computational architecture 100 can be related to identifying diseased tissue in subjects. For example, the one or more biological conditions can be related to diseased lung tissue, diseased liver tissue, diseased kidney tissue, diseased intestinal tissue, diseased skin tissue, diseased stomach tissue, diseased heart tissue, diseased brain tissue, or one or more combinations thereof. The one or more biological conditions can also be related to one or more neurological disorders. In one or more additional examples, the one or more biological conditions can be related to homologous recombination deficiencies. In one or more further examples, the one or more biological conditions can correspond to the response of subjects to one or more treatments. In still other examples, the computational architecture 100 can be implemented to detect the progression of one or more biological conditions, regression of one or more biological conditions, and / or responsiveness to treatment over a period of time.
[0159] The computational architecture 100 can include a sample 102. The sample 102 can be derived from a biological fluid obtained from a subject 104. For example, the sample 102 can be derived from blood obtained from a subject 104. In one or more additional examples, the sample 102 can be derived from tissue of a subject 104. In various examples, the sample 102 can be derived from multiple sources. To illustrate, the sample 102 can be derived from one or more fluids of a subject and / or from tissue of a subject 104. In one or more illustrative examples, the subject 104 can be a mammal. In one or more additional illustrative examples, the subject 104 can be a human. In one or more further illustrative examples, the subject 104 can be a nonhuman mammal. Although the illustrative example of Figure 1 is described in relation to a single sample 102, the implementations described with respect to Figure 1 can include a number of samples 102 obtained from a number of subjects 104. For example, the implementations described with respect to Figure 1 can be performed for hundreds of samples 102 obtained from hundreds of subjects 104, thousands of samples 102 obtained from thousands of subjects 104, tens of thousands of samples 102 obtained from tens of thousands of subjects 104, hundreds of thousands of samples 102 obtained from hundreds of thousands of subjects 104, or more. In at least some examples, a tumor can be present in at least a portion of the subjects 104. In one or more further examples, a tumor may not be detected in at least a portion of the subjects 104. That is, at least a portion of the subjects 104 can be cancer free.
[0160] The sample 102 can include a number of nucleic acids 106. Individual nucleic acids 106 can include a number of regions that have at least a threshold number of cytosine molecules and guanine molecules. In one or more examples, individual nucleic acids 106 can correspond to genomic regions having at least a threshold number of cytosine-guanine dinucleotides. In various examples, at least a portion of the cytosine-guanine pairs included in the regions can be sequentially located in sequences of the nucleic acids 106. In one or more illustrative examples, a region of a nucleic acid having at least a threshold amount of cytosine-guanine pairs can be referred to herein as a “CG region” or a “CpG region.” In one or more examples, a CG region can include at least 200 CpG dinucleotides. In one or more illustrative examples, a CG region can include from 200 CpG dinucleotides to 5000 CpG dinucleotides, from 300 CpG dinucleotides to 3000 CpG dinucleotides, from 200 CpG dinucleotides to 2500 CpG dinucleotides, or from 500 CpG dinucleotides to 1500 CpG dinucleotides.
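As a non-limiting illustration of the threshold described above, the following minimal sketch counts CpG dinucleotides in a sequence string and tests whether the count meets a minimum. The function names are hypothetical, and the default threshold of 200 is taken from the example value given above rather than from any required implementation.

```python
def count_cpg(sequence: str) -> int:
    """Count occurrences of the dinucleotide "CG" read 5' to 3'."""
    seq = sequence.upper()
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")


def is_cg_region(sequence: str, min_cpg: int = 200) -> bool:
    """Return True when the sequence contains at least `min_cpg` CpG dinucleotides."""
    return count_cpg(sequence) >= min_cpg


# Toy sequence for illustration only; it contains three CpG dinucleotides.
toy_count = count_cpg("ACGCGTTCG")
```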
[0161] Individual CG regions can correspond to a number of molecules with one or more methylated cytosines. For example, individual CG regions can include a genomic region with a methylated cytosine. In one or more illustrative examples, the methylated cytosine can be 5-methylcytosine. Individual CG regions can also correspond to a genomic region including an unmethylated cytosine. In various examples, at least a portion of the CG regions of the nucleic acids 106 can correspond to classification regions of a reference genome. Classification regions can correspond to genomic regions of a reference genome that correspond to non-sequence differences that are consistent with one or more biological conditions, such as one or more types of cancer. In at least some examples, the non-sequence differences can include one or more mutations that are consistent with one or more biological conditions. In one or more examples, a classification region can correspond to a genomic region of the reference sequence for which molecules derived from subjects having at least one form of cancer. In at least some examples, nucleic acids having at least a threshold amount of methylated cytosines in at least one CG region (e.g., hypermethylated molecules) can be derived from subjects in which cancer is present and correspond to a classification region. In one or more additional examples, nucleic acid molecules having less than a threshold amount of methylated cytosines (e.g., hypomethylated molecules) in at least one CG region can be derived from subjects in which cancer is present and correspond to a classification region.
[0162] In addition to the classification regions, the CG regions can include one or more positive control regions. The positive control regions can be mapped to nucleic acids having at least a threshold number of methylated cytosines in at least one CG region and that are derived from subjects that are free of cancer and are derived from subjects in which cancer is present. In various examples, the positive control regions can be hypermethylated in cells derived from subjects that are free of cancer and also in cells derived from subjects in which cancer is present. The CG regions can also include one or more negative control regions. The negative control regions can be mapped to nucleic acids having less than a threshold number of methylated cytosines in at least one CG region and that are derived from subjects that are free of cancer and also subjects in which cancer is present. In one or more illustrative examples, the negative control regions can be hypomethylated in subjects that are free of cancer and also in subjects in which cancer is present. In various examples, the positive control regions and the negative control regions can be used to perform normalization calculations. The normalization calculations can be performed to generate input data for one or more models that are implemented to determine tumor metrics for a given sample 102.
[0163] The architecture 100 can include one or more nucleobase methylation state detection processes 108 that are implemented with respect to one or more of the samples 102. The one or more nucleobase methylation state detection processes 108 can include one or more chemical processes and / or biochemical processes that impact a first type of nucleotide differently than a second type of nucleotide. For example, the one or more nucleobase methylation state detection processes 108 can include one or more reactions that cause at least one atomic and / or molecular moiety of the first type of nucleotide to be modified in a manner that is different from the manner in which the one or more reactions affect the second type of nucleotide. In one or more examples, the impact of the one or more nucleobase methylation state detection processes 108 on a given type of nucleotide can be based on one or more previous modifications to the given type of nucleotide in relation to an unmodified form of the given type of nucleotide. That is, in various examples, a molecule corresponding to a given type of nucleotide may have been modified before being subjected to the one or more nucleobase methylation state detection processes 108. To illustrate, before being subjected to the one or more nucleobase methylation state detection processes 108, nucleotides of nucleic acids 106 derived from at least one of the samples 102 can be modified due to mutations caused by the presence of a tumor in a subject. In at least some examples, the one or more nucleobase methylation state detection processes 108 can modify the first type of nucleotide or the second type of nucleotide such that the nucleobase pairing of the first type of nucleotide or the second type of nucleotide is altered.
[0164] In one or more illustrative examples, the one or more nucleobase methylation state detection processes 108 can be performed on nucleic acids 106 included in one or more samples 102. The one or more nucleobase methylation state detection processes 108 can modify a first type of nucleotide of the nucleic acids 106 in a first manner and one or more additional types of nucleotides of the nucleic acids 106 in a second manner. To illustrate, the one or more nucleobase methylation state detection processes 108 can modify at least one of cytosines, guanines, thymines, or adenines differently than at least one other of cytosines, guanines, thymines, or adenines. In at least some examples, the one or more nucleobase methylation state detection processes 108 can modify cytosines differently than guanines, thymines, or adenines. In various examples, the one or more nucleobase methylation state detection processes 108 can modify cytosines such that the modified cytosines no longer pair with guanines. For example, the one or more nucleobase methylation state detection processes 108 can convert cytosines of the nucleic acids 106 included in one or more samples 102 to uracils. In still other examples, the one or more nucleobase methylation state detection processes 108 may not modify cytosines that were methylated prior to being subjected to the one or more nucleobase methylation state detection processes 108. In one or more examples, the one or more nucleobase methylation state detection processes 108 may not modify 5-methylcytosines and / or 5-hydroxymethylcytosines of nucleic acids 106 derived from one or more samples 102. In this way, the one or more nucleobase methylation state detection processes 108 can be used to differentiate cytosines that have been previously modified to include a 5-methyl group versus previously unmodified cytosines.
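As a non-limiting illustration of how a conversion process of this kind distinguishes previously modified cytosines from unmodified cytosines, the following minimal sketch simulates the conversion in software. It is a toy model only: the function names and the representation of protected positions as a set of zero-based indices are assumptions made for illustration, and the sketch does not model incomplete conversion, strand orientation, or sequencing error.

```python
from typing import Set


def convert_unmodified_cytosines(sequence: str, protected_positions: Set[int]) -> str:
    """Replace each cytosine with uracil unless its index is protected.

    `protected_positions` holds zero-based indices of cytosines assumed to be
    5-methylcytosine or 5-hydroxymethylcytosine, which are left unchanged.
    """
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in protected_positions:
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def infer_protected_positions(original: str, converted: str) -> Set[int]:
    """Recover indices where a cytosine in the original survived conversion."""
    return {
        index
        for index, (before, after) in enumerate(zip(original.upper(), converted))
        if before == "C" and after == "C"
    }


# Toy example: only the cytosine at index 1 is treated as methylated.
toy_converted = convert_unmodified_cytosines("ACGTCC", {1})
```

Comparing the converted sequence against the original, as in `infer_protected_positions`, mirrors the way retained cytosines are read out as methylated positions.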
[0165] In one or more examples, the one or more nucleobase methylation state detection processes 108 can include at least one of sodium bisulfite conversion and sequencing, Tet-assisted bisulfite sequencing (TAB-Seq), differential enzymatic cleavage, one or more single molecule sequencing methods, such as nanopore DNA sequencing, oxidative bisulfite (Ox-BS) conversion, APOBEC-coupled epigenetic (ACE) conversion, Enzymatic Methyl Sequencing (EM-Seq), single-enzyme 5-methylcytosine sequencing (SEM-seq), or direct methylation sequencing (DM-Seq).
[0166] In one or more additional examples, the one or more nucleobase methylation state detection processes 108 can include one or more processes that separate nucleic acids 106 based on amounts of nucleotides of the nucleic acids 106 that have been previously modified. For example, the one or more nucleobase methylation state detection processes 108 can determine a methylation rate for one or more regions of the nucleic acids 106 derived from one or more samples 102. In various examples, the one or more nucleobase methylation state detection processes 108 can separate nucleic acid molecules included in the one or more samples 102 based on amounts of methylated cytosines included in CG regions of individual nucleic acids 106. To illustrate, the one or more nucleobase methylation state detection processes 108 can separate the nucleic acids 106 derived from one or more samples 102 into a plurality of groups of nucleic acid molecules with individual groups of nucleic acid molecules corresponding to respective amounts of methylated cytosines of the nucleic acids 106. The one or more nucleobase methylation state detection processes 108 can include at least one of partitioning of nucleic acids 106 included in the one or more samples 102 based on a strength of binding of the individual nucleic acid molecules to methyl binding domain (MBD) and, optionally, treatment with methylation sensitive restriction enzyme (MSRE) and / or methylation dependent restriction enzyme (MDRE). In various examples, a strength of binding of nucleic acids to MBD can be determined by subjecting the nucleic acids to a series of washes having different concentrations of MBD.
[0167] The one or more nucleobase methylation state detection processes 108 can generate methylation state data 110. The methylation state data 110 can indicate positions of nucleic acids 106 derived from one or more samples 102 that include a methylated cytosine. That is, in various examples, the methylation state data 110 can indicate positions of nucleic acid molecules derived from one or more samples 102 where at least one of a 5-methylcytosine and / or a 5-hydroxymethylcytosine is located. For example, the methylation state data 110 can indicate discrete, individual positions of individual nucleic acids 106 derived from one or more samples 102 that include at least one of a 5-methylcytosine and / or a 5-hydroxymethylcytosine. In one or more additional examples, the methylation state data 110 can indicate a group of positions of individual nucleic acids 106 derived from one or more samples 102 that include at least one of a 5-methylcytosine and / or a 5-hydroxymethylcytosine.
[0168] In at least some examples, the nucleobase methylation state detection processes 108 can include one or more sequencing processes. For example, the nucleobase methylation state detection processes 108 can include whole genome bisulfite sequencing, reduced representation bisulfite sequencing, targeted bisulfite sequencing, extended-representation bisulfite sequencing, or one or more combinations thereof. In one or more illustrative examples, whole genome bisulfite sequencing can be performed according to the techniques described in T. Gong et al., “Analysis and performance assessment of the whole genome bisulfite sequencing data workflow: currently available tools and a practical guide to advance DNA methylation studies,” Small Methods, 6:e2101251, 2022. In one or more additional illustrative examples, reduced representation bisulfite sequencing can be performed according to techniques described in Meissner, A., Gnirke, A., Bell, G. W., Ramsahoye, B., Lander, E. S., and Jaenisch, R. (2005). Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic acids research 33, 5868-5877. In one or more further illustrative examples, targeted bisulfite sequencing can be performed according to techniques described in D. A. Moser et al., “Targeted bisulfite sequencing: A novel tool for the assessment of DNA methylation with high sensitivity and increased coverage,” Psychoneuroendocrinology, 120:1-8, 2020 and / or E. Leitao et al., “Locus-specific DNA methylation analysis by targeted deep bisulfite sequencing,” Methods Mol Biol, 1767:351-66, 2018. In still further illustrative examples, extended-representation bisulfite sequencing can be performed according to techniques described in Shareef, S. J., Bevill, S. M., Raman, A. T. et al. Extended-representation bisulfite sequencing of gene regulatory elements in multiplexed samples and single cells. Nat Biotechnol 39, 1086-1094 (2021).
[0169] In at least some implementations, the nucleobase methylation state detection processes 108 can produce sequence representations 112. The sequence representations 112 can be derived from sequencing data produced by one or more sequencing machines. In one or more examples, the sequence representations 112 include alphanumeric representations of the nucleic acids 106 derived from one or more samples 102. For example, the sequence representations 112 can include, for individual nucleic acids, data that corresponds to a string of letters that represents the respective chains of nucleotides that correspond to the individual nucleic acids 106 derived from one or more samples 102. At least one of the methylation state data 110 or the sequence representations 112 can be stored in one or more data files. For example, the sequence representations 112 can be stored in a FASTQ file that comprises a text-based sequencing data file format storing raw sequence data and quality scores. In one or more additional examples, the sequence representations 112 can be stored in a data file according to a binary base call (BCL) sequence file format. In one or more further examples, the sequence representations 112 can be stored in a BAM file. In one or more examples, the sequence representations 112 can comprise at least about one gigabyte (GB), at least about 2 GB, at least about 3 GB, at least about 4 GB, at least about 5 GB, at least about 8 GB, or at least about 10 GB. An individual sequence representation included in sequence representations 112 can be referred to herein as a “read” or a “sequencing read.” In various examples, individual nucleic acids derived from the one or more samples 102 can correspond to multiple sequence representations included in the sequence representations 112 as a result of the amplification of the individual nucleic acids 106 that takes place as part of the one or more nucleobase methylation state detection processes 108. In situations where amplification of nucleic acids is not performed as part of the one or more nucleobase methylation state detection processes 108, individual nucleic acids 106 derived from one or more samples 102 can correspond to a single sequence representation included in the sequence representations 112 as a result of the absence of amplification of the individual nucleic acids 106.
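As a non-limiting illustration of sequence representations stored in a FASTQ file, the following minimal sketch reads four-line FASTQ records (header, sequence, separator, quality string) into simple objects. The names `FastqRecord` and `parse_fastq` are hypothetical, and the decoding of quality characters with an offset of 33 is an assumption (the common Phred+33 encoding) that may not apply to every data file.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: List[int]  # decoded quality scores, one per base


def parse_fastq(lines: Iterable[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry; blank lines between records are skipped."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        read_id = header[1:].split()[0]
        yield FastqRecord(read_id, sequence, [ord(c) - phred_offset for c in quality])


# Toy record for illustration only.
toy_records = list(parse_fastq(["@read_1 example\n", "ACGT\n", "+\n", "IIII\n"]))
```

The function accepts any iterable of text lines, so an open file handle can be passed in place of the list used here.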
[0170] When multiple sequence representations are present in the sequence representations 112 that correspond to a single nucleic acid derived from a sample 102, a number of groups can be generated from the sequence representations with each group corresponding to a single nucleic acid derived from a sample 102. In various examples, the groups of sequence representations included in the sequence representations 112 that correspond to a single nucleic acid can be referred to herein as “families.” In at least some examples, start and stop positions with respect to a reference sequence having a common molecular barcode can be used to determine groups of the sequence representations that correspond to individual nucleic acids. In one or more illustrative examples, an individual sequence representation that represents a family of sequence representations and that corresponds to a single nucleic acid derived from a sample 102 can be referred to herein as a “consensus sequence representation.”
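As a non-limiting illustration of grouping sequence representations into families and deriving a consensus sequence representation, the following minimal sketch groups reads by start position, stop position, and molecular barcode, and then takes a per-position majority vote. The `Read` fields and function names are hypothetical, and the majority vote is one simple way to form a consensus; the description above does not specify how the consensus is computed. Reads in a family are assumed to have equal length.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    start: int      # start position relative to the reference sequence
    stop: int       # stop position relative to the reference sequence
    barcode: str    # molecular barcode
    sequence: str


FamilyKey = Tuple[int, int, str]


def group_into_families(reads: Iterable[Read]) -> Dict[FamilyKey, List[str]]:
    """Group read sequences that share start, stop, and barcode."""
    families: Dict[FamilyKey, List[str]] = defaultdict(list)
    for read in reads:
        families[(read.start, read.stop, read.barcode)].append(read.sequence)
    return dict(families)


def consensus(sequences: List[str]) -> str:
    """Per-position majority vote; ties resolve to the base seen first."""
    if not sequences:
        raise ValueError("a family must contain at least one sequence")
    if any(len(s) != len(sequences[0]) for s in sequences):
        raise ValueError("sequences in a family must have equal length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sequences))


# Toy reads for illustration only.
toy_reads = [
    Read(100, 104, "AAGT", "ACGT"),
    Read(100, 104, "AAGT", "ACGT"),
    Read(100, 104, "AAGT", "ACTT"),
    Read(250, 254, "CCTA", "GGCA"),
]
toy_consensus = {key: consensus(seqs) for key, seqs in group_into_families(toy_reads).items()}
```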
[0171] In one or more examples, consensus sequence representations generated from the sequence representations 112 can be used to generate molecule sequence representations. In various examples, individual molecule sequence representations can correspond to individual nucleic acids 106 derived from one or more samples 102. In at least some examples, the methylation state data 110 can indicate, for at least a portion of the individual positions of the individual sequence representations 112, a methylation state of a nucleotide present at an individual position. In one or more illustrative examples, the methylation state data 110 can indicate one or more individual positions of the individual sequence representations 112 having methylated cytosines.
[0172] The computational architecture 100 can include a computing system 114 that analyzes the methylation state data 110 and the sequence representations 112. In various examples, the computing system 114 can perform at least a portion of the operations described with respect to Figures 2-6. The computing system 114 can include one or more computing devices 116. The one or more computing devices 116 can include at least one of one or more desktop computing devices, one or more mobile computing devices, or one or more server computing devices. In various examples, at least a portion of the one or more computing devices 116 can be included in a remote computing environment, such as a cloud computing environment. In one or more examples, the one or more nucleobase methylation state detection processes 108 and the operations executed by the computing system 114 can be performed by, controlled by, and / or maintained by a single organization. In one or more additional examples, the one or more nucleobase methylation state detection processes 108 and the operations executed by the computing system 114 can be performed by, controlled by, and / or maintained by multiple organizations.
[0173] In one or more examples, the computing system 114 can, at operation 118, analyze the methylation state data 110 and the sequence representations 112 to determine a subset of sequence representations satisfying one or more criteria. For example, the computing system 114 can analyze sequence representations included in at least one of the methylation state data 110 or the sequence representations 112 according to sequence representation criteria 120. The sequence representation criteria 120 can indicate a threshold number of methylated cytosines for a given sequence representation. To illustrate, the sequence representation criteria 120 can indicate that sequence representations having at least 5 methylated cytosines, at least 6 methylated cytosines, at least 7 methylated cytosines, at least 8 methylated cytosines, at least 9 methylated cytosines, at least 10 methylated cytosines, at least 11 methylated cytosines, at least 12 methylated cytosines, at least 13 methylated cytosines, at least 14 methylated cytosines, or at least 15 methylated cytosines are to be identified by the computing system 114. In these scenarios, the sequence representation criteria 120 can correspond to the computing system 114 identifying sequence representations that are hypermethylated. In at least some examples, hypermethylated sequence representations can correspond to one or more partitions of nucleic acids that are produced by the one or more nucleobase methylation state detection processes 108.
[0174] In one or more additional examples, the sequence representation criteria 120 can indicate that sequence representations having no more than 7 methylated cytosines, no more than 6 methylated cytosines, no more than 5 methylated cytosines, no more than 4 methylated cytosines, no more than 3 methylated cytosines, no more than 2 methylated cytosines, or no more than 1 methylated cytosine are to be identified by the computing system 114. In these instances, the sequence representation criteria 120 can correspond to the computing system 114 identifying sequence representations that are hypomethylated. In various examples, hypomethylated sequence representations can correspond to one or more partitions of nucleic acids produced by the one or more nucleobase methylation state detection processes 108.
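As a non-limiting illustration of the threshold criteria described above, the following minimal sketch labels each sequence representation as hypermethylated, hypomethylated, or intermediate based on its count of methylated cytosines. The function name, the label strings, and the default thresholds (at least 5 for hypermethylated, no more than 1 for hypomethylated) are assumptions chosen from among the example values given above.

```python
from typing import Dict


def label_by_methylation(
    methylated_counts: Dict[str, int],
    hyper_min: int = 5,
    hypo_max: int = 1,
) -> Dict[str, str]:
    """Label each read by its number of methylated cytosines.

    `methylated_counts` maps a (hypothetical) read identifier to the number
    of methylated cytosines observed in that sequence representation.
    """
    if hypo_max >= hyper_min:
        raise ValueError("hypo_max must be smaller than hyper_min")
    labels = {}
    for read_id, count in methylated_counts.items():
        if count >= hyper_min:
            labels[read_id] = "hypermethylated"
        elif count <= hypo_max:
            labels[read_id] = "hypomethylated"
        else:
            labels[read_id] = "intermediate"
    return labels


# Toy counts for illustration only.
toy_labels = label_by_methylation({"read_1": 9, "read_2": 0, "read_3": 3})
hyper_subset = [read_id for read_id, label in toy_labels.items() if label == "hypermethylated"]
```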
[0175] In one or more further examples, the sequence representation criteria 120 can indicate a number of enzyme cut sites present in relation to the sequence representations 112. For example, the sequence representation criteria 120 can indicate that sequence representations 112 having at least one enzyme cut site or at least two enzyme cut sites are present to determine a subset of the sequence representations for further analysis by the computing system 114. In one or more illustrative examples, the one or more enzyme cut sites can correspond to MBD cut sites. In one or more additional illustrative examples, the one or more enzyme cut sites can correspond to MSRE cut sites. In still other examples where the nucleobase methylation state detection processes 108 include one or more single site methylation techniques, the sequence representation criteria 120 can indicate that sequence representations 112 corresponding to partially methylated nucleic acids 106 are to be excluded from further analysis by the computing system 114. In this way, the sequence representation criteria 120 are implemented to identify a number of sequence representations that correspond to fully methylated nucleic acids 106. By implementing the sequence representation criteria 120 with respect to the methylation state data 110, the computing system 114 can generate a subset of the sequence representations for additional analysis by the computing system 114.
[0176] The analysis performed at 118 by the computing system 114 can also include classification region criteria 122. The classification region criteria 122 can be implemented to determine classification regions that provide data that can be used to determine biological condition indications with respect to subjects. In one or more examples, the classification region criteria 122 can indicate a correlation between one or more allele fractions for individual classification regions and the presence of one or more biological conditions in subjects. In various examples, the classification region criteria 122 can indicate a correlation between a maximum mutant allele fraction (maxMAF) for classification regions and the presence or absence of one or more biological conditions in subjects. In one or more illustrative examples, the classification region criteria 122 can include a predetermined group of classification regions to be used to produce data that is analyzed to determine a biological condition indication. In one or more additional illustrative examples, the classification region criteria 122 can indicate a level of correlation between one or more allele fractions and individual classification regions with regard to identifying subjects in which a biological condition is present.
[0177] The analysis, at 118, of at least one of the methylation state data 110 or the sequence representations 112 with respect to the sequence representation criteria 120 and / or the classification region criteria 122 can produce selected classification region data 124. The selected classification region data 124 can include at least one of the methylation state data 110 or the sequence representations 112 derived from one or more samples 102 and that satisfy the sequence representation criteria 120 and the classification region criteria 122. Based on the selected classification region data 124, the computing system 114 can, at 126, determine quantitative measures 128 for the selected classification regions that correspond to the classification region criteria 122. In one or more examples, the quantitative measures 128 can include at least one of a number of nucleic acids derived from one or more samples 102 or a number of sequence representations 112 that correspond to the classification regions that satisfy the classification region criteria 122.
[0178] In various examples, the quantitative measures 128 can include, for an individual classification region, normalized log transformed values corresponding to a number of nucleic acids or a number of sequence representations that overlap with the individual classification region and that are derived from one or more samples 102. In at least some examples, the quantitative measures 128 can be normalized with respect to a number of nucleic acids or a number of sequence representations derived from one or more samples 102 that correspond to positive control regions. The positive control regions can correspond to genomic regions that include at least a threshold number of CpGs with one or more methylation states for subjects in which a biological condition is present and for subjects in which the biological condition is absent. In one or more illustrative examples, the positive control regions can correspond to genomic regions having at least a threshold number of CpGs having at least a threshold number of methylated cytosines in subjects in which a biological condition is present and subjects in which the biological condition is absent. The quantitative measures 128 can also be generated using a pseudocount. In at least some examples, the quantitative measures 128 for an individual classification region can be determined according to the following:
$$x = \log\left(\frac{\text{region counts}}{\text{total positive control counts}} + 10^{-5}\right) \quad \text{(Equation 1)}$$
In the equation, the region counts can correspond to a first number of sequence representations 112 that are derived from one or more samples 102 and that overlap with the individual classification region. The total positive control counts can correspond to a second number of sequence representations 112 that are derived from the one or more samples 102 and that overlap with the positive control regions. The term $10^{-5}$ is a pseudocount used to calculate the quantitative measures 128.
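As a hedged illustration of Equation 1, the sketch below reads the equation as the logarithm of the region counts divided by the total positive control counts, plus the pseudocount. The function name, the base-10 logarithm, and the example counts are assumptions; the text names the pseudocount but does not state the logarithm base.

```python
import math

PSEUDOCOUNT = 1e-5  # the 10^-5 pseudocount named in the text


def quantitative_measure(region_counts, positive_control_counts, log=math.log10):
    """Normalized, log-transformed count for one classification region.

    The base-10 logarithm is an assumption; pass a different `log`
    callable (for example math.log) to use another base.
    """
    if positive_control_counts <= 0:
        raise ValueError("positive control counts must be positive")
    return log(region_counts / positive_control_counts + PSEUDOCOUNT)


# Hypothetical counts: 50 molecules overlap the region, 1000 overlap the controls.
x = quantitative_measure(50, 1000)
# With zero region counts the pseudocount keeps the logarithm defined.
x_empty = quantitative_measure(0, 1000)
print(x, x_empty)
```

The pseudocount matters mainly for regions with no overlapping molecules, where the ratio alone would make the logarithm undefined.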
[0179] The computing system 114 can, at 130, determine tokens 132 for the selected classification regions based on the quantitative measures 128. For example, the computing system 114 can generate a token 132 for individual classification regions that satisfy the classification region criteria 122 based on the values of the quantitative measure 128 that correspond to the individual classification regions. In one or more illustrative examples, individual tokens 132 can correspond to a partition of a group of partitions that represent individual classification regions. The partitions for individual classification regions can represent a range of values of quantitative measures for the individual classification region. Individual tokens 132 can be assigned to individual partitions. In one or more illustrative examples, the individual tokens 132 can include integer values that represent a range of quantitative measure values. The quantitative measure values can include rational numbers. A token 132 can be determined for an individual classification region by analyzing a quantitative measure 128 for the individual classification region with respect to the quantitative measure values corresponding to individual partitions. In response to determining a partition associated with a range of quantitative measure values that correspond to a given quantitative measure 128 for an individual classification region, the computing system 114 can determine a value of a token 132 for the individual classification region.
[0180] The tokens 132 can be input to a generative machine learning model 134. In various examples, the range of values for the individual partitions that correspond to individual classification regions and the token values that correspond to the individual partitions can be determined during a training process of the generative machine learning model 134. In one or more examples, the generative machine learning model 134 can include an autoregressive machine learning model. For example, the generative machine learning model 134 can include a transformer-based machine learning model.
[0181] The generative machine learning model 134 can include an artificial neural network including an interconnected assembly of neurons that are arranged in a number of computational blocks. Individual computational blocks can include a number of layers. The layers of the computational blocks can include an input layer, one or more hidden layers, and an output layer. Neurons included in the computational blocks receive input signals from a preceding layer and execute an activation function to produce an output that is transmitted to neurons in a subsequent layer. Signals are communicated between layers of the generative machine learning model 134 by synapses. The synapses can have a weight that is modified and tuned during a training process for the generative machine learning model. In various examples, individual neurons of a computational block can obtain weights of synapses connected to the individual neurons and apply the activation function of the neuron to the weights in order to generate an output that is provided to one or more neurons of an additional layer of the generative machine learning model 134.
[0182] The output of the generative machine learning model 134 can include activations 136. The activations 136 can include values determined by the neurons of the computational blocks of the generative machine learning model 134. For example, the activations 136 can include values of the neurons of the output layer of individual computational blocks of the generative machine learning model 134, where the values of the neurons of the output layer can be determined by applying an activation function to the input values obtained from the neurons of the preceding layer and weights corresponding to the connections between the neurons of the output layer and the neurons of the preceding layer. In this way, features of the individual samples 102 can be represented by the activations 136 produced by the generative machine learning model 134. In at least some examples, the generative machine learning model 134 can produce first activations that are indicative of samples obtained from subjects in which a biological condition is present and second activations that are indicative of samples obtained from subjects in which the biological condition is absent.
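The way a layer's output values arise from the preceding layer's values and the connection weights can be sketched as follows. This is a generic fully connected layer written for illustration, not the architecture of the generative machine learning model 134: the tanh activation, the bias terms, and all numeric values are assumptions, and a transformer block contains further operations (such as attention) that are omitted here.

```python
import math


def layer_forward(inputs, weights, biases, activation=math.tanh):
    """One dense layer: each neuron applies its activation function to the
    weighted sum of the values received from the preceding layer."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


# Two input values feeding two neurons (hypothetical weights and biases).
inputs = [1.0, 0.0]
weights = [[0.5, -0.5], [0.0, 1.0]]
biases = [0.0, 0.0]
activations = layer_forward(inputs, weights, biases)
print(activations)  # [tanh(0.5), tanh(0.0)]
```

In this picture, the list returned for a given layer plays the role of the activations that are later handed to a downstream classifier.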
[0183] Additionally, the computing system 114 can implement a classification computational model 138. The input to the classification computational model 138 can include the activations 136 produced by the generative machine learning model 134. The classification computational model 138 can determine one or more classifications with respect to the samples 102. For example, the classification computational model 138 can determine a biological condition indication 140 for subjects 104 that provided the samples 102. In one or more illustrative examples, the biological condition indication 140 can correspond to a binary value corresponding to a biological condition being present in a subject 104 or a biological condition being absent from the subject 104. In one or more additional illustrative examples, the biological condition indication 140 can indicate a type of biological condition present in a subject 104. In scenarios where the biological condition indication 140 is related to cancer, the biological condition indication 140 can correspond to the presence or absence of one or more types of cancer with respect to subjects 104. In one or more further illustrative examples, the biological condition indication 140 can correspond to a probability of a biological condition being present in a subject 104. The biological condition indication 140 can also correspond to other metrics related to the biological condition. To illustrate, in instances where the biological condition indication 140 is related to cancer, the biological condition indication 140 can include a tumor fraction.
[0184] Further, the biological condition indication 140 can correspond to one or more biomarkers present in subjects. In various examples, the biological condition indication 140 can correspond to the presence of one or more biomarkers that are indicative of one or more cancer types or subtypes. The biological condition indication 140 can also correspond to one or more genomic mutations being present in nucleic acids included in samples obtained from subjects. For example, the biological condition indication 140 can correspond to the presence or absence of one or more single nucleotide variants (SNVs), one or more copy number variants or variations (CNVs) / aberrations, one or more insertions or deletions (indels), one or more gene fusions, one or more transversions, one or more translocations, one or more frame shifts, one or more duplications, one or more repeat expansions, one or more epigenetic variants, or one or more combinations thereof. In still other examples, the biological condition indication 140 can correspond to a responsiveness of subjects to one or more treatments administered for one or more biological conditions. In addition, the biological condition indication 140 can correspond to a level of regression or a level of progression of a biological condition.
[0185] In various examples, the classification computational model 138 can implement a Random Forest algorithm. Additionally, the classification computational model 138 can implement a Naive Bayes algorithm. Further, the classification computational model 138 can implement a K-nearest Neighbors algorithm. The classification computational model 138 can also implement a gradient boosting algorithm. In still other examples, the classification computational model 138 can implement a support vector machine. In at least some examples, the classification computational model 138 can implement a logistic regression algorithm. In one or more examples, the classification computational model 138 can implement computational techniques that are different from those of the generative machine learning model 134. In one or more illustrative examples, the classification computational model 138 can include one or more additional computational blocks of the generative machine learning model 134.
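Of the options listed, logistic regression is the simplest to sketch. The example below shows how activations 136 could be mapped to a probability and a binary indication. The function names, the weights, the bias, and the 0.5 decision threshold are hypothetical values chosen for illustration; in practice the weights would be fitted on labeled training samples.

```python
import math


def predict_probability(activations, weights, bias):
    """Logistic regression: sigmoid of a weighted sum of the activations."""
    z = sum(w * a for w, a in zip(weights, activations)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def biological_condition_indication(activations, weights, bias, threshold=0.5):
    """Return a probability and a binary present/absent call.

    The threshold is an assumption, not a value given in the text.
    """
    probability = predict_probability(activations, weights, bias)
    return {"probability": probability, "present": probability >= threshold}


# Hypothetical activations for one sample and hypothetical fitted parameters.
result = biological_condition_indication([0.2, -0.1], [2.0, 1.0], 0.0)
print(result)
```

The same interface could wrap any of the other listed classifiers, since each takes a feature vector and returns a class or a probability.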
[0186] In one or more illustrative examples, a subject 104 can provide a sample 102 and nucleic acids 106 can be extracted from the sample 102. The nucleic acids 106 can undergo one or more nucleobase methylation state detection processes 108 to determine methylation states of cytosines included in CpG regions of the nucleic acids 106. The nucleic acids 106 can also be subjected to one or more sequencing operations that produce sequence representations 112 that correspond to the nucleic acids 106. The methylation state data 110 and the sequence representations 112 can be analyzed with respect to the sequence representation criteria 120 and the classification region criteria 122 to determine selected classification region data 124 for the sample 102. In one or more examples, the selected classification region data 124 for the sample 102 can include a subset of the sequence representations 112 that satisfy the sequence representation criteria 120 and the classification region criteria 122. In these scenarios, the subset of the sequence representations 112 included in the selected classification region data 124 can then be used to determine quantitative measures 128 for individual classification regions. The quantitative measures 128 can then be transformed to a suitable input for the generative machine learning model 134 by producing individual tokens 132 for individual classification regions according to the group of the quantitative measures 128 that correspond to the individual classification regions. The tokens 132 can be provided as input to the generative machine learning model 134 that generates activations 136 based on the tokens 132. The activations 136 can correspond to values determined by neurons of the computational blocks of the generative machine learning model 134 that characterize the sample 102. The activations 136 can be provided to the classification computational model 138 that determines the biological condition indication 140 for the sample.
[0187] Figure 2 is a diagrammatic representation of an example framework 200 to produce tokens for a generative machine learning model based on classification region quantitative measures, according to one or more example implementations. In one or more examples, the generative machine learning model can include a large language machine learning model. In at least some examples, the generative machine learning model can include a transformer-based machine learning model.
[0188] In existing systems, large language models can be implemented with input data that includes a number of tokens. The tokens can include integer values that are representative of words, subwords, phrases, symbols, one or more characters, or one or more combinations thereof, that are included in a language, dictionary, or vocabulary related to the large language machine learning model. In various examples, the large language model can be executed to predict one or more next words, subwords, characters, symbols, and / or phrases in a series of words, subwords, characters, symbols and / or phrases based on a preceding set of words, subwords, characters, symbols, and / or phrases. The tokenization process can include mapping features of the language, dictionary, and / or vocabulary of the large language machine learning model to a numerical token identifier.
[0189] Implementations herein are different from existing large language machine learning models because rather than predicting features of a vocabulary, dictionary, or language, the systems and methods herein are directed to predicting quantitative measures corresponding to classification regions based on methylation data and sequencing data derived from a sample. As a result, the tokenization process implemented by systems and methods herein maps quantitative measure values for individual classification regions to token identifiers. In addition, the tokenization process performed by systems and methods described herein transforms the methylation data and sequencing data obtained from a sample into data that can be processed by a large language machine learning model. Tokens can also be generated for the individual classification regions. Without using the tokenization process, the input to the large language machine learning model may be different from the input that the large language machine learning model is designed to analyze and process. As a result, without the tokenization process performed by systems and methods described herein, the accuracy of the large language machine learning models implemented herein may be reduced. Further, without performing the implementations of the tokenization described herein, additional memory resources and processing resources may be utilized to execute the large language machine learning model that would not have been utilized had the implementations of the tokenization process described herein been applied. Additionally, the tokenization process described herein has been tailored to the analysis of bioinformatics data, such as methylation data and sequencing data, using a large language machine learning model.
In one or more examples, the implementations of the tokenization process described herein are applied to reduce the number of memory resources and processing resources utilized to apply large language machine learning models to the analysis of bioinformatics data while also improving the accuracy of the large language machine learning models.
[0190] In at least some examples, the tokenization process implemented according to the framework 200 can be performed by the computing system 114 described in relation to Figure 1. The framework 200 can implement a tokenization process based on quantitative measures for a number of classification regions. For example, the framework 200 can include first classification region quantitative measures 202, second classification region quantitative measures 204, up to Nth classification region quantitative measures 206. In various examples, the framework 200 can be implemented with respect to quantitative measures generated from sequencing data corresponding to at least 5 classification regions, at least 10 classification regions, at least 25 classification regions, at least 50 classification regions, at least 100 classification regions, at least 250 classification regions, at least 500 classification regions, at least 1000 classification regions, at least 2500 classification regions, or more.
[0191] In one or more examples, the classification region quantitative measures 202, 204, 206 can include numerical values representing a number of sequence representations derived from one or more samples that overlap with a given classification region. The number of sequence representations that are included in the classification region quantitative measures 202, 204, 206 can indicate a number of nucleic acids derived from one or more samples. Additionally, the number of sequence representations that are included in the classification region quantitative measures 202, 204, 206 can indicate a number of sequencing reads derived from nucleic acids included in one or more samples. Overlap between sequence representations and classification regions can be determined by aligning the sequence representations with a reference genome that includes the classification regions and determining a number of nucleotides of the sequence representations that correspond to a series of nucleotides in the classification regions. A sequence representation can be identified with respect to a classification region in instances where at least 80%, at least 85%, at least 90%, at least 95%, at least 98%, at least 99%, at least 99.5%, or at least 99.9% of the nucleotides of the sequence representation correspond to a series of nucleotides in the classification region. In one or more illustrative examples, the classification region quantitative measures 202, 204, 206 can include values that are determined according to Equation 1, as explained previously.
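The overlap test can be illustrated with aligned coordinates. The sketch below assumes that each sequence representation has already been aligned to the reference genome and is described by half-open start and end positions on the same chromosome as the classification region; that representation, the function names, and the example coordinates are assumptions made for the example.

```python
def overlap_fraction(read_start, read_end, region_start, region_end):
    """Fraction of the read's aligned nucleotides that fall inside the region.

    Coordinates are half-open intervals [start, end) on one chromosome.
    """
    read_length = read_end - read_start
    if read_length <= 0:
        raise ValueError("read must have positive length")
    shared = max(0, min(read_end, region_end) - max(read_start, region_start))
    return shared / read_length


def count_overlapping(reads, region, min_fraction=0.9):
    """Count reads whose overlap fraction meets the threshold (e.g. at least 90%)."""
    region_start, region_end = region
    return sum(
        1 for start, end in reads
        if overlap_fraction(start, end, region_start, region_end) >= min_fraction
    )


region = (1000, 1500)            # hypothetical classification region
reads = [
    (1000, 1150),                # fully inside: fraction 1.0
    (950, 1100),                 # 100 of 150 nucleotides inside
    (1400, 1560),                # 100 of 160 nucleotides inside
    (1200, 1360),                # fully inside: fraction 1.0
]
region_counts = count_overlapping(reads, region)
print(region_counts)
```

The resulting count is the kind of value that would serve as the region counts when computing a quantitative measure for the region.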
[0192] In at least some examples, the quantitative measures included in the classification region quantitative measures 202, 204, 206 can also correspond to sequence representations having one or more methylation characteristics. For example, the quantitative measures included in the classification region quantitative measures 202, 204, 206 can be generated from sequence representations having a number of cytosines included in one or more CpG regions that satisfy one or more methylation criteria. The one or more methylation criteria can indicate that sequence representations having at least a threshold number of methylated cytosines in one or more CpG regions are used to determine the classification region quantitative measures 202, 204, 206. In one or more additional examples, the one or more methylation criteria can indicate that sequence representations having no greater than an additional threshold number of methylated cytosines in one or more CpG regions are used to determine the classification region quantitative measures 202, 204, 206.
[0193] In still other examples, the classification region quantitative measures 202, 204, 206 can correspond to sequence representations having a threshold number of nucleotides. To illustrate, the classification region quantitative measures 202, 204, 206 can correspond to sequence representations having at least 50 nucleotides, at least 60 nucleotides, at least 70 nucleotides, at least 80 nucleotides, at least 90 nucleotides, at least 100 nucleotides, at least 110 nucleotides, at least 120 nucleotides, at least 130 nucleotides, at least 140 nucleotides, at least 150 nucleotides, at least 160 nucleotides, at least 170 nucleotides, at least 180 nucleotides, at least 190 nucleotides, or at least 200 nucleotides. In various examples, the classification region quantitative measures 202, 204, 206 can correspond to sequence representations having no greater than 500 nucleotides, no greater than 475 nucleotides, no greater than 450 nucleotides, no greater than 425 nucleotides, no greater than 400 nucleotides, no greater than 375 nucleotides, no greater than 350 nucleotides, no greater than 325 nucleotides, no greater than 300 nucleotides, no greater than 275 nucleotides, or no greater than 250 nucleotides. 
In one or more illustrative examples, the classification region quantitative measures 202, 204, 206 can correspond to sequence representations having from 50 nucleotides to 500 nucleotides, from 100 nucleotides to 400 nucleotides, from 50 nucleotides to 250 nucleotides, from 100 nucleotides to 300 nucleotides, from 150 nucleotides to 350 nucleotides, from 200 nucleotides to 400 nucleotides, from 250 nucleotides to 450 nucleotides, from 300 nucleotides to 500 nucleotides, from 50 nucleotides to 150 nucleotides, from 100 nucleotides to 200 nucleotides, from 150 nucleotides to 250 nucleotides, from 200 nucleotides to 300 nucleotides, from 250 nucleotides to 350 nucleotides, from 300 nucleotides to 400 nucleotides, from 350 nucleotides to 450 nucleotides, or from 400 nucleotides to 500 nucleotides. In one or more additional illustrative examples, the sequence representation criteria 120 can be applied to determine the sequence representations used to determine the classification region quantitative measures 202, 204, 206. Further, the classification region criteria 122 can be used to determine the classification regions that are included in the framework 200.
[0194] In one or more examples, the first classification region quantitative measures 202 can indicate a number of sequence representations derived from one or more samples that have at least a threshold amount of overlap with a first classification region. The first classification region quantitative measures 202 can include first classification region training data 208. The first classification region training data 208 can include quantitative measures derived from training samples that have at least a threshold amount of overlap with the first classification region. The training samples can be obtained from first subjects in which one or more biological conditions are present and second subjects in which the one or more biological conditions are not present. In at least some examples, the first classification region training data 208 can include unlabeled training data that does not differentiate between subjects in which the one or more biological conditions are present and subjects in which the one or more biological conditions are absent.
[0195] The first classification region quantitative measures 202 can also include first classification region sample data 210. The first classification region sample data 210 can include sequence representations that are derived from at least one test sample and that have at least a threshold amount of overlap with the first classification region. The at least one test sample can be obtained from a test subject that is being tested for one or more biological conditions. In at least some examples, the test subject can be tested for the one or more biological conditions using an assay that is performed in relation to the at least one sample provided by the test subject.
[0196] The framework 200 can include analyzing the first classification region quantitative measures 202 with respect to a first group of partitions 212. In one or more examples, the first group of partitions 212 can include at least 5 partitions, at least 10 partitions, at least 25 partitions, at least 50 partitions, at least 75 partitions, at least 100 partitions, at least 125 partitions, at least 150 partitions, at least 175 partitions, at least 200 partitions, at least 225 partitions, or at least 250 partitions. In one or more illustrative examples, the first group of partitions 212 can include from 5 partitions to 5000 partitions, from 10 partitions to 2500 partitions, from 25 partitions to 1000 partitions, from 50 partitions to 500 partitions, from 100 partitions to 300 partitions, from 150 partitions to 350 partitions, from 200 partitions to 400 partitions, or from 250 partitions to 450 partitions.
[0197] Individual partitions of the first group of partitions 212 can correspond to a range of values of the first classification region quantitative measures 202. In one or more examples, individual partitions of the first group of partitions 212 can correspond to a range of values of quantitative measures included in the first classification region training data 208. For example, quantitative measures included in the first classification region training data 208 can have a number of numerical values from a maximum value to a minimum value. The quantitative measures from the maximum value to the minimum value can be divided such that individual values for an individual partition correspond to a subset of values from the maximum value to the minimum value. In at least some examples, the values from the maximum value to the minimum value can be divided evenly between the partitions. In these examples, the individual partitions can correspond to a same range of values. In one or more illustrative examples, the number of partitions can be pre-specified and the bin boundaries can be determined based on the range of values of classification region scores and the number of partitions. In one or more additional illustrative examples, values of the quantitative measures can be ranked and individual partitions can be associated with one or more of the ranked quantitative measures.
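The two ways of forming partitions described above can be sketched as two ways of computing bin boundaries from training values. The function names and the toy training values are assumptions. The rank-based variant is one possible reading of associating partitions with ranked quantitative measures (each partition receives roughly the same number of ranked values); the text does not fix the exact rule.

```python
def equal_width_edges(values, n_partitions):
    """Boundaries that split [min, max] into partitions of the same width."""
    if n_partitions < 1:
        raise ValueError("need at least one partition")
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_partitions
    return [lo + i * width for i in range(n_partitions + 1)]


def rank_based_edges(values, n_partitions):
    """Boundaries taken from the ranked values so that each partition holds
    roughly the same number of training values (an assumed interpretation)."""
    if n_partitions < 1:
        raise ValueError("need at least one partition")
    ranked = sorted(values)
    n = len(ranked)
    return [
        ranked[min(n - 1, (i * n) // n_partitions)]
        for i in range(n_partitions + 1)
    ]


# Toy training values for one classification region, with one outlier.
training = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 8.0]
even = equal_width_edges(training, 4)
ranked = rank_based_edges(training, 4)
print(even)    # [0.0, 2.0, 4.0, 6.0, 8.0]
print(ranked)  # [0.0, 1.0, 2.0, 3.0, 8.0]
```

With skewed data, equal-width boundaries leave some partitions nearly empty, whereas rank-based boundaries spread the training values more evenly across token values.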
[0198] Individual partitions of the first group of partitions 212 can correspond to an individual token value. In the illustrative example of Figure 2, a subset of the group of partitions includes a first partition 214, a second partition 216, a third partition 218, a fourth partition 220, and a fifth partition 222. The individual partitions 214, 216, 218, 220, 222 can correspond to respective token values. In one or more examples, the token values corresponding to the individual partitions of the first group of partitions 212 can include integer values. In one or more additional examples, the token values corresponding to the individual partitions of the first group of partitions 212 can include consecutive integer values.
[0199] The token value for a given quantitative measure can be determined by determining the partition of the first group of partitions 212 that corresponds to the value of the given quantitative measure. In response to determining a partition that corresponds to the value of a quantitative measure, a first classification region token 224 can be assigned to the quantitative measure. The first classification region token 224 can be provided as input to a transformer-based machine learning model that analyzes at least one of sequencing data or methylation data derived from one or more samples. In the illustrative example of Figure 2, a quantitative measure 226 can correspond to the third partition 218. In these scenarios, the quantitative measure 226 can correspond to a token value of the third partition 218. The quantitative measure 226 can be included in the first classification region training data 208 or the first classification region sample data 210.
[0200] In one or more illustrative examples, the first classification region training data 208 can include quantitative measures having values from 0.25 to 7.75. In implementations where the group of partitions includes 100 partitions, individual partitions can represent a range of 0.075. In these situations, an initial partition can correspond to quantitative measures having values from 0.250 to 0.325 and a next partition can correspond to quantitative measures having values from 0.325 to 0.400. In the illustrative example of Figure 2, the third partition 218 can correspond to quantitative measures having values from 1.500 to 1.575 and the quantitative measure 226 can have a value of 1.533. In various examples, the third partition 218 can have a token value of 20 and, as a result, the first classification region token 224 assigned to the quantitative measure 226 can be 20.
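The lookup from a quantitative measure to a token value can be sketched with a sorted list of partition boundaries and a binary search. The sketch uses the range (0.25 to 7.75) and partition count (100) from the example above, but the zero-based consecutive token numbering, the half-open partitions, and the clamping of out-of-range values are assumptions. The partition indices it produces follow from that assumed scheme and are not meant to reproduce the illustrative partition bounds or the token value of 20.

```python
from bisect import bisect_right


def make_edges(lo, hi, n_partitions):
    """Equal-width partition boundaries over [lo, hi]."""
    width = (hi - lo) / n_partitions
    return [lo + i * width for i in range(n_partitions + 1)]


def token_for(value, edges, first_token=0):
    """Token of the partition [edges[i], edges[i + 1]) that contains `value`.

    Values outside the training range are clamped to the first or last
    partition (an assumption; the text does not say how they are handled).
    """
    n_partitions = len(edges) - 1
    index = bisect_right(edges, value) - 1
    index = max(0, min(n_partitions - 1, index))
    return first_token + index


edges = make_edges(0.25, 7.75, 100)    # each partition spans 0.075
token = token_for(1.533, edges)        # partition index under this scheme
print(edges[1] - edges[0], token)
```

Because the boundaries are derived from training data, a test sample is tokenized with the same `edges` list rather than with boundaries recomputed from its own values.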
[0201] In one or more additional examples, the second classification region quantitative measures 204 can indicate a number of sequence representations derived from one or more samples that have at least a threshold amount of overlap with a second classification region. The second classification region quantitative measures 204 can include second classification region training data 228. The second classification region training data 228 can include quantitative measures derived from training samples that have at least a threshold amount of overlap with the second classification region. The training samples can be obtained from first subjects in which one or more biological conditions are present and second subjects in which the one or more biological conditions are not present. In at least some examples, the second classification region training data 228 can include unlabeled training data that does not differentiate between subjects in which the one or more biological conditions are present and subjects in which the one or more biological conditions are absent.
[0202] The second classification region quantitative measures 204 can also include second classification region sample data 230. The second classification region sample data 230 can include sequence representations that are derived from at least one test sample and that have at least a threshold amount of overlap with the second classification region. The at least one test sample can be obtained from a test subject that is being tested for one or more biological conditions. In at least some examples, the test subject can be tested for the one or more biological conditions using an assay that is performed in relation to the at least one sample provided by the test subject.
[0203] The framework 200 can include analyzing the second classification region quantitative measures 204 with respect to a second group of partitions 232. In one or more examples, the second group of partitions 232 can include at least 5 partitions, at least 10 partitions, at least 25 partitions, at least 50 partitions, at least 75 partitions, at least 100 partitions, at least 125 partitions, at least 150 partitions, at least 175 partitions, at least 200 partitions, at least 225 partitions, or at least 250 partitions. In one or more illustrative examples, the second group of partitions 232 can include from 5 partitions to 5000 partitions, from 10 partitions to 2500 partitions, from 25 partitions to 1000 partitions, from 50 partitions to 500 partitions, from 100 partitions to 300 partitions, from 150 partitions to 350 partitions, from 200 partitions to 400 partitions, or from 250 partitions to 450 partitions. In at least some examples, the second group of partitions 232 can include a same number of partitions as the first group of partitions 212. In one or more further examples, the second group of partitions 232 can include a different number of partitions than the first group of partitions 212.
[0204] Individual partitions of the second group of partitions 232 can correspond to a range of values of the second classification region quantitative measures 204. In one or more examples, individual partitions of the second group of partitions 232 can correspond to a range of values of quantitative measures included in the second classification region training data 228. For example, quantitative measures included in the second classification region training data 228 can have a number of numerical values from a maximum value to a minimum value. The quantitative measures from the maximum value to the minimum value can be divided such that individual values for an individual partition correspond to a subset of values from the maximum value to the minimum value. In at least some examples, the values from the maximum value to the minimum value can be divided evenly between the partitions. In these examples, the individual partitions can correspond to a same range of values.
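By way of illustration only, the per-region derivation of partitions from training data, followed by tokenization of a test sample, can be sketched as follows. The region names, the training values, the partition counts, and the helper names `fit_partitions` and `tokenize` are hypothetical, and the sketch assumes evenly divided partitions with zero-based consecutive integer token values.

```python
def fit_partitions(training_values, num_partitions):
    """Return (minimum, width) of equal-width partitions over the training range."""
    low, high = min(training_values), max(training_values)
    return low, (high - low) / num_partitions


def tokenize(value, low, width, num_partitions):
    """Map a value to the zero-based index of its partition, clamped to the range."""
    if width == 0:  # every training value was identical
        return 0
    index = int((value - low) // width)
    return max(0, min(num_partitions - 1, index))


# Hypothetical training measures and partition counts for two classification regions.
training_by_region = {
    "region_1": [0.25, 1.5, 3.0, 7.75],
    "region_2": [2.0, 4.0, 6.0, 12.0],
}
partitions_per_region = {"region_1": 10, "region_2": 20}

fitted = {
    region: fit_partitions(values, partitions_per_region[region])
    for region, values in training_by_region.items()
}

# Hypothetical quantitative measures for one test sample.
test_sample = {"region_1": 1.533, "region_2": 11.5}
tokens = {
    region: tokenize(value, *fitted[region], partitions_per_region[region])
    for region, value in test_sample.items()
}
print(tokens)
```

In this sketch, each classification region keeps its own group of partitions, so the same numerical value can map to different token values in different regions.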
[0205] Individual partitions of the second group of partitions 232 can correspond to an individual token value. In the illustrative example of Figure 2, a subset of the second group of partitions 232 includes an additional first partition 234, an additional second partition 236, an additional third partition 238, an additional fourth partition 240, and an additional fifth partition 242. In various examples, one or more of the additional partitions 234, 236, 238, 240, 242 can correspond to one or more of the partitions 214, 216, 218, 220, 222. In still other examples, one or more of the additional partitions 234, 236, 238, 240, 242 can be different from one or more of the partitions 214, 216, 218, 220, 222. The individual additional partitions 234, 236, 238, 240, 242 can correspond to respective token values. In one or more examples, the token values corresponding to the individual additional partitions of the second group of partitions 232 can include integer values. In one or more additional examples, the token values corresponding to the individual additional partitions of the second group of partitions 232 can include consecutive integer values.
[0206] The token value for a given quantitative measure of the second classification region quantitative measures 204 can be determined by determining the additional partition of the second group of partitions 232 that corresponds to the value of the given quantitative measure. In response to determining an additional partition that corresponds to the value of a quantitative measure, a second classification region token 244 can be assigned to the quantitative measure. The second classification region token 244 can be provided as input to a transformer-based machine learning model that analyzes at least one of sequencing data or methylation data derived from one or more samples. In the illustrative example of Figure 2, an additional quantitative measure 246 can correspond to the additional first partition 234. In these scenarios, the additional quantitative measure 246 can correspond to a token value of the additional first partition 234. The additional quantitative measure 246 can be included in the second classification region training data 228 or the second classification region sample data 230.
[0207] In one or more further examples, the Nth classification region quantitative measures 206 can indicate a number of sequence representations derived from one or more samples that have at least a threshold amount of overlap with an Nth classification region. The Nth classification region quantitative measures 206 can include Nth classification region training data 248. The Nth classification region training data 248 can include quantitative measures derived from training samples that have at least a threshold amount of overlap with the Nth classification region. The training samples can be obtained from first subjects in which one or more biological conditions are present and second subjects in which the one or more biological conditions are not present. In at least some examples, the Nth classification region training data 248 can include unlabeled training data that does not differentiate between subjects in which the one or more biological conditions are present and subjects in which the one or more biological conditions are absent.
[0208] The Nth classification region quantitative measures 206 can also include Nth classification region sample data 250. The Nth classification region sample data 250 can include sequence representations that are derived from at least one test sample and that have at least a threshold amount of overlap with the Nth classification region. The at least one test sample can be obtained from a test subject that is being tested for one or more biological conditions. In at least some examples, the test subject can be tested for the one or more biological conditions using an assay that is performed in relation to the at least one sample provided by the test subject.
[0209] The framework 200 can include analyzing the Nth classification region quantitative measures 206 with respect to a third group of partitions 252. In one or more examples, the third group of partitions 252 can include at least 5 partitions, at least 10 partitions, at least 25 partitions, at least 50 partitions, at least 75 partitions, at least 100 partitions, at least 125 partitions, at least 150 partitions, at least 175 partitions, at least 200 partitions, at least 225 partitions, or at least 250 partitions. In one or more illustrative examples, the third group of partitions 252 can include from 5 partitions to 5000 partitions, from 10 partitions to 2500 partitions, from 25 partitions to 1000 partitions, from 50 partitions to 500 partitions, from 100 partitions to 300 partitions, from 150 partitions to 350 partitions, from 200 partitions to 400 partitions, or from 250 partitions to 450 partitions. In at least some examples, the third group of partitions 252 can include a same number of partitions as at least one of the first group of partitions 212 or the second group of partitions 232. In one or more further examples, the third group of partitions 252 can include a different number of partitions than at least one of the first group of partitions 212 or the second group of partitions 232.
[0210] Individual partitions of the third group of partitions 252 can correspond to a range of values of the Nth classification region quantitative measures 206. In one or more examples, individual partitions of the third group of partitions 252 can correspond to a range of values of quantitative measures included in the Nth classification region training data 248. For example, quantitative measures included in the Nth classification region training data 248 can have a number of numerical values from a maximum value to a minimum value. The quantitative measures from the maximum value to the minimum value can be divided such that individual values for an individual partition correspond to a subset of values from the maximum value to the minimum value. In at least some examples, the values from the maximum value to the minimum value can be divided evenly between the partitions. In these examples, the individual partitions can correspond to a same range of values.
[0211] Individual partitions of the third group of partitions 252 can correspond to an individual token value. In the illustrative example of Figure 2, a subset of the third group of partitions 252 includes a further first partition 254, a further second partition 256, a further third partition 258, a further fourth partition 260, and a further fifth partition 262. In various examples, one or more of the further partitions 254, 256, 258, 260, 262 can correspond to one or more of the partitions 214, 216, 218, 220, 222 or to one or more of the additional partitions 234, 236, 238, 240, 242. In still other examples, one or more of the further partitions 254, 256, 258, 260, 262 can be different from one or more of the partitions 214, 216, 218, 220, 222 or one or more of the additional partitions 234, 236, 238, 240, 242. The individual further partitions 254, 256, 258, 260, 262 can correspond to respective token values. In one or more examples, the token values corresponding to the individual further partitions of the third group of partitions 252 can include integer values. In one or more additional examples, the token values corresponding to the individual further partitions of the third group of partitions 252 can include consecutive integer values.
[0212] The token value for a given quantitative measure of the Nth classification region quantitative measures 206 can be determined by determining the further partition of the third group of partitions 252 that corresponds to the value of the given quantitative measure. In response to determining a further partition that corresponds to the value of a quantitative measure, an Nth classification region token 264 can be assigned to the quantitative measure. The Nth classification region token 264 can be provided as input to a transformer-based machine learning model that analyzes at least one of sequencing data or methylation data derived from one or more samples. In the illustrative example of Figure 2, a further quantitative measure 266 can correspond to the further fifth partition 262. In these scenarios, the further quantitative measure 266 can correspond to a token value of the further fifth partition 262. The further quantitative measure 266 can be included in the Nth classification region training data 248 or the Nth classification region sample data 250.
[0213] Figure 3 is a diagrammatic representation of an example framework 300 to implement a transformer machine learning architecture 302 and one or more classification computational models to detect the presence of a biological condition in subjects, according to one or more example implementations. The transformer machine learning architecture 302 can include one or more transformer blocks. In the illustrative example of Figure 3, the transformer machine learning architecture 302 can include a first transformer block 304 up to an Nth transformer block 306. The transformer blocks 304, 306 can include a number of layers of neurons. For example, the transformer blocks 304, 306 can include a first layer of neurons 308 and a second layer of neurons 310.
[0214] Although the illustrative example of Figure 3 shows that the transformer machine learning architecture 302 includes two transformer blocks 304, 306 having two layers of neurons 308, 310, the transformer machine learning architecture 302 can be implemented with different numbers of transformer blocks and different numbers of neuron layers. To illustrate, the transformer machine learning architecture 302 can include at least 2 transformer blocks, at least 4 transformer blocks, at least 6 transformer blocks, at least 8 transformer blocks, at least 10 transformer blocks, at least 12 transformer blocks, at least 14 transformer blocks, at least 16 transformer blocks, at least 18 transformer blocks, at least 20 transformer blocks, at least 22 transformer blocks, at least 24 transformer blocks, at least 26 transformer blocks, at least 28 transformer blocks, or at least 30 transformer blocks. In one or more illustrative examples, the transformer machine learning architecture 302 can include from 2 to 40 transformer blocks, from 5 to 30 transformer blocks, from 10 to 20 transformer blocks, from 20 to 30 transformer blocks, from 30 to 40 transformer blocks, from 15 to 25 transformer blocks, or from 25 to 35 transformer blocks. In addition, individual transformer blocks can include at least one layer of neurons, at least two layers of neurons, at least 5 layers of neurons, at least 10 layers of neurons, at least 15 layers of neurons, at least 20 layers of neurons, at least 25 layers of neurons, or at least 50 layers of neurons. Further, the number of neurons included in each layer can include at least one neuron, at least 5 neurons, at least 10 neurons, at least 50 neurons, at least 100 neurons, at least 250 neurons, at least 500 neurons, at least 1000 neurons, or at least 2500 neurons. In one or more additional illustrative examples, individual layers of neurons can include from 1 to 20,000 neurons, from 10 to 10,000 neurons, from 1000 to 5000 neurons, from 1000 to 4000 neurons, from 2000 to 5000 neurons, from 3000 to 6000 neurons, from 4000 to 7000 neurons, or from 5000 to 8000 neurons.
[0215] Individual neurons of the transformer machine learning architecture 302 can implement an activation function. In one or more examples, neurons of the transformer machine learning architecture 302 can implement a rectified linear unit activation function. In one or more additional examples, neurons of the transformer machine learning architecture 302 can implement a Gaussian error linear unit (GELU) activation function. In one or more further examples, neurons of the transformer machine learning architecture 302 can implement a SoftMax activation function. In various examples, the transformer machine learning architecture 302 can implement a sigmoid activation function. In one or more illustrative examples, the transformer machine learning architecture 302 can implement a sigmoid activation function for individual neurons of the transformer machine learning architecture 302 and a SoftMax activation function as a layer that obtains output from a layer of neurons of the transformer machine learning architecture 302.
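For reference, the activation functions named above have standard mathematical definitions, which can be sketched in Python as follows. The function names are chosen for illustration, and the sketch uses the exact error-function form of the Gaussian error linear unit; the description does not state which form or approximation is used.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes negative inputs."""
    return max(0.0, x)


def gelu(x):
    """Gaussian error linear unit, exact form using the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softmax(values):
    """Normalize a list of scores into probabilities that sum to one."""
    shift = max(values)  # subtracting the maximum improves numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike the other three functions, SoftMax operates on a whole vector of values rather than on a single neuron output, which is consistent with its use as a layer that obtains output from a layer of neurons.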
[0216] In still other examples, the transformer machine learning architecture 302 can implement causal self-attention. Implementing self-attention in the transformer blocks of the transformer machine learning architecture 302 can cause the transformer machine learning architecture 302 to determine weights for different classification regions regardless of the genomic distance between classification regions. That is, causal self-attention can be implemented by the transformer machine learning architecture 302 to determine correlations between classification regions. The causal mask can ensure that a given neuron has access to information provided by previous neurons in the transformer machine learning architecture 302 and does not have access to subsequent neurons in the transformer machine learning architecture 302. To illustrate, a given neuron of the transformer machine learning architecture 302 can have access to previous tokens in the input sequence and neurons corresponding to subsequent tokens in the input can be zeroed out. Residual connections between layers of the transformer machine learning architecture 302 can enhance the flow of gradients to layers of the transformer machine learning architecture 302.
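A minimal, non-limiting sketch of masked (causal) self-attention with a residual connection is shown below. To keep the sketch short, it assumes a single attention head and uses the input vectors directly as queries, keys, and values, omitting the learned projection matrices that a trained architecture would typically include. The function names and the toy input are hypothetical.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(x):
    """Single-head attention over a list of vectors (sequence length x dimension).

    Position i attends only to positions 0..i; weights for later positions are
    zero, which is the effect of the causal mask.
    """
    seq_len, dim = len(x), len(x[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for i in range(seq_len):
        scores = [scale * sum(q * k for q, k in zip(x[i], x[j])) for j in range(i + 1)]
        attn = softmax(scores) + [0.0] * (seq_len - i - 1)
        weights.append(attn)
        outputs.append([sum(attn[j] * x[j][d] for j in range(seq_len)) for d in range(dim)])
    return outputs, weights


# Toy embeddings for three classification region tokens (made-up values).
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(embedded)

# Residual connection: add the block input back to the attention output.
with_residual = [[a + b for a, b in zip(xi, oi)] for xi, oi in zip(embedded, outputs)]
```

Because the attention weights depend only on the content of the vectors and not on how far apart the positions are, two classification regions can receive a large mutual weight regardless of the genomic distance between them.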
[0217] The framework 300 can include sample sequencing and methylation data 312. The sample sequencing and methylation data 312 can include information derived from nucleic acids obtained from one or more samples that have undergone one or more sequencing operations. For example, the sample sequencing and methylation data 312 can include sequence representations derived from nucleic acids obtained from the one or more samples. In one or more examples, sequence representations included in the sample sequencing and methylation data 312 can correspond to individual nucleic acids derived from the one or more samples. In one or more additional examples, sequence representations included in the sample sequencing and methylation data 312 can correspond to sequencing reads produced by one or more sequencing operations performed with respect to nucleic acids derived from the one or more samples.
[0218] The sample sequencing and methylation data 312 can also include information indicating methylation states of one or more nucleotides included in the sequence representations. In one or more examples, methylation states of the one or more nucleotides included in the sample sequencing and methylation data 312 can indicate methylated cytosines included in one or more CpGs of the sequence representations. In one or more additional examples, the methylation states of the one or more nucleotides included in the sample sequencing and methylation data 312 can indicate unmethylated cytosines included in one or more CpGs of the sequence representations. The methylation states of the one or more nucleotides can be determined by one or more nucleobase methylation state detection processes. In various examples, the one or more nucleobase methylation state detection processes used to generate the sample sequencing and methylation data 312 can correspond to the one or more nucleobase methylation state detection processes 108 described in relation to Figure 1.
[0219] The sample sequencing and methylation data 312 can be used to determine classification region quantitative measures 314. For individual classification regions, the classification region quantitative measures 314 can correspond to a number of sequence representations derived from one or more samples and that correspond to an individual classification region. In at least some examples, the classification region quantitative measures 314 can correspond to a number of sequence representations that are aligned with and have at least a threshold amount of overlap with an individual classification region. The classification region quantitative measures 314 can include a log10 value that is determined based on a number of sequence representations derived from one or more samples that correspond to individual classification regions. Additionally, the classification region quantitative measures 314 can include a logit value that is determined based on a number of sequence representations derived from one or more samples that correspond to individual classification regions. Further, the classification region quantitative measures 314 can include values obtained by applying a trigonometric function, such as a sine function or a cosine function, to a number of sequence representations derived from one or more samples that correspond to individual classification regions.
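As a hedged illustration of the transformations listed above, the following Python sketch computes a log10 value, a logit value, and sine and cosine values from a count of sequence representations. The description does not specify how a count is converted into these values, so the pseudocount added before taking the logarithm, the use of a fraction of a hypothetical `total` count as the logit input, and the clipping constant `eps` are all assumptions.

```python
import math


def log10_measure(count, pseudocount=1.0):
    """log10 of the count; the pseudocount (an assumption) avoids log10(0)."""
    return math.log10(count + pseudocount)


def logit_measure(count, total, eps=1e-6):
    """Logit of the fraction count / total, clipped away from 0 and 1.

    Treating the count as a fraction of a total is an illustrative assumption.
    """
    p = min(max(count / total, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def periodic_measures(count):
    """Sine and cosine of the count, returned as a pair."""
    return math.sin(count), math.cos(count)


if __name__ == "__main__":
    print(log10_measure(99))        # count of 99 with pseudocount 1
    print(logit_measure(50, 100))   # half of the molecules overlap the region
    print(periodic_measures(0))
```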
[0220] The classification region quantitative measures 314 can be used to generate classification region tokens 316. The classification region tokens 316 can be generated for individual samples and can correspond to a set of classification regions. The classification region tokens 316 can be determined by transforming the values of the classification region quantitative measures 314 for one or more samples and for a set of classification regions. In one or more examples, the classification region tokens 316 can be generated by converting the values of the classification region quantitative measures 314 to integer values. In one or more illustrative examples, the classification region tokens 316 can be generated through the framework 200 described in relation to Figure 2.
[0221] The classification region tokens 316 can be used to produce input embeddings 318 by the transformer machine learning architecture 302. The input embeddings 318 can include vectors that correspond to the classification region tokens 316. In various examples, in addition to representing the classification region tokens 316, the input embeddings 318 can represent positional information related to the classification region tokens 316. For example, for individual classification region tokens 316, a corresponding input embedding 318 can indicate a genomic position related to the individual classification region token 316. To illustrate, for individual classification region tokens 316, a corresponding input embedding 318 can indicate a position within a chromosome that is related to the individual classification region token 316. In one or more examples, one or more input embeddings 318 can correspond to one or more classification region tokens 316. In various examples, positional embeddings and classification region embeddings can be combined via vector addition to produce the one or more input embeddings 318 that are fed into the first transformer block 304.
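The combination of classification region embeddings and positional embeddings via vector addition can be sketched as follows. The table sizes, the random initialization, and the use of each region's ordinal position in the input sequence as its positional index are assumptions made for illustration; in a trained architecture the embedding tables would be learned, and the positional index could instead be derived from genomic coordinates.

```python
import random

VOCAB_SIZE = 100   # hypothetical number of token values (partitions)
NUM_REGIONS = 8    # hypothetical number of classification regions per sample
EMBED_DIM = 4      # hypothetical embedding width

rng = random.Random(0)


def random_table(rows, cols):
    """Stand-in for a learned embedding table."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


token_table = random_table(VOCAB_SIZE, EMBED_DIM)
position_table = random_table(NUM_REGIONS, EMBED_DIM)


def embed(tokens):
    """Add the token embedding and the positional embedding element by element."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(tokens)
    ]


# One sample: a token value for each of the eight classification regions.
input_embeddings = embed([20, 3, 99, 0, 57, 57, 12, 41])
```

In this sketch, the fifth and sixth regions share the token value 57 but receive different input embeddings because their positional embeddings differ.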
[0222] The input embeddings 318 can be provided to the first transformer block 304 and neurons of the first transformer block 304 can perform a number of calculations with respect to the input embeddings 318 to generate output values for the first transformer block 304 that are provided to a subsequent transformer block of the transformer machine learning architecture 302. The transformer machine learning architecture 302 can produce transformer block activations 320 based on the classification region tokens 316. In various examples, the transformer block activations 320 can characterize features of one or more samples. In one or more examples, the transformer block activations 320 can correspond to neuron values of an output layer of the individual transformer blocks of the transformer machine learning architecture 302. In one or more illustrative examples, the transformer block activations 320 can correspond to weights of connections between transformer blocks that are multiplied by the neuron values of the previous layer of a transformer block.
[0223] The transformer block activations 320 can be input data for a classification computational model 322. For example, the transformer block activations 320 can represent features of individual samples. Thus, instead of representing features of samples by quantitative measures derived from the sample sequencing and methylation data 312, features of individual samples can be represented by a set of transformer block activations 320. By using the transformer block activations 320 to represent features of samples rather than using quantitative measures derived from the samples, the classification computational model 322 can be trained and executed using data that is more likely to be predictive of one or more biological conditions. Thus, using the transformer block activations 320 as input to the classification computational model 322 can improve the accuracy of the classification computational model 322.
[0224] The classification computational model 322 can implement at least one of one or more machine learning techniques or one or more statistical techniques to determine a biological condition indicator 324. In one or more examples, the classification computational model 322 can determine a biological condition indicator 324 that is related to a tumor being present in subjects. For example, the biological condition indicator 324 can indicate whether a tumor is present in subjects or whether a tumor is not present in subjects. The biological condition indicator 324 can also indicate one or more types of cancer and / or subtypes of cancer that are present in subjects. In one or more additional examples, the biological condition indicator 324 can indicate tumor fractions that correspond to one or more samples. In one or more further examples, the biological condition indicator 324 can indicate a homology directed repair deficiency. In still other examples, the biological condition indicator 324 can indicate the presence or absence of one or more types of cells in subjects. The biological condition indicator 324 can also indicate progression or regression of one or more types of cancer. In various examples, the biological condition indicator 324 can correspond to an effectiveness of one or more treatments administered to subjects in relation to one or more types of cancer. Additionally, the biological condition indicator 324 can also correspond to one or more genomic mutations being present in nucleic acids included in samples obtained from subjects. To illustrate, the biological condition indicator 324 can correspond to the presence or absence of one or more single nucleotide variants (SNVs), one or more copy number variants or variations (CNVs) / aberrations, one or more insertions or deletions (indels), one or more gene fusions, one or more transversions, one or more translocations, one or more frame shifts, one or more duplications, one or more repeat expansions, one or more epigenetic variants, or one or more combinations thereof.
[0225] In one or more examples, the classification computational model 322 can include a regression model. For example, the classification computational model 322 can include a linear regression model. Additionally, the classification computational model 322 can include a logistic regression model. Further, the classification computational model 322 can comprise a ridge regression model. In still other examples, the classification computational model 322 can comprise a lasso regression model. The classification computational model 322 can also include a polynomial regression model. In various examples, the classification computational model 322 can implement one or more artificial neural networks. Although the illustrative example of Figure 3 shows a single classification computational model 322, in one or more additional examples, the framework 300 can include a plurality of classification computational models that can use the transformer block activations 320 to determine a number of different biological condition indicators 324. In at least some examples, the framework 300 can include multiple classification computational models 322 that can be trained to produce a number of biological condition indicators 324 relating to different biological conditions present in subjects.
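As one hedged illustration of a classification computational model that is separate from the transformer blocks, the sketch below fits a logistic regression model to per-sample feature vectors by batch gradient descent. The feature vectors stand in for transformer block activations and are made-up values, as are the labels, the learning rate, and the number of epochs; the helper names are hypothetical.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(features, labels):
            error = predict_probability(x, weights, bias) - y
            for k in range(n_features):
                grad_w[k] += error * x[k] / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Made-up activation vectors: label 1 = tumor present, label 0 = tumor not present.
activations = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 0.9]]
labels = [1, 1, 0, 0]
weights, bias = train_logistic_regression(activations, labels)
```

A ridge or lasso variant would add an L2 or L1 penalty on the weights to the same loss; that extension is omitted here for brevity.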
[0226] In at least some examples, the classification computational model 322 can be implemented separate from the transformer machine learning architecture 302. That is, the classification computational model 322 can implement computational techniques that do not include the execution of transformer blocks. In one or more additional examples, the classification computational model 322 can be included in the transformer machine learning architecture 302. In these examples, one or more additional computational layers can be added to the transformer machine learning architecture 302 based on the classification being performed. In one or more examples, the transformer machine learning architecture 302 can undergo a fine-tuning process to modify the transformer machine learning architecture 302 for one or more classification tasks.
[0227] In various examples, the transformer machine learning architecture 302 can include an output block that determines probabilities for a next token in a sequence. The transformer machine learning architecture 302 can determine a token with a highest probability and determine that the token is next in a sequence of tokens. In scenarios where the transformer machine learning architecture 302 includes the classification computational model 322, the output probability block can be replaced by a classification block. The neurons of the classification block can determine probabilities of a number of classifications related to the classification computational model 322. For example, in implementations where the classification computational model 322 determines whether a tumor is present or not present with respect to a subject, a classification block of the transformer machine learning architecture 302 can include a first neuron to determine a first probability of a tumor being present in a subject and a second neuron to determine a second probability of a tumor not being present in a subject. In one or more additional examples, where the classification computational model 322 determines one or more types of cancer present in subjects, the classification block of the transformer machine learning architecture 302 can include individual neurons that determine a probability of the individual types of cancer being present in subjects. To illustrate, a classification block of the transformer machine learning architecture 302 can include a first neuron to determine a first probability of a first type of cancer being present in subjects, a second neuron to determine a second probability of a second type of cancer being present in subjects, and a third neuron to determine a third probability of a third type of cancer being present in subjects. In various examples, the number of neurons present in the classification block of the transformer machine learning architecture 302 can correspond to a number of types of cancer being predicted by the transformer machine learning architecture 302.
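A classification block of the kind described above can be sketched as one neuron per class followed by a SoftMax normalization. In the sketch below, the class names follow the tumor present / tumor not present example, while the weights, biases, and activation vector are hypothetical values chosen for illustration rather than trained parameters.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classification_block(activation, class_weights, class_biases):
    """One neuron per class: a weighted sum of the activation, then SoftMax."""
    logits = [
        sum(w * a for w, a in zip(neuron_weights, activation)) + b
        for neuron_weights, b in zip(class_weights, class_biases)
    ]
    return softmax(logits)


CLASSES = ["tumor present", "tumor not present"]
class_weights = [[1.5, -0.5, 0.2], [-1.5, 0.5, -0.2]]  # hypothetical, not trained
class_biases = [0.0, 0.0]

probabilities = classification_block([0.8, 0.1, 0.4], class_weights, class_biases)
predicted = CLASSES[max(range(len(probabilities)), key=probabilities.__getitem__)]
```

Predicting among three types of cancer would use the same function with three rows in `class_weights`, so the number of neurons tracks the number of classes being predicted.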
[0228] Figure 4 is a diagrammatic representation of a framework 400 to train a transformer-based machine learning architecture and a classification model to detect the presence of a tumor in a subject, in accordance with one or more example implementations. The framework 400 can include unlabeled training data 402. The unlabeled training data 402 can be derived from one or more first samples 404 that are obtained from a first group of subjects 406. The first group of subjects 406 can comprise first subjects in which a tumor is detected and second subjects in which a tumor is not detected. The unlabeled training data 402 may not indicate data that corresponds to the first subjects in which a tumor is detected and may not indicate data that corresponds to the second subjects in which a tumor is not detected. In one or more examples, the unlabeled training data 402 can be generated based on sequencing data and methylation data produced using one or more nucleobase methylation state detection processes.
[0229] In one or more examples, the unlabeled training data 402 can be generated by implementing one or more of the nucleobase methylation state detection processes 108 described in relation to Figure 1. In one or more additional examples, the unlabeled training data 402 can include quantitative measures derived from sequencing data and methylation data produced by one or more nucleobase methylation state detection processes. In various examples, the unlabeled training data 402 can include quantitative measures that correspond to one or more classification regions. In at least some examples, the unlabeled training data 402 can include the training data used in the framework 200 described in relation to Figure 2, such as the first classification region training data 208, the second classification region training data 228, up to the Nth classification region training data 248. In one or more illustrative examples, the unlabeled training data 402 can include a number of tokens produced according to the framework 200 described in relation to Figure 2.
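One way that unlabeled token sequences can supply a training signal without tumor labels is next-token prediction, in which each token in a sample's sequence serves as the target for the tokens that precede it. The description does not state the training objective, so the use of next-token prediction with a cross-entropy loss in the sketch below is an assumption, as are the function names and the toy sequence.

```python
import math


def next_token_pairs(tokens):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def cross_entropy(predicted_probs, target):
    """Negative log probability assigned to the correct next token."""
    return -math.log(predicted_probs[target])


# Toy token sequence for one unlabeled sample (one token per classification region).
sample_tokens = [20, 3, 99, 0]
pairs = next_token_pairs(sample_tokens)

# Baseline: a predictor that assigns equal probability to each of 100 token values.
uniform = [1.0 / 100] * 100
loss = sum(cross_entropy(uniform, target) for _, target in pairs) / len(pairs)
```

The uniform baseline yields a loss of ln(100), approximately 4.605; a model that learns correlations between classification regions would be expected to reduce the loss below that value.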
[0230] The framework 400 can include, at 408, performing a training process for a transformer machine learning architecture using the unlabeled training data 402. The training process can be used to determine a number of weights of the transformer machine learning architecture. In one or more examples, the training process for the transformer machine learning architecture can determine weights of connections between neurons included in transformer blocks of the transformer machine learning architecture. Initially, the weights can be randomly assigned and then modified according to a stochastic gradient descent technique. In one or more illustrative examples, the training of the transformer machine learning architecture at 408 can implement an AdamW optimization method that applies decoupled weight decay regularization.
[0231] In one or more additional examples, the training process for the transformer machine learning architecture can also include a dropout regularization technique. The dropout regularization technique can cause output from one or more neurons of one or more layers of transformer blocks to be ignored during one or more iterations of the training process. A probability can be applied to a neuron being dropped out. In one or more illustrative examples, the dropout probability for training the transformer machine learning architecture can be 0.1, 0.2, 0.3, 0.4, or 0.5. During training of the transformer machine learning architecture at 408, cosine based annealing can be implemented with respect to the learning rate. In one or more additional illustrative examples, the training of the transformer machine learning architecture can be performed over at least 1000 iterations, at least 2500 iterations, at least 5000 iterations, at least 10,000 iterations, at least 25,000 iterations, or at least 50,000 iterations.
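As a minimal sketch of the cosine-based annealing of the learning rate mentioned above, the following function computes a learning rate that decays from a peak value to a floor over a fixed number of iterations. The function name `cosine_annealed_lr`, the peak rate of 1e-3, the floor of 0, and the single decay cycle are illustrative assumptions and are not values specified for the training process at 408.

```python
import math


def cosine_annealed_lr(step, total_steps, peak_lr=1e-3, min_lr=0.0):
    """Learning rate at `step` under one cosine decay cycle (hypothetical helper)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so that iterations past the end stay at the floor.
    progress = min(max(step, 0), total_steps) / total_steps
    # Cosine term moves from 1 (start) to -1 (end), so the rate moves
    # smoothly from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: a run of 5000 iterations, sampled at the start, midpoint, and end.
schedule = [cosine_annealed_lr(s, 5000) for s in (0, 2500, 5000)]
```

Under these assumptions the rate starts at the peak, reaches half of the peak at the midpoint, and ends at the floor. Warm-up phases or restarts, which some schedules add, are omitted here.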
[0232] After performing the training process for the machine learning architecture at 408, a trained transformer machine learning architecture 410 can be produced. The weights of the connections between neurons of the trained transformer machine learning architecture 410 can be fixed. The training process for the transformer machine learning architecture performed at 408 can be implemented by a number of computing devices over a period of time. In one or more illustrative examples, the training process for the transformer machine learning architecture performed at 408 can be implemented by at least 2 computing devices, at least 3 computing devices, at least 4 computing devices, at least 5 computing devices, at least 6 computing devices, at least 7 computing devices, at least 8 computing devices, at least 9 computing devices, at least 10 computing devices, or more. The period of time for the training process of the transformer machine learning architecture can be at least 2 hours, at least 4 hours, at least 6 hours, at least 8 hours, at least 10 hours, at least 12 hours, at least 14 hours, at least 16 hours, at least 18 hours, at least 20 hours, or more.
[0233] After the trained transformer machine learning architecture 410 has been produced, the trained transformer machine learning architecture 410 can be used to produce training data for a classification model. For example, the trained transformer machine learning architecture 410 can generate transformer block activations 412 that can be used to train or otherwise be incorporated into one or more classification models. The transformer block activations 412 can include values of neurons 414 of output layers of transformer blocks of the trained transformer machine learning architecture 410. In one or more illustrative examples, transformer block activations 412 can include at least 500 activations, at least 1000 activations, at least 1500 activations, at least 2000 activations, at least 2500 activations, at least 3000 activations, at least 3500 activations, at least 4000 activations, or more. In one or more examples, the classification model can implement a Random Forest algorithm, a Naive Bayes algorithm, a K-nearest Neighbors algorithm, a gradient boosting algorithm, a support vector machine, or a logistic regression algorithm. In one or more illustrative examples, the classification model can include one or more additional computational blocks of the trained transformer machine learning architecture 410.
[0234] The transformer block activations 412 can be used as part of a training process for a classification computational model at 416. The training process at 416 can be performed using labeled training data 418. The labeled training data 418 can include information derived from one or more second samples 420 obtained from a second group of subjects 422. In one or more examples, the second group of subjects 422 can be free of a biological condition related to the classification model that is being trained. In one or more illustrative examples, the second group of subjects 422 can be free of cancer. That is, cancer has not been detected in the second group of subjects 422. The labeled training data 418 can indicate the information obtained from the one or more second samples as being from subjects in which a tumor is not detected. The labeled training data 418 can also include information derived from one or more third samples 424 obtained from a third group of subjects 426. In one or more additional examples, a biological condition related to the classification model that is being trained can be present in the third group of subjects 426. In one or more additional illustrative examples, a tumor is detected in the third group of subjects 426. The labeled training data 418 can indicate the information obtained from the one or more third samples 424 as being from subjects in which a tumor is detected.
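The use of transformer block activations together with labels to train a classification model can be sketched as follows, taking logistic regression as the classifier. This is a minimal illustration under stated assumptions: the two-element activation vectors are made-up stand-ins for the transformer block activations 412, the labels 0 and 1 stand for "tumor not detected" and "tumor detected", and the helper names, learning rate, and epoch count are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(features, labels, lr=0.5, epochs=500):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # derivative of the log loss w.r.t. the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_proba(weights, bias, x):
    """Predicted probability of the positive (tumor detected) label."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy stand-ins for activations of a frozen, already-trained transformer.
activations = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
labels = [0, 0, 1, 1]  # 0 = tumor not detected, 1 = tumor detected
weights, bias = train_logistic_regression(activations, labels)
```

Only the classifier parameters are updated in this sketch; the activations are treated as fixed inputs, which mirrors the weights of the trained transformer machine learning architecture 410 being fixed.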
[0235] In scenarios where a classification model is being trained to detect one or more genomic variations, the second samples 420 obtained from the second group of subjects 422 can be free of the one or more genomic variations and the one or more genomic variations can be present in the third samples 424 obtained from the third group of subjects 426. In situations where the classification model is being trained to determine a progression or regression of a biological condition, such as cancer, the second group of subjects 422 can correspond to individuals in which a biological condition has not progressed and the third group of subjects 426 can correspond to individuals in which a biological condition has progressed. In still other instances, the second group of subjects 422 can include individuals for which one or more treatments for a biological condition, such as cancer, have been effective and the third group of subjects 426 can include individuals for which one or more treatments of the biological condition have not been effective.
[0236] In various examples, the labeled training data 418 can be generated by implementing one or more of the nucleobase methylation state detection processes 108 described in relation to Figure 1. In one or more additional examples, the labeled training data 418 can include quantitative measures derived from sequencing data and methylation data produced by one or more nucleobase methylation state detection processes being applied with respect to the one or more second samples 420 and the one or more third samples 424. In various examples, the labeled training data 418 can include quantitative measures that correspond to one or more classification regions. In at least some examples, the labeled training data 418 can include the training data used in the framework 200 described in relation to Figure 2, such as the first classification region training data 208, the second classification region training data 228, up to the Nth classification region training data 248. In one or more illustrative examples, the labeled training data 418 can include a number of tokens produced according to the framework 200 described in relation to Figure 2. In one or more illustrative examples, the labeled training data 418 can include information generated from at least 200 samples 420 or 424, at least 400 samples 420 or 424, at least 600 samples 420 or 424, at least 800 samples 420 or 424, at least 1000 samples 420 or 424, at least 1500 samples 420 or 424, or more. In one or more additional illustrative examples, the amount of the labeled training data 418 that corresponds to the one or more second samples 420 can be from about 40% to about 60% and the amount of the labeled training data 418 that corresponds to the one or more third samples 424 can be from about 40% to about 60%.
[0237] Performing the training process for the classification computational model at 416 can produce a trained classification computational model 428. The trained classification computational model 428 can determine a cancer indication for a test subject. The cancer indication can indicate that cancer is present or not present in a test subject. The cancer indication can also indicate a probability of one or more types of cancer being present in a test subject. Additionally, the cancer indication can include a tumor fraction. Further, the cancer indication can correspond to an amount of progression or regression of a type of cancer. In still other examples, the cancer indication can correspond to an effectiveness of one or more treatments provided to a test subject. In addition, the trained classification computational model 428 can be trained to generate indications of one or more genomic mutations being present in nucleic acids included in samples obtained from subjects. For example, the trained classification computational model 428 can be trained to produce indications corresponding to the presence or absence of one or more single nucleotide variants (SNVs), one or more copy number variants or variations (CNVs) / aberrations, one or more insertions or deletions (indels), one or more gene fusions, one or more transversions, one or more translocations, one or more frame shifts, one or more duplications, one or more repeat expansions, one or more epigenetic variants, or one or more combinations thereof. In at least some examples, the trained classification computational model 428 can correspond to at least one of the classification computational model 138 described in relation to Figure 1 or the classification computational model 324 described in relation to Figure 3.
[0238] Figure 5 illustrates a framework 500 to determine classification regions to be used to identify sequence representations to be analyzed to determine an indication of a biological condition for subjects, according to one or more example implementations. The framework 500 can include first classification region sequence representations 502. The first classification region sequence representations 502 can be derived from a first group of subjects 504. In one or more examples, the first group of subjects 504 can include subjects in which a tumor is not detected. The first classification region sequence representations 502 can correspond to nucleic acids derived from samples obtained from the first group of subjects 504. In one or more additional examples, the first classification region sequence representations 502 can correspond to sequencing reads produced based on nucleic acids present in samples obtained from the first group of subjects 504.
[0239] Additionally, the framework 500 can include second classification region sequence representations 506. The second classification region sequence representations 506 can be derived from a second group of subjects 508. In one or more examples, the second group of subjects 508 can include subjects in which a tumor is detected. The second classification region sequence representations 506 can correspond to nucleic acids derived from samples obtained from the second group of subjects 508. In one or more additional examples, the second classification region sequence representations 506 can correspond to sequencing reads produced based on nucleic acids present in samples obtained from the second group of subjects 508.
[0240] The framework 500 can include, at 510, determining classification region quantitative measures and classification region allele fractions with respect to a group of classification regions 512. The group of classification regions 512 can include a first classification region 514, a second classification region 516, up to an Nth classification region 518. In one or more illustrative examples, the group of classification regions 512 can correspond to genomic regions that include driver mutations for one or more types of cancer. In one or more additional illustrative examples, the group of classification regions 512 can correspond to one or more genes that include one or more mutations that correspond to one or more types of cancer.
[0241] In various examples, operation 510 can produce region allele fractions 520 that indicate a maximum mutant allele fraction (maxMAF) for individual classification regions of the group of classification regions 512. The maxMAF for an individual classification region can correspond to a number of first region sequence representations 502 and / or a number of second region sequence representations 506 that correspond to at least one allele present in the individual classification region. Additionally, operation 510 can produce classification region quantitative measures 522 that correspond to a number of first region sequence representations 502 and a number of second region sequence representations 506 that correspond to individual classification regions of the group of classification regions 512.
[0242] Additionally, the framework 500 can include, at 524, determining correlation metrics to select a subset of the group of classification regions 512 based on the classification region allele fractions 520 and the classification region quantitative measures 522. In one or more examples, at least one of a Pearson correlation or a Spearman correlation can be determined for an individual classification region included in the group of classification regions 512 based on the classification region allele fraction 520 for the individual classification region and classification region quantitative measures 522 for the individual classification region. In one or more illustrative examples, a maxMAF value can be determined for the first classification region 514 and a quantitative measure can be determined for the first classification region 514. At least one of a Pearson correlation procedure or a Spearman correlation procedure can determine a correlation metric between the maxMAF value and the quantitative measure for the first classification region 514.
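The Pearson and Spearman correlation procedures referenced above can be sketched as follows. The sketch assumes that, for a single classification region, the correlation is computed across samples between a per-sample maxMAF value and a per-sample quantitative measure; the input numbers are made up for illustration and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up per-sample values for one classification region.
max_maf = [0.00, 0.01, 0.05, 0.10, 0.20]
region_measure = [1.0, 2.0, 3.0, 5.0, 20.0]
pearson_score = pearson(max_maf, region_measure)
spearman_score = spearman(max_maf, region_measure)
```

Because the toy measure increases monotonically with maxMAF, the Spearman score is 1 while the Pearson score is lower, which illustrates why the two metrics can disagree for non-linear relationships.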
[0243] In one or more further examples, the classification region quantitative measures 522 can be analyzed to determine a percentage of samples used to determine the first region sequence representations 502 and the second region sequence representations 506 that have at least one sequence representation that corresponds to an individual classification region. To illustrate, for the first classification region 514 a number of samples having at least one sequence representation corresponding to the first classification region 514 can be determined.
[0244] In various examples, operation 524 can determine classification region criteria 526. The classification region criteria 526 can indicate a subset of the group of classification regions 512 that satisfy one or more threshold metrics. In one or more illustrative examples, a classification region included in the group of classification regions 512 can be included in the classification region criteria 526 in response to at least one of a Spearman correlation score or a Pearson correlation score in relation to the maxMAF for the classification region being greater than a threshold correlation score. The threshold correlation score can be at least 0.20, at least 0.21, at least 0.22, at least 0.23, at least 0.24, at least 0.25, at least 0.26, at least 0.27, at least 0.28, at least 0.29, at least 0.30, at least 0.31, at least 0.32, at least 0.33, at least 0.34, at least 0.35, at least 0.36, at least 0.37, at least 0.38, at least 0.39, or at least 0.40.
[0245] In one or more additional examples, a classification region included in the group of classification regions 512 can be included in the classification region criteria 526 in response to at least a threshold amount of samples obtained from the first group of subjects 504 and the second group of subjects 508 having at least one nucleic acid corresponding to the individual classification region. In various examples, the threshold amount of samples can be at least 5% of the samples, at least 6% of the samples, at least 7% of the samples, at least 8% of the samples, at least 9% of the samples, at least 10% of the samples, at least 11% of the samples, at least 12% of the samples, at least 13% of the samples, at least 14% of the samples, at least 15% of the samples, at least 16% of the samples, at least 17% of the samples, at least 18% of the samples, at least 19% of the samples, or at least 20% of the samples. In one or more further illustrative examples, a subset of the group of classification regions 512 that satisfy the threshold metrics with respect to classification region allele fractions and sparsity of samples related to the classification region can be indicated in the classification region criteria 526. In at least some examples, the classification region criteria 526 can correspond to at least a portion of the classification region criteria 122 described in relation to Figure 1.
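The selection of classification regions against the threshold metrics can be sketched as a simple filter. The sketch assumes that a region is kept when either correlation score exceeds the threshold correlation score and the fraction of samples having at least one nucleic acid in the region meets the sample threshold; requiring both conditions together, the dictionary layout, the region names, and the numeric values are illustrative assumptions.

```python
def select_regions(region_stats, min_correlation=0.20, min_sample_fraction=0.05):
    """Return the region identifiers that satisfy both threshold metrics."""
    selected = []
    for region_id, stats in region_stats.items():
        # "At least one of" the two correlation scores must exceed the threshold.
        best_corr = max(stats["pearson"], stats["spearman"])
        passes_corr = best_corr > min_correlation
        # Sparsity check: enough samples carry a molecule in this region.
        passes_sparsity = stats["sample_fraction"] >= min_sample_fraction
        if passes_corr and passes_sparsity:
            selected.append(region_id)
    return sorted(selected)


# Made-up statistics for three hypothetical regions.
region_stats = {
    "region_1": {"pearson": 0.35, "spearman": 0.10, "sample_fraction": 0.12},
    "region_2": {"pearson": 0.15, "spearman": 0.18, "sample_fraction": 0.50},
    "region_3": {"pearson": 0.40, "spearman": 0.45, "sample_fraction": 0.02},
}
selected = select_regions(region_stats)
```

Here only `region_1` is retained: `region_2` fails the correlation threshold and `region_3` is present in too few samples.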
[0246] Figure 6 is a diagrammatic representation of a process 600 to implement a generative machine learning architecture to detect the presence of a tumor in subjects, in accordance with one or more example implementations. At operation 602, the process 600 can include obtaining training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions. Individual nucleic acid molecules can satisfy methylation criteria corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules. The nucleic acid molecules can be derived from a plurality of first training samples. The methylation criteria can correspond to a threshold number of methylated CpGs present in the individual nucleic acid molecules. The threshold number of methylated CpGs can correspond to a maximum number of methylated CpGs or a minimum number of methylated CpGs. In one or more additional examples, the methylation criteria can correspond to partitions of nucleic acid molecules related to subsamples derived from individual training samples of the plurality of training samples. For example, individual training samples can be divided into a plurality of subsamples including a first subsample corresponding to a first partition and a second subsample corresponding to a second partition. The first partition can comprise nucleic acids with a cytosine modification in a greater proportion than additional nucleic acids included in the second partition. In various examples, the methylation criteria can correspond to nucleic acid molecules present in the first partition or nucleic acid molecules present in the second partition.
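Counting the nucleic acid molecules that overlap individual genomic regions and satisfy a minimum number of methylated CpGs can be sketched as follows. The record layout (`chrom`, `start`, `end`, `methylated_cpgs`), the half-open coordinate convention, the region names, and the threshold of 5 are hypothetical choices made for illustration.

```python
def count_molecules_per_region(molecules, regions, min_methylated_cpgs=5):
    """Count molecules per region that meet a minimum methylated-CpG threshold.

    `regions` maps a region name to (chrom, start, end) using half-open
    coordinates [start, end); molecules use the same convention.
    """
    counts = {name: 0 for name in regions}
    for mol in molecules:
        # Methylation criterion: a minimum number of methylated CpGs.
        if mol["methylated_cpgs"] < min_methylated_cpgs:
            continue
        for name, (chrom, start, end) in regions.items():
            overlaps = (
                mol["chrom"] == chrom and mol["start"] < end and mol["end"] > start
            )
            if overlaps:
                counts[name] += 1
    return counts


# Made-up regions and molecules.
regions = {"region_a": ("chr1", 100, 200), "region_b": ("chr1", 500, 600)}
molecules = [
    {"chrom": "chr1", "start": 150, "end": 300, "methylated_cpgs": 8},
    {"chrom": "chr1", "start": 150, "end": 300, "methylated_cpgs": 2},
    {"chrom": "chr1", "start": 550, "end": 700, "methylated_cpgs": 6},
    {"chrom": "chr2", "start": 150, "end": 300, "methylated_cpgs": 9},
    {"chrom": "chr1", "start": 200, "end": 260, "methylated_cpgs": 7},
]
counts = count_molecules_per_region(molecules, regions)
```

A criterion based on a maximum number of methylated CpGs would invert the comparison, and a partition-based criterion would filter on a partition label instead of a CpG count.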
[0247] The process 600 can also include, at 604, performing a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions. The tokens can indicate a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions.
[0248] At 606, the process 600 can include determining output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model. In addition, the process 600 can include, at 608, performing, based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to one or more biological conditions being present in one or more subjects. The second training process can be performed using sequence representations derived from second training samples. In one or more examples, the second machine learning model can implement one or more statistical classification models or one or more machine learning classification models that are different from the number of transformer blocks of the first machine learning model. In at least some examples, a training process for the second machine learning model can be performed with respect to labeled training data that includes first sequence representations derived from a first portion of the second training samples and second sequence representations derived from a second portion of the second training samples. The first portion of the second training samples can be derived from first subjects in which a tumor is not detected and the second portion of the second training samples can be derived from second subjects in which one or more cancer types or subtypes have been detected.
[0249] The one or more classifications can include a first classification indicating that a tumor is present in a subject and a second classification indicating that a tumor is not present in a subject. In addition, the one or more classifications can include a first classification indicating a first cancer type and a second classification indicating a second cancer type. Further, the one or more classifications can correspond to one or more tumor fraction values. The one or more classifications can also correspond to a level of homology-directed repair with respect to a test subject. In various examples, the one or more classifications can correspond to a presence of one or more genomic mutations present in nucleic acid molecules derived from samples obtained from subjects.
[0250] In at least some examples, the number of transformer blocks of the first machine learning model can be arranged in a sequence having a first transformer block and a last transformer block with output activations from an individual transformer block being provided as input activations to a next transformer block in the sequence and the number of transformer blocks can include a first number of neurons. In addition, the second machine learning model can include an additional transformer block having input activations that correspond to output activations of a next to last transformer block of the sequence where the additional transformer block includes a second number of neurons that is different from the first number of neurons. The first number of neurons can correspond to a number of tokens produced based on sequencing data obtained from one or more subjects and the second number of neurons can correspond to the one or more classifications.
[0251] In various examples, sequencing data can be obtained that is derived from a number of samples, the sequencing data corresponding to nucleic acid molecules included in the number of samples. Based on the sequencing data, quantitative measures for a number of genomic regions can be determined. Individual quantitative measures can correspond to a number of the nucleic acid molecules that correspond to an individual genomic region of the number of genomic regions. In one or more examples, a first subset of the number of regions having quantitative measures that are at least a first threshold value can be determined. A second subset of the number of regions having additional quantitative measures that are at least a second threshold value can also be determined. Additionally, the plurality of regions related to the training data can be determined by combining the first subset of the number of regions and the second subset of the number of regions.
[0252] The quantitative measures can also be determined by, for individual genomic regions of the plurality of genomic regions, determining a first number of the amount of nucleic acid molecules that correspond to the individual genomic regions and for individual control genomic regions, determining a second number of the amount of nucleic acid molecules that correspond to the individual control genomic regions. Individual control genomic regions can include genomic regions having a minimum number of methylated cytosine-guanine dinucleotides in subjects in which a tumor is not present. The quantitative measures can then be determined by performing a transformation of the first number of the amount of nucleic acid molecules with respect to the second number of the amount of nucleic acid molecules.
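The description above does not fix the form of the transformation. As one possible illustration only, the sketch below assumes a log2 ratio of the region count to the summed control-region count, with a pseudocount to avoid division by zero. The choice of a log ratio, the summation over control regions, the pseudocount of 1, and the function name are all assumptions.

```python
import math


def normalized_measure(region_count, control_counts, pseudocount=1.0):
    """Log2 ratio of a region's molecule count to the total control count.

    This is an assumed example of "a transformation of the first number
    with respect to the second number", not a form stated in the text.
    """
    if region_count < 0 or any(c < 0 for c in control_counts):
        raise ValueError("counts must be non-negative")
    control_total = sum(control_counts)
    return math.log2((region_count + pseudocount) / (control_total + pseudocount))


# Made-up counts: 7 molecules in the region, 1 molecule across two control regions.
measure = normalized_measure(7, [1, 0])
```

With these toy counts the ratio is 8 to 2, so the measure is 2. A value of 0 indicates that the region and the controls hold the same adjusted count.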
[0253] In various examples, input to the machine learning model can be determined by, for individual regions of the plurality of regions, determining a range of values of the quantitative measures that correspond to the individual regions. Additionally, a subset of the values of the quantitative measures that correspond to individual partitions of a number of partitions corresponding to the individual regions can be determined. The number of partitions can be distributed such that individual partitions of the number of partitions correspond to a same number of the amount of nucleic acid molecules included in the training data. Further, for individual first training samples of the plurality of first training samples, a partition of the number of partitions can be determined for individual genomic regions of the plurality of genomic regions based on a number of the amount of nucleic acid molecules derived from the individual first training sample that correspond to the individual genomic region. A number of tokens can then be determined with individual tokens of the number of tokens corresponding to the partition of the number of partitions for the individual genomic region. The training data for the first machine learning model can include the number of tokens for the plurality of genomic regions for the plurality of first training samples.
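The tokenization described above can be sketched as equal-frequency binning, in which the partition boundaries for each region are chosen so that each partition holds roughly the same number of training values, and the token for a sample is the index of the partition into which its value falls. Interpreting the partitions as equal-frequency bins, the use of 4 partitions, the region names, and the counts are illustrative assumptions.

```python
from bisect import bisect_right


def quantile_edges(values, n_partitions):
    """Interior cut points so that each partition holds about the same number of values."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_partitions] for k in range(1, n_partitions)]


def tokenize(value, edges):
    """Token = index of the partition containing `value` (0 .. len(edges))."""
    return bisect_right(edges, value)


# Made-up per-region quantitative measures across eight training samples.
training_counts = {
    "region_a": [0, 1, 2, 3, 4, 5, 6, 7],
    "region_b": [10, 10, 20, 20, 30, 30, 40, 40],
}
edges = {r: quantile_edges(v, 4) for r, v in training_counts.items()}

# Tokens for one hypothetical sample, ordered by region name.
sample = {"region_a": 5, "region_b": 10}
tokens = [tokenize(sample[r], edges[r]) for r in sorted(sample)]
```

In this toy case the sample yields the tokens 2 and 0. Values outside the training range fall into the first or last partition, so every value maps to a valid token.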
[0254] In one or more examples, first test sequencing data can be obtained that is derived from one or more first test samples obtained from a test subject at a first time and a first classification for the test subject can be determined based on implementing the second machine learning model with respect to the first test sequencing data. Additionally, second test sequencing data can be obtained that is derived from one or more second test samples obtained from the test subject at a second time and a second classification for the test subject can be determined based on implementing the second machine learning model with respect to the second test sequencing data. In one or more illustrative examples, an amount of progression of cancer or an amount of regression of cancer can be determined based on an amount of difference between the first classification and the second classification. In one or more additional illustrative examples, a level of effectiveness of the one or more treatments can be determined based on an amount of difference between the first classification and the second classification. In one or more further illustrative examples, an indication of minimum residual disease can be determined based on an amount of difference between the first classification and the second classification. In at least some examples, the first time can be before administering one or more treatments to the test subject and the second time can be after administering the one or more treatments to the test subject.
[0255] In at least some examples, after performing the second training process, test sequencing data can be obtained that is derived from a test sample obtained from a test subject. Based on the test sequencing data, quantitative measures can be determined for the plurality of genomic regions. Individual quantitative measures can indicate a number of nucleic acid molecules included in the test sample that (i) correspond to an individual genomic region of the plurality of genomic regions and (ii) satisfy one or more methylation criteria. Based on the quantitative measures, tokens for the test sample can be determined with individual tokens indicating a partition of a plurality of partitions for an individual genomic region of the plurality of genomic regions. The tokens can be provided as input to the second machine learning model. Based on implementing the second machine learning model with respect to the tokens, a classification of the one or more classifications for the test subject can then be determined.
EXEMPLARY METHODS
A. Determining an indication of a biological condition in a sample
[0256] In one aspect, a method includes obtaining, by a computing system having one or more hardware processors and memory, training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying methylation criteria corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules. The amount of nucleic acid molecules can be derived from a plurality of first training samples. In one or more aspects, the method includes performing, by the computing system and based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions. The tokens indicate a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions. Additionally, the method includes determining, by the computing system, output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model. Further, the method includes performing, by the computing system and based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to cancer being present in one or more subjects. The second training process is performed using sequence representations derived from second training samples.
[0257] In one aspect, a computing apparatus includes one or more hardware processors. The computing apparatus also includes memory storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations including obtaining training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying methylation criteria corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules and the amount of nucleic acid molecules being derived from a plurality of first training samples; performing, based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions, the tokens indicating a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions; determining output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model; and performing, based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to cancer being present in one or more subjects, the second training process being performed using sequence representations derived from second training samples.
[0258] In one aspect, one or more non-transitory computer-readable storage media include computer-readable instructions that, when executed by one or more hardware processors, cause the one or more hardware processors to perform operations including obtaining training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying a methylation criteria corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules and the amount of nucleic acid molecules being derived from a plurality of first training samples; performing, based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions, the tokens indicating a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions; determining output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model; and performing, based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to cancer being present in one or more subjects, the second training process being performed using sequence representations derived from second training samples.
B. Partitioning the sample into a plurality of subsamples
[0259] In some embodiments described herein, different forms of DNA (e.g., hypermethylated and hypomethylated DNA) are physically partitioned based on one or more characteristics of the DNA. This approach can be used to determine, for example, whether certain sites or regions are hypermethylated or hypomethylated. Partitioning can be performed before attaching adapters to DNA molecules in the sample, e.g., so as to facilitate including partition tags in the adapters. Partition tags can be used to identify which partition a molecule was found in. Following partitioning (and attachment of adapters if applicable), further steps such as amplification, target capture, and sequencing may be performed.
[0260] Methylation profiling can involve determining methylation patterns across different regions of the genome. For example, after partitioning molecules based on extent of methylation (e.g., relative number of methylated nucleobases per molecule) and further steps as discussed above including sequencing, the sequences of molecules in the different partitions can be mapped to a reference genome. This can show regions of the genome that, compared with other regions, are more highly methylated or are less highly methylated. In this way, genomic regions, in contrast to individual molecules, may differ in their extent of methylation.
[0261] Partitioning nucleic acid molecules in a sample can increase a rare signal, e.g., by enriching rare nucleic acid molecules that are more prevalent in one partition of the sample. For example, a genetic variation present in hypermethylated DNA but less (or not) present in hypomethylated DNA can be more easily detected by partitioning a sample into hypermethylated and hypomethylated nucleic acid molecules. By analyzing multiple partitions of a sample, a multidimensional analysis of a single molecule can be performed and hence, greater sensitivity can be achieved. Partitioning may include physically partitioning nucleic acid molecules into partitions or subsamples based on the presence or absence of one or more methylated nucleobases. A sample may be partitioned into partitions or subsamples based on a characteristic that is indicative of differential gene expression or a disease state. A sample may be partitioned based on a characteristic, or combination thereof that provides a difference in signal between a normal and diseased state during analysis of nucleic acids, e.g., cell free DNA (cfDNA), non-cfDNA, tumor DNA, circulating tumor DNA (ctDNA) and cell free nucleic acids (cfNA).
[0262] In some embodiments, hypermethylation and / or hypomethylation variable epigenetic target regions are analyzed to determine whether they show differential methylation characteristic of particular immune cell types, such as rare immune cell types, tumor cells or cells of a type that does not normally contribute to the DNA sample being analyzed (such as cfDNA).
[0263] In some instances, heterogeneous DNA in a sample is partitioned into two or more partitions (e.g., at least 3, 4, 5, 6 or 7 partitions). In some embodiments, each partition is differentially tagged. Tagged partitions can then be pooled together for collective sample prep and / or sequencing. The partitioning-tagging-pooling steps can occur more than once, with each round of partitioning occurring based on a different characteristic (examples provided herein), and tagged using differential tags that are distinguished from other partitions and partitioning means. In other instances, the differentially tagged partitions are separately sequenced.
[0264] In some embodiments, sequence reads from differentially tagged and pooled DNA are obtained and analyzed in silico. Tags are used to sort reads from different partitions. Analysis to detect genetic variants can be performed on a partition-by-partition level, as well as whole nucleic acid population level. For example, analysis can include in silico analysis to determine genetic variants, such as CNV, SNV, indel, fusion in nucleic acids in each partition. In some instances, in silico analysis can include determining chromatin structure. For example, coverage of sequence reads can be used to determine nucleosome positioning in chromatin. Higher coverage can correlate with higher nucleosome occupancy in a genomic region, while lower coverage can correlate with lower nucleosome occupancy or a nucleosome depleted region (NDR).
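The in silico sorting and coverage steps can be pictured with a short sketch. The tag sequences in `PARTITION_TAGS`, the `(tag, start, end)` read representation, and the function names are hypothetical and serve only to show how reads might be grouped by partition tag and how per-position coverage might then be tallied for one partition.

```python
from collections import defaultdict

# Hypothetical partition tags; real tag sequences are not given in the text.
PARTITION_TAGS = {"ACGT": "hypo", "TGCA": "hyper"}


def sort_reads_by_partition(reads):
    """Group (tag, start, end) reads by the partition their tag identifies.

    Reads carrying an unrecognized tag are dropped.
    """
    by_partition = defaultdict(list)
    for tag, start, end in reads:
        partition = PARTITION_TAGS.get(tag)
        if partition is not None:
            by_partition[partition].append((start, end))
    return dict(by_partition)


def coverage(intervals, region_start, region_end):
    """Per-position read depth over [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in intervals:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth
```

Under the interpretation given in the paragraph above, positions with higher depth would be candidates for higher nucleosome occupancy and positions with lower depth for nucleosome depleted regions.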
[0265] In some embodiments, partitioning is on the basis of one or more characteristics such as methylation. Molecules can be sorted according to other characteristics, such as sequence length, nucleosome binding, sequence mismatch, immunoprecipitation, and / or proteins that bind to DNA, using appropriate techniques as part of data analysis or partitioning as applicable. Resulting partitions can include one or more of the following nucleic acid forms: single-stranded DNA (ssDNA), double-stranded DNA (dsDNA), shorter DNA fragments and longer DNA fragments. In some embodiments, partitioning based on a cytosine modification (e.g., cytosine methylation) or methylation generally is performed and is optionally combined with at least one additional partitioning step, which may be based on any of the foregoing characteristics or forms of DNA. In some embodiments, a heterogeneous population of nucleic acids is partitioned into nucleic acids with one or more epigenetic modifications and without the one or more epigenetic modifications. Examples of epigenetic modifications include presence or absence of methylation; level of methylation; type of methylation (e.g., 5-methylcytosine versus other types of methylation, such as adenine methylation and / or cytosine hydroxymethylation); and association and level of association with one or more proteins, such as histones. Alternatively, or additionally, a heterogeneous population of nucleic acids can be partitioned into nucleic acid molecules associated with nucleosomes and nucleic acid molecules devoid of nucleosomes. Alternatively, or additionally, a heterogeneous population of nucleic acids may be partitioned into single-stranded DNA (ssDNA) and double-stranded DNA (dsDNA). Alternatively, or additionally, a heterogeneous population of nucleic acids may be partitioned based on nucleic acid length (e.g., molecules of up to 160 bp and molecules having a length of greater than 160 bp).
[0266] The agents used to partition populations of nucleic acids within a sample can be affinity agents, such as antibodies with the desired specificity, natural binding partners or variants thereof (Bock et al., Nat Biotech 28: 1106-1114 (2010); Song et al., Nat Biotech 29: 68-72 (2011)), or artificial peptides selected e.g., by phage display to have specificity to a given target. In some embodiments, the agent used in the partitioning is an agent that recognizes a modified nucleobase. In some embodiments, the modified nucleobase recognized by the agent is a modified cytosine, such as a methylcytosine (e.g., 5-methylcytosine). In some embodiments, the modified nucleobase recognized by the agent is a product of a procedure that affects the first nucleobase in the DNA differently from the second nucleobase in the DNA of the sample. In some embodiments, the modified nucleobase may be a “converted nucleobase,” meaning that its base pairing specificity was changed by the procedure. For example, certain procedures convert unmethylated or unmodified cytosine to dihydrouracil, or more generally, at least one modified or unmodified form of cytosine undergoes deamination, resulting in uracil (considered a modified nucleobase in the context of DNA) or a further modified form of uracil. Examples of partitioning agents include antibodies, such as antibodies that recognize a modified nucleobase, which may be a modified cytosine, such as a methylcytosine (e.g., 5-methylcytosine). In some embodiments, the partitioning agent is an antibody that recognizes a modified cytosine other than 5-methylcytosine, such as 5-carboxylcytosine (5caC). Alternative partitioning agents include methyl binding domains (MBDs) and methyl binding proteins (MBPs) as described herein, including proteins such as MeCP2.
[0267] Additional, non-limiting examples of partitioning agents are histone binding proteins which can separate nucleic acids bound to histones from free or unbound nucleic acids. Examples of histone binding proteins that can be used in the methods disclosed herein include RBBP4, RbAp48 and SANT domain peptides.
[0268] The binding of partitioning agents to particular nucleic acids and the partitioning of the nucleic acids into subsamples may occur to a certain extent or may occur in an essentially binary manner. In some instances, nucleic acids comprising a greater proportion of a certain modification bind to the agent at a greater extent than nucleic acids comprising a lesser proportion of the modification. Similarly, the partitioning may produce subsamples comprising greater and lesser proportions of nucleic acids comprising a certain modification. Alternatively, the partitioning may produce subsamples comprising essentially all or none of the nucleic acids comprising the modification. In all instances, various levels of modifications may be sequentially eluted from the partitioning agent.
[0269] In some embodiments, partitioning can comprise both binary partitioning and partitioning based on degree / level of modifications. For example, methylated fragments can be partitioned by methylated DNA immunoprecipitation (MeDIP), or all methylated fragments can be partitioned from unmethylated fragments using methyl binding domain proteins (e.g., MethylMiner Methylated DNA Enrichment Kit (ThermoFisher Scientific)). Subsequently, additional partitioning may involve eluting fragments having different levels of methylation by adjusting the salt concentration in a solution with the methyl binding domain and bound fragments. As salt concentration increases, fragments having greater methylation levels are eluted.
[0270] In some instances, the final partitions are enriched in nucleic acids having different extents of modifications (overrepresentative or underrepresentative of modifications). Overrepresentation and underrepresentation can be defined by the number of modifications borne by a nucleic acid relative to the median number of modifications per strand in a population. For example, if the median number of 5-methylcytosine residues in nucleic acid in a sample is 2, a nucleic acid including more than two 5-methylcytosine residues is overrepresented in this modification and a nucleic acid with 1 or zero 5-methylcytosine residues is underrepresented. The effect of the affinity separation is to enrich for nucleic acids overrepresented in a modification in a bound phase and for nucleic acids underrepresented in a modification in an unbound phase (i.e., in solution). The nucleic acids in the bound phase can be eluted before subsequent processing.
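The median-based definition above lends itself to a small worked example. The sketch below labels each molecule by comparing its count of 5-methylcytosine residues to the population median; the function name and the "at_median" label for molecules exactly at the median are assumptions added for illustration, since the text only defines the over- and underrepresented cases.

```python
from statistics import median


def classify_by_median(mod_counts):
    """Label each per-molecule modification count relative to the population median."""
    med = median(mod_counts)
    labels = []
    for n in mod_counts:
        if n > med:
            labels.append("overrepresented")
        elif n < med:
            labels.append("underrepresented")
        else:
            # Not defined in the text; labeled separately here as an assumption.
            labels.append("at_median")
    return labels
```

With counts of 0, 1, 2, 2, 3 and 5 residues, the median is 2, so the molecules with 3 and 5 residues are overrepresented and those with 0 and 1 are underrepresented, matching the example in the paragraph above.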
[0271] When using MeDIP or MethylMiner® Methylated DNA Enrichment Kit (ThermoFisher Scientific), various levels of methylation can be partitioned using sequential elutions. For example, a hypomethylated partition (no methylation) can be separated from a methylated partition by contacting the nucleic acid population with the MBD from the kit, which is attached to magnetic beads. The beads are used to separate out the methylated nucleic acids from the non-methylated nucleic acids. Subsequently, one or more elution steps are performed sequentially to elute nucleic acids having different levels of methylation. For example, a first set of methylated nucleic acids can be eluted at a salt concentration of 160 mM or higher, e.g., at least 150 mM, at least 200 mM, 300 mM, 400 mM, 500 mM, 600 mM, 700 mM, 800 mM, 900 mM, 1000 mM, or 2000 mM. After such methylated nucleic acids are eluted, magnetic separation is once again used to separate higher level of methylated nucleic acids from those with lower level of methylation. The elution and magnetic separation steps can be repeated to create various partitions such as a hypomethylated partition (enriched in nucleic acids comprising no methylation), a methylated partition (enriched in nucleic acids comprising low levels of methylation), and a hypermethylated partition (enriched in nucleic acids comprising high levels of methylation).
[0272] In some methods, nucleic acids bound to an agent used for affinity separation-based partitioning are subjected to a wash step. The wash step washes off nucleic acids weakly bound to the affinity agent. Such nucleic acids can be enriched in nucleic acids having the modification to an extent close to the mean or median (i.e., intermediate between nucleic acids remaining bound to the solid phase and nucleic acids not binding to the solid phase on initial contacting of the sample with the agent).
[0273] The affinity separation results in at least two, and sometimes three or more partitions of nucleic acids with different extents of a modification. While the partitions are still separate, the nucleic acids of at least one partition, and usually two or three (or more) partitions are linked to nucleic acid tags, usually provided as components of adapters, with the nucleic acids in different partitions receiving different tags that distinguish members of one partition from another. The tags linked to nucleic acid molecules of the same partition can be the same or different from one another. But if different from one another, the tags may have part of their code in common so as to identify the molecules to which they are attached as being of a particular partition.
[0274] For further details regarding partitioning nucleic acid samples based on characteristics such as methylation, see WO2018 / 119452, which is incorporated herein by reference.
[0275] In some embodiments, the nucleic acid molecules can be fractionated into different partitions based on the nucleic acid molecules that are bound to a specific protein or a fragment thereof and those that are not bound to that specific protein or fragment thereof.
[0276] Nucleic acid molecules can be fractionated based on DNA-protein binding. Protein-DNA complexes can be fractionated based on a specific property of a protein. Examples of such properties include various epitopes, modifications (e.g., histone methylation or acetylation) or enzymatic activity. Examples of proteins which may bind to DNA and serve as a basis for fractionation may include, but are not limited to, protein A and protein G. Any suitable method can be used to fractionate the nucleic acid molecules based on protein bound regions. Examples of methods used to fractionate nucleic acid molecules based on protein bound regions include, but are not limited to, SDS-PAGE, chromatin-immuno-precipitation (ChIP), heparin chromatography, and asymmetrical field flow fractionation (AF4).
[0277] In some embodiments, the partitioning of the sample into a plurality of subsamples is performed by contacting the nucleic acids with an antibody that recognizes a modified nucleobase in the DNA, which may be a modified cytosine or a product of the procedure that affects the first nucleobase in the DNA differently from the second nucleobase in the DNA of the sample. In some embodiments, the modified nucleobase is 5mC. In some embodiments, the modified nucleobase is 5caC. In some embodiments, the modified nucleobase is dihydrouracil (DHU). In some embodiments, the antibody that recognizes a modified nucleobase in the DNA is used to partition single-stranded DNA.
[0278] In some embodiments, the partitioning is performed by contacting the nucleic acids with a methyl binding domain (“MBD”) of a methyl binding protein (“MBP”). In some such embodiments, the nucleic acids are contacted with an entire MBP. In some embodiments, an MBD binds to 5-methylcytosine (5mC), and an MBP comprises an MBD and is referred to interchangeably herein as a methyl binding protein or a methyl binding domain protein. In some embodiments, an MBD binds to 5mC and 5hmC. In some embodiments, MBD is coupled to paramagnetic beads, such as Dynabeads® M-280 Streptavidin via a biotin linker. Partitioning into fractions with different extents of methylation can be performed by eluting fractions by increasing the NaCl concentration.
[0279] In some embodiments, bound DNA is eluted by contacting the antibody or MBD with a protease, such as proteinase K. This may be performed instead of or in addition to elution steps using NaCl as discussed above.
[0280] Examples of agents that recognize a modified nucleobase contemplated herein include, but are not limited to:
[0281] (a) MeCP2 is a protein that preferentially binds to 5-methyl-cytosine over unmodified cytosine.
[0282] (b) RPL26, PRP8 and the DNA mismatch repair protein MSH6 preferentially bind to 5-hydroxymethyl-cytosine over unmodified cytosine.
[0283] (c) FOXK1, FOXK2, FOXP1, FOXP4 and FOXI3 preferably bind to 5-formyl cytosine over unmodified cytosine (Iurlaro et al., Genome Biol. 14: R119 (2013)).
[0284] (d) Antibodies specific to one or more methylated or modified nucleobases or conversion products thereof, such as 5mC, 5caC, or DHU.
[0285] In general, elution is a function of the number of modifications, such as the number of methylated sites per molecule, with molecules having more methylation eluting under increased salt concentrations. To elute the DNA into distinct populations based on the extent of methylation, one can use a series of elution buffers of increasing NaCl concentration. Salt concentration can range from about 100 mM to about 2500 mM NaCl. In one embodiment, the process results in three (3) partitions. Molecules are contacted with a solution at a first salt concentration and comprising a molecule comprising an agent that recognizes a modified nucleobase, which molecule can be attached to a capture moiety, such as streptavidin. At the first salt concentration a population of molecules will bind to the agent and a population will remain unbound. The unbound population can be separated as a “hypomethylated” population. For example, a first partition enriched in hypomethylated form of DNA is that which remains unbound at a low salt concentration, e.g., 100 mM or 160 mM. A second partition enriched in intermediate methylated DNA is eluted using an intermediate salt concentration, e.g., between 100 mM and 2000 mM concentration. This is also separated from the sample. A third partition enriched in hypermethylated form of DNA is eluted using a high salt concentration, e.g., at least about 2000 mM.
[0286] In some embodiments, a monoclonal antibody raised against 5-methylcytidine (5mC) is used to purify methylated DNA. DNA is denatured, e.g., at 95°C in order to yield single-stranded DNA fragments. Protein G coupled to standard or magnetic beads as well as washes following incubation with the anti-5mC antibody are used to immunoprecipitate DNA bound to the antibody. Such DNA may then be eluted. Partitions may comprise unprecipitated DNA and one or more partitions eluted from the beads.
[0287] In some embodiments, sample DNA (e.g., between 5 and 200 ng) is mixed with methyl binding domain (MBD) buffer and magnetic beads conjugated with MBD proteins and incubated overnight. Methylated DNA (hypermethylated DNA) binds the MBD protein on the magnetic beads during this incubation. Non-methylated (hypomethylated DNA) or less methylated DNA (intermediately methylated) is washed away from the beads with buffers containing increasing concentrations of salt. For example, one, two, or more fractions containing non-methylated, hypomethylated, and / or intermediately methylated DNA may be obtained from such washes. Finally, a high salt buffer is used to elute the heavily methylated DNA (hypermethylated DNA) from the MBD protein. In some embodiments, these washes result in three partitions (hypomethylated partition, intermediately methylated fraction and hypermethylated partition) of DNA having increasing levels of methylation.
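The three-partition salt scheme can be summarized as a simple decision rule. The sketch below is a toy model, not a description of the assay chemistry: it assumes each molecule can be characterized by a single hypothetical `release_mM` value (the NaCl concentration at which it is no longer bound), and it uses 160 mM and 2000 mM from the examples above as default cut-offs.

```python
def assign_partition(release_mM, low_mM=160.0, high_mM=2000.0):
    """Assign a molecule to one of three partitions by its release concentration.

    release_mM is a hypothetical per-molecule value: the NaCl concentration (mM)
    at which the molecule is unbound or elutes from the binding agent.
    """
    if low_mM >= high_mM:
        raise ValueError("low_mM must be below high_mM")
    if release_mM <= low_mM:
        # Remains unbound at the first, low salt concentration.
        return "hypomethylated"
    if release_mM < high_mM:
        # Elutes at an intermediate salt concentration.
        return "intermediate"
    # Elutes only at the high salt concentration.
    return "hypermethylated"
```

Adding further cut-offs between the low and high values would model additional sequential elution steps in the same way.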
[0288] In some embodiments, partitioning procedures may result in imperfect sorting of DNA molecules among the subsamples. For example, a minority of the molecules in an unmethylated or hypomethylated subsample may be highly modified (e.g., hypermethylated), and / or a minority of the molecules in a hypermethylated subsample may be unmodified or mostly unmodified (e.g., unmethylated or mostly unmethylated). Such molecules are considered nonspecifically partitioned.
[0289] In some embodiments, nonspecifically partitioned molecules are removed using a methylation-dependent nuclease, e.g., a methylation dependent restriction enzyme (MDRE), digesting / cleaving the DNA where the restriction enzyme (RE) recognition site contains a methylated nucleotide but not cleaving the DNA where the restriction enzyme (RE) recognition site contains an unmethylated nucleotide. In some embodiments, nonspecifically partitioned molecules are removed using a methylation sensitive nuclease, e.g., a methylation sensitive restriction enzyme (MSRE), digesting / cleaving the DNA where the restriction enzyme (RE) recognition site contains an unmethylated nucleotide but not cleaving the DNA where the restriction enzyme (RE) recognition site contains a methylated nucleotide. For example, in some embodiments, a hypomethylated subsample is contacted with a methylation-dependent nuclease, such as a methylation-dependent restriction enzyme, thereby degrading nonspecifically partitioned DNA, e.g., methylated DNA, in the subsample. Alternatively, or in addition, a hypermethylated subsample is contacted with a methylation-sensitive endonuclease, such as a methylation-sensitive restriction enzyme, thereby degrading nonspecifically partitioned DNA in the subsample.
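The complementary behavior of the two nuclease classes can be expressed as a small truth table in code. The sketch below is a simplified model under stated assumptions: each molecule is represented by a hypothetical list of booleans giving the methylation state of its recognition sites, and a molecule is treated as degraded if any one site is cleaved. The names `is_cleaved` and `surviving_molecules` are illustrative only.

```python
def is_cleaved(nuclease, site_methylated):
    """Return True if the nuclease class cuts a recognition site in this state.

    "MDRE": methylation-dependent, cuts methylated sites only.
    "MSRE": methylation-sensitive, cuts unmethylated sites only.
    """
    if nuclease == "MDRE":
        return site_methylated
    if nuclease == "MSRE":
        return not site_methylated
    raise ValueError(f"unknown nuclease class: {nuclease}")


def surviving_molecules(molecules, partition):
    """Return names of molecules left intact after digesting a subsample.

    molecules: list of (name, [site_methylated, ...]) pairs (assumed format).
    A hypomethylated subsample is treated with an MDRE and a hypermethylated
    subsample with an MSRE, as in the paragraph above.
    """
    if partition == "hypomethylated":
        nuclease = "MDRE"
    elif partition == "hypermethylated":
        nuclease = "MSRE"
    else:
        raise ValueError(f"unknown partition: {partition}")
    return [
        name
        for name, sites in molecules
        if not any(is_cleaved(nuclease, s) for s in sites)
    ]
```

In this model, a methylated molecule that ended up in the hypomethylated subsample is removed by the MDRE, while correctly partitioned unmethylated molecules pass through untouched.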
[0290] Degradation of nonspecifically partitioned DNA in one or more partitioned subsamples may improve the performance of methods that rely on accurate partitioning of DNA on the basis of a cytosine modification. For example, such degradation may provide improved sensitivity and / or simplify downstream analyses. In some embodiments, partitioning DNA on the basis of a modification, such as methylation, then removing nonspecifically partitioned DNA using MDREs and / or MSREs as described herein provides improved efficiency and / or cost over DNA analysis methods comprising procedures that affect a first nucleobase differently from a second nucleobase, such as bisulfite sequencing or bisulfite conversion.
[0291] In some embodiments, one or more nucleases are used to degrade nonspecifically partitioned DNA molecules. In some embodiments, a subsample is contacted with a plurality of nucleases. The subsample may be contacted with the nucleases sequentially or simultaneously. Simultaneous use of nucleases may be advantageous when the nucleases are active under similar conditions (e.g., buffer composition) to avoid unnecessary sample manipulation. Contacting a subsample with more than one methylation-dependent restriction enzyme can more completely degrade nonspecifically partitioned hypermethylated DNA. Contacting a subsample with more than one methylation-sensitive restriction enzyme can more completely degrade nonspecifically partitioned hypomethylated and / or unmethylated DNA.
[0292] In some embodiments, a methylation-dependent nuclease comprises one or more of MspJI, LpnPI, FspEI, or McrBC. In some embodiments, at least two methylation-dependent nucleases are used. In some embodiments, at least three methylation-dependent nucleases are used.
[0293] In some embodiments, a methylation-sensitive nuclease comprises one or more of AatII, AccII, AciI, Aor13HI, Aor15HI, BspT104I, BssHII, BstUI, CfrWI, ClaI, CpoI, Eco52I, HaeII, HapII, HhaI, Hin6I, HpaII, HpyCH4IV, MluI, MspI, NaeI, NotI, NruI, NsbI, PmaCI, Psp1406I, PvuI, SacII, SalI, SmaI, and SnaBI. In some embodiments, at least two methylation-sensitive nucleases are used. In some embodiments, at least three methylation-sensitive nucleases are used. In some embodiments, the methylation-sensitive nucleases comprise BstUI and HpaII. In some embodiments, the two methylation-sensitive nucleases comprise HhaI and AccII. In some embodiments, the methylation-sensitive nucleases comprise BstUI, HpaII and Hin6I.
[0294] In some embodiments, the partitions of DNA are desalted and concentrated in preparation for enzymatic steps of library preparation.
C. Adapter Ligation
[0295] In some embodiments, adapters are added to the DNA. This may be done concurrently with an amplification procedure, e.g., by providing the adapters in a 5’ portion of a primer (where PCR is used, this can be referred to as library prep-PCR or LP-PCR). In some embodiments, adapters are added by other approaches, such as ligation. In some such methods, prior to partitioning or prior to capturing, first adapters are added to the nucleic acids by ligation to the 3’ ends thereof, which may include ligation to single-stranded DNA. The adapter can be used as a priming site for second-strand synthesis, e.g., using a universal primer and a DNA polymerase. A second adapter can then be ligated to at least the 3’ end of the second strand of the now double-stranded molecule. In some embodiments, the first adapter comprises an affinity tag, such as biotin, and nucleic acid ligated to the first adapter is bound to a solid support (e.g., bead), which may comprise a binding partner for the affinity tag such as streptavidin. For further discussion of a related procedure, see Gansauge et al., Nature Protocols 8:737-748 (2013). Commercial kits for sequencing library preparation compatible with single-stranded nucleic acids are available, e.g., the Accel-NGS® Methyl-Seq DNA Library Kit from Swift Biosciences. In some embodiments, after adapter ligation, nucleic acids are amplified.
[0296] Preferably, the adapters include different tags of sufficient numbers that the number of combinations of tags results in a high probability, e.g., 95, 99 or 99.9%, that two nucleic acids with the same start and stop points do not receive the same combination of tags. Adapters, whether bearing the same or different tags, can include the same or different primer binding sites, but preferably adapters include the same primer binding site.
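The relationship between the number of available tags and the chance that co-located molecules share a tag combination can be sketched numerically. The calculation below assumes, for illustration only, that one tag is attached to each end of a molecule, that tags are drawn independently and uniformly, and that the order of the two ends matters, giving `num_tags_per_end ** 2` combinations; none of these modeling choices is stated in the text.

```python
def prob_all_distinct(num_tags_per_end, num_molecules):
    """Probability that molecules sharing start/stop points all get distinct tag pairs.

    Assumes uniform, independent tag assignment at both ends (an assumption).
    """
    combos = num_tags_per_end ** 2
    p = 1.0
    for i in range(num_molecules):
        # The i-th molecule must avoid the i combinations already taken.
        p *= max(0.0, 1 - i / combos)
    return p
```

Under these assumptions, two molecules with the same start and stop points and 10 tags per end receive distinct combinations with probability 1 - 1/100, and the probability rises as more tags are made available.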
[0297] In some embodiments, following attachment of adapters, the nucleic acids are subject to amplification. The amplification can use, e.g., universal primers that recognize primer binding sites in the adapters.
[0298] In some embodiments, following attachment of adapters, the DNA is partitioned, comprising contacting the DNA with an agent that preferentially binds to nucleic acids bearing an epigenetic modification. The nucleic acids are partitioned into at least two subsamples differing in the extent to which the nucleic acids bear the modification from binding to the agents. For example, if the agent has affinity for nucleic acids bearing the modification, nucleic acids overrepresented in the modification (compared with median representation in the population) preferentially bind to the agent, whereas nucleic acids underrepresented for the modification do not bind or are more easily eluted from the agent. The nucleic acids can then be amplified from primers binding to the primer binding sites within the adapters. Partitioning may be performed instead before adapter attachment, in which case the adapters may comprise differential tags that include a component that identifies which partition a molecule occurred in.
[0214] In some embodiments, the nucleic acids are linked at both ends to Y-shaped adapters including primer binding sites and tags. The molecules are amplified.D. Tagging
[0299] “Tagging” DNA molecules is a procedure in which a tag is attached to or associated with the DNA molecules. Tags can be molecules, such as nucleic acids, containing information that indicates a feature of the molecule with which the tag is associated. For example, molecules can bear a sample tag (which distinguishes molecules in one sample from those in a different sample) or a molecular tag / molecular barcode / barcode (which distinguishes different molecules from one another (in both unique and non-unique tagging scenarios). For methods that involve a partitioning step, a partition tag (which distinguishes molecules in one partition from those in a different partition) may be included. In some embodiments, adapters added to DNA molecules comprise tags. In certain embodiments, a tag can comprise one or a combination of barcodes. As used herein, the term “barcode” refers to a nucleic acid molecule having a particular nucleotideAttorney Ref. No.: GH0230WOsequence, or to the nucleotide sequence, itself, depending on context. A barcode can have, for example, between 10 and 100 nucleotides. A collection of barcodes can have degenerate sequences or can have sequences having a certain hamming distance, as desired for the specific purpose. So, for example, a molecular barcode can be comprised of one barcode or a combination of two barcodes, each attached to different ends of a molecule. Additionally, or alternatively, for different partitions and / or samples, different sets of molecular barcodes, or molecular tags can be used such that the barcodes serve as a molecular tag through their individual sequences and also serve to identify the partition and / or sample to which they correspond based the set of which they are a member.
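Whether a collection of barcodes meets a given Hamming distance can be checked directly. The sketch below computes the Hamming distance between two equal-length barcodes and the minimum pairwise distance across a set; the function names and the example sequences in the test are illustrative and are not barcodes from the disclosure.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(barcodes):
    """Smallest Hamming distance between any two barcodes in the collection."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))
```

A barcode set whose minimum pairwise distance is at least 3, for instance, would allow a single sequencing error in a barcode to be corrected to its nearest member, which is one reason a design might impose such a constraint.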
[0300] In some embodiments, two or more partitions, e.g., each partition, is / are differentially tagged. Tags can be used to label the individual polynucleotide population partitions so as to correlate the tag (or tags) with a specific partition. Alternatively, tags can be used in embodiments that do not employ a partitioning step. In some embodiments, a single tag can be used to label a specific partition. In some embodiments, multiple different tags can be used to label a specific partition. In embodiments employing multiple different tags to label a specific partition, the set of tags used to label one partition can be readily differentiated for the set of tags used to label other partitions. In some embodiments, the tags may have additional functions, for example the tags can be used to index sample sources or used as unique molecular identifiers (which can be used to improve the quality of sequencing data by differentiating sequencing errors from mutations, for example as in Kinde et al., Proc Nat’l Acad Sci USA 108: 9530-9535 (2011), Kou et al., PLoS ONE, 11: e0146638 (2016)) or used as non-unique molecule identifiers, for example as described in US Pat. No. 9,598,731. Similarly, in some embodiments, the tags may have additional functions, for example the tags can be used to index sample sources or used as non-unique molecular identifiers (which can be used to improve the quality of sequencing data by differentiating sequencing errors from mutations).
[0301] In some embodiments, partition tagging comprises tagging molecules in each partition with a partition tag. After re-combining partitions (e.g., to reduce the number of sequencing runs needed and avoid unnecessary cost) and sequencing molecules, the partition tags identify the source partition. In some embodiments, the partition tags can serve as identifiers of the source partition and the molecule, i.e., different partitions are tagged with different sets of molecular tags, e.g., comprised of a pair of barcodes. In this way, the one or more molecular barcodes attached to the molecule indicate the source partition as well as being useful to distinguish molecules within a partition. For example, a first set of 35 barcodes can be used to tag molecules in a first partition, while a second set of 35 barcodes can be used to tag molecules in a second partition.
[0302] In some embodiments, after partitioning and tagging with partition tags, the molecules may be pooled for sequencing in a single run. In some embodiments, a sample tag is added to the molecules, e.g., in a step subsequent to addition of partition tags and pooling. Sample tags can facilitate pooling material generated from multiple samples for sequencing in a single sequencing run.
[0303] Alternatively, in some embodiments, partition tags may be correlated to the sample as well as the partition. As a simple example, a first tag can indicate a first partition of a first sample; a second tag can indicate a second partition of the first sample; a third tag can indicate a first partition of a second sample; and a fourth tag can indicate a second partition of the second sample.
[0304] While tags may be attached to molecules already partitioned based on one or more characteristics, the final tagged molecules in the library may no longer possess that characteristic. For example, while single stranded DNA molecules may be partitioned and tagged, the final tagged molecules in the library are likely to be double stranded. Similarly, while DNA may be subject to partition based on different levels of methylation, in the final library, tagged molecules derived from these molecules are likely to be unmethylated. Accordingly, the tag attached to a molecule in the library typically indicates the characteristic of the “parent molecule” from which the ultimate tagged molecule is derived, not necessarily the characteristic of the tagged molecule itself.
[0305] As an example, barcodes 1, 2, 3, 4, etc. are used to tag and label molecules in the first partition; barcodes A, B, C, D, etc. are used to tag and label molecules in the second partition; and barcodes a, b, c, d, etc. are used to tag and label molecules in the third partition. Differentially tagged partitions can be pooled prior to sequencing. Differentially tagged partitions can be separately sequenced or sequenced together concurrently, e.g., in the same flow cell of an Illumina sequencer.
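The use of differentially tagged partitions to sort pooled reads can be sketched as a simple lookup. In the Python example below, the barcode labels are the placeholder labels from the example above (real barcodes would be nucleotide sequences); the data layout, the function name, and the "unassigned" bin are assumptions made for illustration.

```python
# Placeholder barcode labels per partition, following the example above.
PARTITION_BARCODES = {
    "partition_1": {"1", "2", "3", "4"},
    "partition_2": {"A", "B", "C", "D"},
    "partition_3": {"a", "b", "c", "d"},
}

# Invert to a barcode -> partition lookup (the sets are disjoint by design).
BARCODE_TO_PARTITION = {
    barcode: partition
    for partition, barcodes in PARTITION_BARCODES.items()
    for barcode in barcodes
}


def sort_reads_by_partition(reads):
    """Group pooled (barcode, sequence) reads by their source partition."""
    sorted_reads = {partition: [] for partition in PARTITION_BARCODES}
    sorted_reads["unassigned"] = []  # barcodes not in any known set
    for barcode, sequence in reads:
        partition = BARCODE_TO_PARTITION.get(barcode, "unassigned")
        sorted_reads[partition].append((barcode, sequence))
    return sorted_reads


pooled = [("1", "ACGT"), ("B", "TTGA"), ("c", "GGCA"), ("Z", "CCCC")]
by_partition = sort_reads_by_partition(pooled)
print({name: len(reads) for name, reads in by_partition.items()})
```

Because each barcode belongs to exactly one set, the same barcode can distinguish a molecule within its partition and identify the partition it came from.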
[0306] After sequencing, analysis of reads can be performed on a partition-by-partition level, as well as a whole DNA population level. Tags are used to sort reads from different partitions. Analysis can include in silico analysis to determine genetic and epigenetic variation (one or more of methylation, chromatin structure, etc.) using sequence information, genomic coordinates, length, coverage, and / or copy number. In some embodiments, higher coverage can correlate with higher nucleosome occupancy in a genomic region while lower coverage can correlate with lower nucleosome occupancy or a nucleosome depleted region (NDR).
E. Enriching / Capturing step; Amplification
[0307] Methods disclosed herein can comprise capturing DNA, such as cfDNA target regions. In some embodiments, the capturing comprises contacting the DNA with probes (e.g., oligonucleotides) specific for the target regions. Enrichment or capture may be performed on any sample or subsample described herein using any suitable approach known in the art.
[0308] In some embodiments, enrichment or capture is performed after attachment of adapters to sample molecules. In some embodiments, enrichment or capture is performed after a partitioning step. In some embodiments, enrichment or capture is performed after an amplification step. In some embodiments, sample molecules are partitioned, then adapters are attached, then sample molecules are amplified, and then the amplified molecules are subjected to enrichment or capture. The enriched or captured molecules may then be subjected to another amplification and then sequenced.
[0309] In some embodiments, the probes specific for the target regions comprise a capture moiety that facilitates the enrichment or capture of the DNA hybridized to the probes. In some embodiments, the capture moiety is biotin. In some such embodiments, streptavidin attached to a solid support, such as magnetic beads, is used to bind to the biotin. Nonspecifically bound DNA that does not comprise a target region is washed away from the captured DNA. In some embodiments, DNA is then dissociated from the probes and eluted from the solid support using salt washes or buffers comprising another DNA denaturing agent. In some embodiments, the probes are also eluted from the solid support by, e.g., disrupting the biotin-streptavidin interaction. In some embodiments, captured DNA is amplified following elution from the solid support. In some such embodiments, DNA comprising adapters is amplified using PCR primers that anneal to the adapters. In some embodiments, captured DNA is amplified while attached to the solid support. In some such embodiments, the amplification comprises use of a PCR primer that anneals to a sequence within an adapter and a PCR primer that anneals to a sequence within a probe annealed to the target region of the DNA.
[0310] In some embodiments, the methods herein comprise enriching for or capturing DNA comprising epigenetic and / or sequence-variable target regions. Such regions may be captured from an aliquot of a sample (e.g., a sample that has undergone attachment of adapters and amplification), while the step of partitioning the DNA with an agent that recognizes a modified cytosine, such as methyl cytosine, is performed on a separate aliquot of the sample. Enriching for or capturing DNA comprising epigenetic and / or sequence-variable target regions may comprise contacting the DNA with a first or second set of target-specific probes. Such target-specific probes may have any of the features described herein for sets of target-specific probes, including but not limited to those in the embodiments set forth above and in the sections relating to probes below. Capturing may be performed on one or more subsamples prepared during methods disclosed herein. In some embodiments, DNA is captured from the first subsample or the second subsample, e.g., the first subsample and the second subsample. In some embodiments, the subsamples are differentially tagged (e.g., as described herein) and then pooled before undergoing capture. Exemplary methods for capturing DNA comprising epigenetic and / or sequence-variable target regions can be found in, e.g., WO 2020 / 160414, which is hereby incorporated by reference.
[0311] The capturing step may be performed using conditions suitable for specific nucleic acid hybridization, which generally depend to some extent on features of the probes such as length, base composition, etc. Those skilled in the art will be familiar with appropriate conditions given general knowledge in the art regarding nucleic acid hybridization. In some embodiments, complexes of target-specific probes and DNA are formed.
[0312] In some embodiments, methods described herein comprise capturing a plurality of sets of target regions of cfDNA obtained from a subject. The target regions may comprise differences depending on whether they originated from a tumor or from healthy cells or from a certain cell type. The capturing step produces a captured set of cfDNA molecules. In some embodiments, cfDNA molecules corresponding to a sequence-variable target region set are captured at a greater capture yield in the captured set of cfDNA molecules than cfDNA molecules corresponding to an epigenetic target region set. In some embodiments, a method described herein comprises contacting cfDNA obtained from a subject with a set of target-specific probes, wherein the set of target-specific probes is configured to capture cfDNA corresponding to the sequence-variable target region set at a greater capture yield than cfDNA corresponding to the epigenetic target region set. For additional discussion of capturing steps, capture yields, and related aspects, see WO 2020 / 160414, which is incorporated herein by reference for all purposes.
[0313] It can be beneficial to capture cfDNA corresponding to the sequence-variable target region set at a greater capture yield than cfDNA corresponding to the epigenetic target region set because a greater depth of sequencing may be necessary to analyze the sequence-variable target regions with sufficient confidence or accuracy than may be necessary to analyze the epigenetic target regions. The volume of data needed to determine fragmentation patterns (e.g., to test for perturbation of transcription start sites or CTCF binding sites) or fragment abundance (e.g., in hypermethylated and hypomethylated partitions) is generally less than the volume of data needed to determine the presence or absence of cancer-related sequence mutations. Capturing the target region sets at different yields can facilitate sequencing the target regions to different depths of sequencing in the same sequencing run (e.g., using a pooled mixture and / or in the same sequencing cell).
[0314] In some embodiments, the DNA is amplified. In some embodiments, amplification is performed before the capturing step. In some embodiments, amplification is performed after the capturing step. In some embodiments, amplification is performed before and after the capturing step. In various embodiments, the methods further comprise sequencing the captured DNA, e.g., to different degrees of sequencing depth for the epigenetic and sequence-variable target region sets, consistent with the discussion herein.
[0315] In some embodiments, a capturing step is performed with probes for a sequence-variable target region set and probes for an epigenetic target region set in the same vessel at the same time, e.g., the probes for the sequence-variable and epigenetic target region sets are in the same composition. This approach provides a relatively streamlined workflow. In some embodiments, the concentration of the probes for the sequence-variable target region set is greater than the concentration of the probes for the epigenetic target region set.
[0316] Alternatively, a capturing step is performed with a sequence-variable target region probe set in a first vessel and with an epigenetic target region probe set in a second vessel, or a contacting step is performed with a sequence-variable target region probe set at a first time and a first vessel and an epigenetic target region probe set at a second time before or after the first time. This approach allows for preparation of separate first and second compositions comprising captured DNA corresponding to a sequence-variable target region set and captured DNA corresponding to an epigenetic target region set. The compositions can be processed separately as desired (e.g., to partition based on methylation as described herein) and pooled in appropriate proportions to provide material for further processing and analysis such as sequencing.
[0317] In some embodiments, adapters are included in the DNA as described herein. In some embodiments, tags, which may be or include barcodes, are included in the DNA. In some embodiments, such tags are included in adapters. Tags can facilitate identification of the origin of a nucleic acid. For example, barcodes can be used to allow the origin (e.g., subject) whence the DNA came to be identified following pooling of a plurality of samples for parallel sequencing. This may be done concurrently with an amplification procedure, e.g., by providing the barcodes in a 5’ portion of a primer, e.g., as described herein. In some embodiments, adapters and tags / barcodes are provided by the same primer or primer set. For example, the barcode may be located 3’ of the adapter and 5’ of the target-hybridizing portion of the primer. Alternatively, barcodes can be added by other approaches, such as ligation, optionally together with adapters in the same ligation substrate.
[0318] Additional details regarding amplification, tags, and barcodes are discussed herein, which can be combined to the extent practicable with any of these embodiments.
F. Procedures that affect a first nucleobase in the DNA differently from a second nucleobase in the DNA or methylation-sensitive conversion methods
[0319] In some embodiments, methods disclosed herein comprise a step of subjecting DNA, or a subsample thereof, to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA, wherein the first nucleobase is a modified or unmodified nucleobase, the second nucleobase is a modified or unmodified nucleobase different from the first nucleobase, and the first nucleobase and the second nucleobase have the same base pairing specificity. In some embodiments, the procedure chemically converts the first or second nucleobase such that the base pairing specificity of the converted nucleobase is altered. In some embodiments, DNA is subjected to a procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA before library preparation using the DNA, before a first amplification of the DNA, before dividing the DNA into a plurality of subsamples, or any combination thereof. In certain embodiments, the DNA is subjected to the procedure before or after contacting the DNA with a methylation-sensitive nuclease.
[0320] In some embodiments, the procedure that affects a first nucleobase of the DNA differently from a second nucleobase of the DNA is performed prior to the sequencing and / or (a) prior to or after the selectively depleting the target nucleic acid comprising the wild-type sequence, the target nucleic acid comprising the converted nucleotide, or the target nucleic acid that does not comprise the converted nucleotide; (b) prior to the amplifying the selectively digested population of target nucleic acids; (c) prior to or after the partitioning the population of target nucleic acids into a plurality of subsamples; and / or (d) prior to or after a step of enriching for one or more sets of target regions of DNA.
[0321] In some embodiments, if the first nucleobase is a modified or unmodified adenine, then the second nucleobase is a modified or unmodified adenine; if the first nucleobase is a modified or unmodified cytosine, then the second nucleobase is a modified or unmodified cytosine; if the first nucleobase is a modified or unmodified guanine, then the second nucleobase is a modified or unmodified guanine; and if the first nucleobase is a modified or unmodified thymine, then the second nucleobase is a modified or unmodified thymine (where modified and unmodified uracil are encompassed within modified thymine for the purpose of this step).
[0322] In some embodiments, the first nucleobase is a modified or unmodified cytosine, and the second nucleobase is a modified or unmodified cytosine. For example, the first nucleobase may comprise unmodified cytosine (C) and the second nucleobase may comprise one or more of 5-methylcytosine (mC) and 5-hydroxymethylcytosine (hmC). Alternatively, the second nucleobase may comprise C and the first nucleobase may comprise one or more of mC and hmC. Other combinations are also possible, such as where one of the first and second nucleobases comprises mC and the other comprises hmC.
[0323] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises bisulfite conversion. Treatment with bisulfite converts unmodified cytosine and certain modified cytosine nucleotides (e.g., 5-formylcytosine (fC) or 5-carboxylcytosine (caC)) to uracil whereas other modified cytosines (e.g., 5-methylcytosine, 5-hydroxymethylcytosine) are not converted. Thus, where bisulfite conversion is used, the first nucleobase comprises one or more of unmodified cytosine, 5-formylcytosine, 5-carboxylcytosine, or other cytosine forms affected by bisulfite, and the second nucleobase may comprise one or more of mC and hmC, such as mC and optionally hmC. Sequencing of bisulfite-treated DNA identifies positions that are read as cytosine as being mC or hmC positions. Meanwhile, positions that are read as T are identified as being T or a bisulfite-susceptible form of C, such as unmodified cytosine, 5-formylcytosine, or 5-carboxylcytosine. Performing bisulfite conversion, such as on a DNA sample as described herein, facilitates identifying positions containing mC or hmC using the sequence reads obtained from the sample. For an exemplary description of bisulfite conversion, see, e.g., Moss et al., Nat Commun. 2018; 9: 5068.
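The read-out logic for bisulfite-treated DNA described above can be sketched as a comparison of a read against its reference. The Python example below is a minimal illustration under simplifying assumptions that are not part of the disclosure: the read is gaplessly aligned to a same-length reference segment, both are on the same strand, and only positions where the reference base is C are examined. The function name and call labels are hypothetical.

```python
def call_bisulfite(reference: str, read: str):
    """Label each reference-C position according to its bisulfite read-out.

    Read as C -> not converted, so mC or hmC.
    Read as T -> converted, so a bisulfite-susceptible form of C
                 (e.g., unmodified C, fC, or caC).
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must be the same length")
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue  # only cytosine positions in the reference are informative
        if read_base == "C":
            calls.append((pos, "mC_or_hmC"))
        elif read_base == "T":
            calls.append((pos, "bisulfite_susceptible_C"))
        else:
            calls.append((pos, "other"))  # mismatch not explained by conversion
    return calls


example_calls = call_bisulfite("ACGTCCGA", "ACGTTTGA")
print(example_calls)
```

Restricting the analysis to reference-C positions is what lets a T read-out be attributed to conversion rather than to a genomic T.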
[0324] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises oxidative bisulfite (Ox-BS) conversion. This procedure first converts hmC to fC, which is bisulfite susceptible, followed by bisulfite conversion. Thus, when oxidative bisulfite conversion is used, the first nucleobase comprises one or more of unmodified cytosine, fC, caC, hmC, or other cytosine forms affected by bisulfite, and the second nucleobase comprises mC. Sequencing of Ox-BS converted DNA identifies positions that are read as cytosine as being mC positions. Meanwhile, positions that are read as T are identified as being T, hmC, or a bisulfite-susceptible form of C, such as unmodified cytosine, fC, or caC. Performing Ox-BS conversion, such as on a DNA sample as described herein, thus facilitates identifying positions containing mC using the sequence reads obtained from the sample. For an exemplary description of oxidative bisulfite conversion, see, e.g., Booth et al., Science 2012; 336: 934-937.
[0325] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises Tet-assisted bisulfite (TAB) conversion. In TAB conversion, hmC is protected from conversion and mC is oxidized in advance of bisulfite treatment, so that positions originally occupied by mC are converted to U while positions originally occupied by hmC remain as a protected form of cytosine. For example, as described in Yu et al., Cell 2012; 149: 1368-80, β-glucosyl transferase can be used to protect hmC (forming 5-glucosylhydroxymethylcytosine (ghmC)), then a TET protein such as mTet1 can be used to convert mC to caC, and then bisulfite treatment can be used to convert C and caC to U while ghmC remains unaffected.
[0326] Alternatively, a carbamoyltransferase enzyme, such as 5-hydroxymethylcytosine carbamoyltransferase as described in Yang et al., Bio-protocol, 2023; 12(17): e4496, can be used to protect hmC (by converting hmC to 5-carbamoyloxymethylcytosine (5cmC)), then a TET protein, such as mTet1 or a TET2 comprising a T1372S mutation, can be used to convert mC to caC, and then bisulfite treatment can be used to convert C and caC to U while 5cmC remains unaffected. Thus, when TAB conversion is used, the first nucleobase comprises one or more of unmodified cytosine, fC, caC, mC, or other cytosine forms affected by bisulfite, and the second nucleobase comprises hmC. Sequencing of TAB-converted DNA identifies positions that are read as cytosine as being hmC positions. Meanwhile, positions that are read as T are identified as being T, mC, or a bisulfite-susceptible form of C, such as unmodified cytosine, fC, or caC. Performing TAB conversion, such as on a DNA sample as described herein, thus facilitates identifying positions containing hmC using the sequence reads obtained from the sample.
[0327] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises Tet-assisted conversion with a substituted borane reducing agent, optionally wherein the substituted borane reducing agent is 2-picoline borane, borane pyridine, tert-butylamine borane, or ammonia borane. In Tet-assisted conversion with a substituted borane reducing agent, a TET protein is used to convert mC and hmC to caC, without affecting unmodified C. Then caC, and fC if present, are converted to dihydrouracil (DHU) by treatment with 2-picoline borane (pic-borane) or another substituted borane reducing agent such as borane pyridine, tert-butylamine borane, or ammonia borane, also without affecting unmodified C. See, e.g., Liu et al., Nature Biotechnology 2019; 37:424-429 (e.g., at Supplementary Fig. 1 and Supplementary Note 7). DHU is read as a T in sequencing. Thus, when this type of conversion is used, the first nucleobase comprises one or more of mC, fC, caC, or hmC, and the second nucleobase comprises unmodified cytosine. Sequencing of the converted DNA identifies positions that are read as cytosine as being unmodified C positions. Meanwhile, positions that are read as T are identified as being T, mC, fC, caC, or hmC. Performing TAP conversion, such as on a DNA sample as described herein, thus facilitates identifying positions containing unmodified C using the sequence reads obtained from the sample. This procedure encompasses Tet-assisted pyridine borane sequencing (TAPS), described in further detail in Liu et al. 2019, supra.
[0328] Alternatively, protection of hmC (e.g., using βGT or 5-hydroxymethylcytosine carbamoyltransferase) can be combined with Tet-assisted conversion with a substituted borane reducing agent, e.g., as described above. In this method (TAPSβ), 5hmC can be protected from conversion, for example through glucosylation using β-glucosyl transferase (βGT), forming 5-glucosylhydroxymethylcytosine (5ghmC), or through carbamoylation using 5-hydroxymethylcytosine carbamoyltransferase, forming 5cmC. This is described in Yu et al., Cell 2012; 149: 1368-80. Treatment with a TET protein, such as mTet1 or a TET2 comprising a T1372S mutation, then converts mC to caC but does not convert C, 5ghmC, or 5cmC. 5caC is then converted to DHU by treatment with pic-borane or another substituted borane reducing agent such as borane pyridine, tert-butylamine borane, or ammonia borane, also without affecting ghmC, 5cmC, or unmodified C. Thus, when Tet-assisted conversion with a substituted borane reducing agent is used, the first nucleobase comprises mC, and the second nucleobase comprises one or more of unmodified cytosine or hmC, such as unmodified cytosine and optionally hmC, fC, and / or caC. Sequencing of the converted DNA identifies positions that are read as cytosine as being either hmC or unmodified C positions. Meanwhile, positions that are read as T are identified as being T, fC, caC, or mC. Performing TAPSβ conversion, such as on a DNA sample as described herein, thus facilitates distinguishing positions containing unmodified C or hmC on the one hand from positions containing mC using the sequence reads obtained from the sample. For an exemplary description of this type of conversion, see, e.g., Liu et al., Nature Biotechnology 2019; 37:424-429. 5-hydroxymethylcytosine carbamoyltransferase is described in Yang et al., Bio-protocol, 2023; 12(17): e4496.
[0329] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises chemical-assisted conversion with a substituted borane reducing agent, optionally wherein the substituted borane reducing agent is 2-picoline borane, borane pyridine, tert-butylamine borane, or ammonia borane. In chemical-assisted conversion with a substituted borane reducing agent, an oxidizing agent such as potassium perruthenate (KRuO4) (also suitable for use in ox-BS conversion) is used to specifically oxidize hmC to fC. Treatment with pic-borane or another substituted borane reducing agent such as borane pyridine, tert-butylamine borane, or ammonia borane converts fC and caC to DHU but does not affect mC or unmodified C. Thus, when this type of conversion is used, the first nucleobase comprises one or more of hmC, fC, and caC, and the second nucleobase comprises one or more of unmodified cytosine or mC, such as unmodified cytosine and optionally mC. Sequencing of the converted DNA identifies positions that are read as cytosine as being either mC or unmodified C positions. Meanwhile, positions that are read as T are identified as being T, fC, caC, or hmC. Performing this type of conversion, such as on a DNA sample as described herein, thus facilitates distinguishing positions containing unmodified C or mC on the one hand from positions containing hmC using the sequence reads obtained from the sample. For an exemplary description of this type of conversion, see, e.g., Liu et al., Nature Biotechnology 2019; 37:424-429.
[0330] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises APOBEC-coupled epigenetic (ACE) conversion. In ACE conversion, an AID / APOBEC family DNA deaminase enzyme such as APOBEC3A (A3A) is used to deaminate unmodified cytosine and mC without deaminating hmC, fC, or caC. Thus, when ACE conversion is used, the first nucleobase comprises unmodified C and / or mC (e.g., unmodified C and optionally mC), and the second nucleobase comprises hmC. Sequencing of ACE-converted DNA identifies positions that are read as cytosine as being hmC, fC, or caC positions. Meanwhile, positions that are read as T are identified as being T, unmodified C, or mC. Performing ACE conversion on a DNA sample as described herein thus facilitates distinguishing positions containing hmC from positions containing mC or unmodified C using the sequence reads obtained from the sample. For an exemplary description of ACE conversion, see, e.g., Schutsky et al., Nature Biotechnology 2018; 36: 1083-1090.
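The read-outs stated above for bisulfite, oxidative bisulfite, TAB, the Tet-assisted and chemical-assisted borane conversions, and ACE conversion can be collected into a lookup table. The Python sketch below is an illustrative summary only: the dictionary keys, the function names, and the idea of intersecting results from two procedures run on the same position are assumptions introduced here, and the table applies only to positions known to be cytosine in the reference (a genomic T also reads as T).

```python
# How each cytosine form is read in sequencing after each conversion,
# transcribed from the paragraphs above. Keys are hypothetical labels.
READOUT = {
    "BS":                 {"C": "T", "mC": "C", "hmC": "C", "fC": "T", "caC": "T"},
    "oxBS":               {"C": "T", "mC": "C", "hmC": "T", "fC": "T", "caC": "T"},
    "TAB":                {"C": "T", "mC": "T", "hmC": "C", "fC": "T", "caC": "T"},
    "TAPS":               {"C": "C", "mC": "T", "hmC": "T", "fC": "T", "caC": "T"},
    "TAPS_hmC_protected": {"C": "C", "mC": "T", "hmC": "C", "fC": "T", "caC": "T"},
    "chem_borane":        {"C": "C", "mC": "C", "hmC": "T", "fC": "T", "caC": "T"},
    "ACE":                {"C": "T", "mC": "T", "hmC": "C", "fC": "C", "caC": "C"},
}


def candidate_forms(method: str, read_base: str) -> set[str]:
    """Cytosine forms consistent with a read base at a reference-C position."""
    return {form for form, base in READOUT[method].items() if base == read_base}


def candidate_forms_combined(observations: dict[str, str]) -> set[str]:
    """Intersect candidates across procedures applied to the same position."""
    if not observations:
        raise ValueError("at least one observation is required")
    candidate_sets = [candidate_forms(m, b) for m, b in observations.items()]
    return set.intersection(*candidate_sets)


print(sorted(candidate_forms("BS", "C")))
print(sorted(candidate_forms_combined({"BS": "C", "oxBS": "T"})))
```

The combined call illustrates why procedures with different read-outs can be complementary: a position read as C after bisulfite conversion but as T after oxidative bisulfite conversion is consistent only with hmC in this table.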
[0331] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises enzymatic conversion of the first nucleobase, e.g., as in EM-Seq. See, e.g., Vaisvila R, et al. (2019) EM-seq: Detection of DNA methylation at single base resolution from picograms of DNA. bioRxiv; DOI: 10.1101/2019.12.20.884692, available at www.biorxiv.org/content/10.1101/2019.12.20.884692v1. For example, TET2 and T4-βGT or 5-hydroxymethylcytosine carbamoyltransferase (described in Yang et al., Bio-protocol, 2023; 12(17): e4496) can be used to convert 5mC and 5hmC into substrates that cannot be deaminated by a deaminase (e.g., APOBEC3A), and then a deaminase (e.g., APOBEC3A) can be used to deaminate unmodified cytosines, converting them to uracils.
[0332] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises enzymatic conversion of the first nucleobase using a non-specific, modification-sensitive double-stranded DNA deaminase, e.g., as in SEM-seq. See, e.g., Vaisvila et al. (2023) Discovery of novel DNA cytosine deaminase activities enables a nondestructive single-enzyme methylation sequencing method for base resolution high-coverage methylome mapping of cell-free and ultra-low input DNA. bioRxiv; DOI: 10.1101/2023.06.29.547047, available at https://www.biorxiv.org/content/10.1101/2023.06.29.547047v1. SEM-seq employs a nonspecific, modification-sensitive double-stranded DNA deaminase (MsddA) in a nondestructive single-enzyme 5-methylcytosine sequencing (SEM-seq) method that deaminates unmodified cytosines. Accordingly, SEM-seq does not require the TET2 and T4-βGT or 5-hydroxymethylcytosine carbamoyltransferase protection and denaturing steps that are of use, e.g., in APOBEC3A-based protocols. Additionally, MsddA does not deaminate 5-formylated cytosines (5fC) or 5-carboxylated cytosines (5caC). In SEM-seq, unmodified cytosines in the DNA are deaminated to uracil and are read as “T” during sequencing. Modified cytosines (e.g., 5mC) are not converted and are read as “C” during sequencing. Cytosines that are read as thymines are identified as unmodified (e.g., unmethylated) cytosines or as thymines in the DNA. Performing SEM-seq conversion thus facilitates identifying positions containing 5mC using the sequence reads obtained. In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA comprises enzymatic conversion of the first nucleobase using MsddA.
[0333] In some embodiments, the procedure that affects a first nucleobase in the DNA differently from a second nucleobase in the DNA of the first subsample converts a modified nucleoside. In some embodiments, the conversion procedure which converts a modified nucleoside comprises enzymatic conversion, such as DM-seq, for example, as described in WO2023 / 288222A1. In DM-seq, unmodified cytosines in the DNA are enzymatically protected from a subsequent deamination step wherein 5mC in 5mCpG is converted to T. The enzymatically protected unmodified (e.g., unmethylated) cytosines are not converted and are read as “C” during sequencing. Cytosines that are read as thymines (in a CpG context) are identified as methylated cytosines in the DNA. Thus, when this type of conversion is used, the first nucleobase comprises unmodified (such as unmethylated) cytosine, and the second nucleobase comprises modified (such as methylated) cytosine. Sequencing of the converted DNA identifies positions that are read as cytosine as being unmodified C positions. Meanwhile, positions that are read as T are identified as being T or 5mC. Performing DM-seq conversion thus facilitates identifying positions containing 5mC using the sequence reads obtained.
[0334] Exemplary cytosine deaminases for use herein include APOBEC enzymes, for example, APOBEC3A. Generally, AID / APOBEC family DNA deaminase enzymes such as APOBEC3A (A3A) are used to deaminate (unprotected) unmodified cytosine and 5mC. For an exemplary description of APOBEC conversion, see, e.g., Schutsky et al., Nature Biotechnology 2018; 36: 1083-1090.
[0335] The enzymatic protection of unmodified cytosines in the DNA comprises addition of a protective group to the unmodified cytosines. Such protective groups can comprise an alkyl group, an alkyne group, a carboxyl group, a carboxyalkyl group, an amino group, a hydroxymethyl group, a glucosyl group, a glucosylhydroxymethyl group, an isopropyl group, or a dye. For example, DNA can be treated with a methyltransferase, such as a CpG-specific methyltransferase, which adds the protective group to unmodified cytosines. The term methyltransferase is used broadly herein to refer to enzymes capable of transferring a methyl or substituted methyl (e.g., carboxymethyl) to a substrate (e.g., a cytosine in a nucleic acid). In some embodiments, the DNA is contacted with a CpG-specific DNA methyltransferase (MTase), such as a CpG-specific carboxymethyltransferase (CxMTase), and a substituted methyl donor, such as a carboxymethyl donor (e.g., carboxymethyl-S-adenosyl-L-methionine). See, e.g., WO2021 / 236778A2. In particular embodiments, the CxMTase can facilitate the addition of a protective carboxymethyl group to an unmethylated cytosine. In some embodiments, the unmethylated cytosine is unmodified cytosine. The carboxymethyl group can prevent deamination of the cytosine during a deamination step (such as a deamination step using an APOBEC enzyme, such as A3A). Substituted methyl or carboxymethyl donors useful in the disclosed methods include, but are not limited to, S-adenosyl-L-methionine (SAM) analogs, optionally wherein the SAM analog is carboxy-S-adenosyl-L-methionine (CxSAM). SAM analogs are described, for example, in WO2022 / 197593A1. The MTase may be, for example, a CpG methyltransferase from Spiroplasma sp. strain MQ1 (M.SssI), DNA-methyltransferase 1 (DNMT1), DNA-methyltransferase 3 alpha (DNMT3A), DNA-methyltransferase 3 beta (DNMT3B), or DNA adenine methyltransferase (Dam). The CxMTase may be a CpG methyltransferase from Mycoplasma penetrans (M.MpeI). In a particular embodiment, the methyltransferase enzyme is a variant of M.MpeI, wherein the amino acid corresponding to position 374 is R or K.
[0336] In one embodiment, the methyltransferase enzyme is a variant of M.MpeI having an N374R substitution or an N374K substitution. The methyltransferase variant can further comprise one or more amino acid substitutions selected from a) substitution of one or both residues T300 and E305 with S, A, G, Q, D, or N; b) substitution of one or more residues A323, N306, and Y299 with a positively charged amino acid selected from K, R or H; and / or c) substitution of S323 with A, G, K, R or H, which may enhance the activity of the enzyme.
[0337] Optionally, the conversion procedure further includes enzymatic protection of 5hmCs, such as by glucosylation of the 5hmCs (e.g., using βGT) or by carbamoylation of the 5hmCs (e.g., using 5-hydroxymethylcytosine carbamoyltransferase), in the DNA prior to the deamination of unprotected modified cytosines. In this method, 5hmC can be protected from conversion, for example through glucosylation using β-glucosyltransferase (βGT), forming 5-glucosylhydroxymethylcytosine (5ghmC), or through carbamoylation using 5-hydroxymethylcytosine carbamoyltransferase, forming 5cmC. This is described, for example, in Yu et al., Cell 2012; 149: 1368-80, and in Yang et al., Bio-protocol, 2023; 12(17): e4496. Glucosylation or carbamoylation of 5hmC can reduce or eliminate deamination of 5hmC by a deaminase such as APOBEC3A. Treatment with an MTase or CxMTase then adds a protecting group to unmodified (unmethylated) cytosines in the DNA. 5mC (but not protected, unmodified cytosine and not 5ghmC or 5cmC) is then deaminated (converted to T) by treatment with a deaminase, for example, an APOBEC enzyme (such as APOBEC3A). Sequencing of the converted DNA identifies positions that are read as cytosine as being either 5hmC or unmodified C positions. Meanwhile, positions that are read as T are identified as being T or 5mC. Performing DM-seq conversion with glucosylation of 5hmC on a sample as described herein thus facilitates distinguishing positions containing unmodified C or 5hmC, on the one hand, from positions containing 5mC, on the other, using the sequence reads obtained.
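The read-out logic of the conversion scheme described above can be illustrated with a short Python sketch. This is a minimal illustration and not part of the disclosed methods: the table and the function names (`CONVERSION_READOUT`, `possible_original_bases`, `call_reference_cytosine`) are hypothetical, and the sketch assumes complete protection and deamination and no C-to-T sequence variant at the position of interest.

```python
# Illustrative read-out table for the protection/deamination scheme described
# above (assumes complete conversion; names are hypothetical).
CONVERSION_READOUT = {
    "C": "C",     # unmodified cytosine: protected by the MTase/CxMTase, read as C
    "5hmC": "C",  # protected by glucosylation or carbamoylation, read as C
    "5mC": "T",   # unprotected, deaminated and read as T
    "T": "T",     # thymine is unaffected and read as T
}


def possible_original_bases(read_base):
    """Return the original states that are consistent with a sequenced base."""
    return sorted(
        original
        for original, read in CONVERSION_READOUT.items()
        if read == read_base
    )


def call_reference_cytosine(reference_base, read_base):
    """Classify a position whose reference base is cytosine.

    Assumes the sample carries no C>T variant at this position, so a T read
    at a reference C is attributed to deamination of 5mC.
    """
    if reference_base != "C":
        return "not_applicable"
    if read_base == "T":
        return "5mC"
    if read_base == "C":
        return "C_or_5hmC"
    return "uncallable"
```

For example, `possible_original_bases("C")` lists both 5hmC and unmodified C, reflecting that this scheme does not separate those two states, whereas `call_reference_cytosine("C", "T")` attributes the position to 5mC.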
[0338] Also provided herein are methods in which alternative base conversion schemes are used. For example, unmethylated cytosines can be left intact while methylated cytosines and hydroxymethylcytosines are converted to a base read as a thymine (e.g., uracil, thymine, or dihydrouracil).
[0339] In some embodiments, methylating a cytosine in at least one first complementary strand or second complementary strand comprises contacting the cytosine with a methyltransferase such as DNMT1 or DNMT5. In such embodiments, the step of oxidizing a 5-hydroxymethylated cytosine to 5-formylcytosine (such as by contacting the 5-hydroxymethylcytosine in a first strand and a second strand with KRuO4) can be optional.
[0340] In some embodiments, converting the modified cytosine in at least one first or second strand to a thymine or a base read as thymine comprises oxidizing a hydroxymethyl cytosine, e.g., the hydroxymethyl cytosine is oxidized to formylcytosine. In some embodiments, oxidizing the hydroxymethyl cytosine to formylcytosine comprises contacting the hydroxymethyl cytosine with a ruthenate, such as potassium ruthenate (KRuO4).
[0341] In some embodiments, the modified cytosine is converted to thymine, uracil, or dihydrouracil. In any such embodiments, amplification methods may comprise uracil- and / or dihydrouracil-tolerant amplification methods, such as PCR using a uracil- and / or dihydrouracil-tolerant DNA polymerase.
[0342] In some embodiments, the method comprises converting a formylcytosine and / or a methylcytosine to carboxylcytosine as part of converting the modified cytosine in at least one first or second strand to a thymine or a base read as thymine. For example, converting the formylcytosine and / or the methylcytosine to carboxylcytosine can comprise contacting the formylcytosine and / or the methylcytosine with a TET enzyme, such as TET1, TET2, TET3, or a TET2 comprising a T1372S mutation. In some embodiments, the method comprises reducing the carboxylcytosine as part of converting the modified cytosine in at least one first or second strand to a thymine or a base read as thymine, and / or the carboxylcytosine is reduced to dihydrouracil. In some embodiments, reducing the carboxylcytosine comprises contacting the carboxylcytosine with a borane or borohydride reducing agent.
[0343] In some embodiments, the borane or borohydride reducing agent comprises pyridine borane, 2-picoline borane, borane, tert-butylamine borane, ammonia borane, sodium borohydride, sodium cyanoborohydride (NaBH3CN), lithium borohydride (LiBH4), ethylenediamine borane, dimethylamine borane, sodium triacetoxyborohydride, morpholine borane, 4-methylmorpholine borane, trimethylamine borane, dicyclohexylamine borane, or a salt thereof. In other embodiments, the reducing agent comprises lithium aluminum hydride, sodium amalgam, amalgam, sulfur dioxide, dithionate, thiosulfate, iodide, hydrogen peroxide, hydrazine, diisobutylaluminum hydride, oxalic acid, carbon monoxide, cyanide, ascorbic acid, formic acid, dithiothreitol, beta-mercaptoethanol, or any combination thereof.
[0344] Various TET enzymes may be used in the disclosed methods as appropriate. In some embodiments, the one or more TET enzymes comprise TETv. TETv is described in US Patent 10,260,088. In some embodiments, the one or more TET enzymes comprise TETcd. TETcd is described in US Patent 10,260,088. In some embodiments, the one or more TET enzymes comprise TET1. In some embodiments, the one or more TET enzymes comprise TET2. TET2 may be expressed and used as a fragment comprising TET2 residues 1129-1480 joined to TET2 residues 1844-1936 by a linker as described, e.g., in US Patent 10,961,525. In some embodiments, the one or more TET enzymes comprise TET1 and TET2. In some embodiments, the one or more TET enzymes comprise a V1900 TET mutant, such as a V1900A, V1900C, V1900G, V1900I, or V1900P TET mutant. In some embodiments, the one or more TET enzymes comprise a V1900 TET2 mutant, such as a V1900A, V1900C, V1900G, V1900I, or V1900P TET2 mutant. It can be beneficial to use a TET enzyme that maximizes formation of 5-carboxylcytosine (5-caC) relative to less oxidized modified cytosines, particularly 5-formylcytosine, because 5-caC is not a substrate for enzymatic deamination, e.g., by APOBEC enzymes such as APOBEC3A. Maximizing formation of 5-caC thus reduces the risk of false calls in which a base is identified as unmethylated because it underwent deamination even though it was methylated (or hydroxymethylated) in the original sample. Accordingly, in some embodiments, the TET enzyme comprises a mutation that increases formation of 5-caC. In some embodiments, the one or more TET enzymes comprise a TET2 enzyme comprising a T1372S mutation, such as TET2-CS-T1372S and TET2-CD-T1372S. A TET2 comprising a T1372S mutation is described in US Patent 10,961,525 and may be expressed and used as a fragment comprising TET2 residues 1129-1480 joined to TET2 residues 1844-1936 by a linker. Position 1372 of TET2 corresponds to position 258 of SEQ ID NO: 21 (wild type TET2 catalytic domain) of US Patent 10,961,525. Thus, the sequence of a T1372S TET2 catalytic domain may be obtained by changing the threonine at position 258 of SEQ ID NO: 21 of US Patent 10,961,525 to serine. TET2 comprising a T1372S mutation is also described in Liu et al., Nat Chem Biol. 2017 February; 13(2): 181-187. As demonstrated in Liu et al., TET2 comprising a T1372S mutation can more efficiently oxidize 5mC to produce 5-carboxylcytosine (5caC) than other versions of TET2 such as TET2 lacking a T1372S mutation. In some embodiments, the TET2 enzyme is a human TET2 enzyme comprising a T1372S mutation. Exemplary mutations are set forth above. “A mutation that increases formation of 5-caC” means that the TET enzyme having the mutation produces more 5-caC than a TET enzyme that lacks the mutation but is otherwise identical. 5-caC production can be measured as described, e.g., in Liu et al., Nat Chem Biol 13:181-187 (2017) (see Online Methods section, TET reactions in vitro subsection, “driving” conditions). Any variants and / or mutants described in Liu et al. (2017) can be used in the disclosed methods as appropriate.
[0345] Provided herein is a method comprising contacting DNA with a mutant TET2 enzyme (e.g., comprising a V1900A, V1900C, V1900G, V1900I, V1900P, or T1372S mutation) to oxidize 5-methylcytosine (5mC) and / or 5-hydroxymethylcytosine (5hmC) present in the DNA to ...
Claims
WHAT IS CLAIMED IS:
1. A method comprising: obtaining, by a computing system including one or more computing devices each including processing resources and memory, training data, the training data indicating an amount of nucleic acid molecules that overlap with individual genomic regions of a plurality of genomic regions, individual nucleic acid molecules satisfying a methylation criteria corresponding to amounts of methylated cytosine-guanine dinucleotides (CpGs) present in the individual nucleic acid molecules and the amount of nucleic acid molecules being derived from a plurality of first training samples; performing, by the computing system and based on the training data, a first training process for a first machine learning model having a number of transformer blocks to produce a first trained machine learning model that predicts tokens for the individual genomic regions of the plurality of genomic regions, the tokens indicate a number of nucleic acid molecules that (i) are derived from one or more additional samples and (ii) correspond to the individual genomic regions; determining, by the computing system, output activations of individual transformer blocks of the number of transformer blocks of the first trained machine learning model; and performing, by the computing system and based on the output activations of the individual transformer blocks of the number of transformer blocks of the first trained machine learning model, a second training process for a second machine learning model that predicts one or more classifications corresponding to one or more biological conditions being present in one or more subjects, the second training process being performed using sequence representations derived from second training samples.
2. The method of claim 1, comprising: obtaining, by the computing system, sequencing data derived from a number of samples, the sequencing data corresponding to nucleic acid molecules included in the number of samples; and determining, by the computing system and based on the sequencing data, quantitative measures for a number of genomic regions, individual quantitative measures corresponding to a number of the nucleic acid molecules corresponding to an individual genomic region of the number of genomic regions.
3. The method of claim 2, comprising: determining, by the computing system and based on the sequencing data, mutant allele fractions for the number of genomic regions; and determining, by the computing system, an additional quantitative measure for individual genomic regions of the number of genomic regions, the additional quantitative measure indicating a level of correlation between the individual quantitative measure and the mutant allele fraction for the individual genomic region.
4. The method of claim 3, comprising: determining, by the computing system, a first subset of the number of genomic regions having quantitative measures that are at least a first threshold value; determining, by the computing system, a second subset of the number of genomic regions having additional quantitative measures that are at least a second threshold value; and determining, by the computing system, the plurality of regions related to the training data by combining the first subset of the number of genomic regions and the second subset of the number of genomic regions.
5. The method of claim 2, wherein determining the quantitative measures includes: for individual genomic regions of the plurality of genomic regions, determining, by the computing system, a first number of the amount of nucleic acid molecules that correspond to the individual genomic regions; for individual control genomic regions, determining, by the computing system, a second number of the amount of nucleic acid molecules that correspond to the individual control genomic regions, wherein the individual control genomic regions include genomic regions having a minimum number of methylated cytosine-guanine dinucleotides in subjects in which a tumor is not present; and performing, by the computing system, a transformation of the first number of the amount of nucleic acid molecules with respect to the second number of the amount of nucleic acid molecules.
6. The method of claim 2, comprising: for individual genomic regions of the number of genomic regions, determining, by the computing system, a range of values of the quantitative measures that correspond to the individual genomic regions; and determining, by the computing system, a subset of the values of the quantitative measures that correspond to individual partitions of a number of partitions corresponding to the individual genomic regions.
7. The method of claim 6, wherein the number of partitions are distributed such that individual partitions of the number of partitions correspond to a same number of the amount of nucleic acid molecules included in the training data.
8. The method of claim 6 or 7, comprising: for individual first training samples of the plurality of first training samples: for individual genomic regions of the plurality of genomic regions, determining, by the computing system, a partition of the number of partitions based on a number of the amount of nucleic acid molecules derived from the individual first training sample that correspond to the individual genomic region; and determining a number of tokens, individual tokens of the number of tokens corresponding to the partition of the number of partitions for the individual genomic region; wherein the training data includes the number of tokens for the plurality of genomic regions for the plurality of first training samples.
9. The method of any one of claims 1-8, wherein the second training process is performed with respect to labeled training data that includes first sequence representations derived from a first portion of the second training samples and second sequence representations derived from a second portion of the second training samples, the first portion of the second training samples being derived from first subjects in which a tumor is not detected and the second portion of the second training samples being derived from second subjects in which one or more cancer types or subtypes have been detected.
10. The method of any one of claims 1-9, wherein the one or more classifications include a first classification indicating that a biological condition is present in a subject and a second classification indicating that a biological condition is not present in a subject.
11. The method of any one of claims 1-9, wherein the one or more classifications include a first classification indicating a first cancer type and a second classification indicating a second cancer type.
12. The method of any one of claims 1-9, wherein the one or more classifications correspond to one or more tumor fraction values.
13. The method of any one of claims 1-9, wherein the one or more classifications correspond to a level of homology directed repair with respect to a test subject.
14. The method of any one of claims 1-9, comprising: obtaining, by the computing system, first test sequencing data derived from one or more first test samples obtained from a test subject at a first time; determining, by the computing system, a first classification for the test subject based on implementing the second machine learning model with respect to the first test sequencing data; obtaining, by the computing system, second test sequencing data derived from one or more second test samples obtained from the test subject at a second time; determining, by the computing system, a second classification for the test subject based on implementing the second machine learning model with respect to the second test sequencing data.
15. The method of claim 14, comprising determining an amount of progression of cancer or an amount of regression of cancer based on an amount of difference between the first classification and the second classification.
16. The method of claim 14 or 15, wherein the first time is before administering one or more treatments to the test subject and the second time is after administering the one or more treatments to the test subject.
17. The method of claim 16, comprising determining a level of effectiveness of the one or more treatments based on an amount of difference between the first classification and the second classification.
18. The method of claim 14, comprising determining an indication of minimum residual disease based on an amount of difference between the first classification and the second classification.
19. The method of any one of claims 1-9, wherein the one or more classifications correspond to a presence of one or more genomic mutations present in nucleic acid molecules derived from samples obtained from subjects.
20. The method of any one of claims 1-19, wherein the second machine learning model implements one or more statistical classification models or one or more machine learning classification models that are different from the number of transformer blocks of the first machine learning model.
21. The method of any one of claims 1-20, wherein: the number of transformer blocks of the first machine learning model are arranged in a sequence having a first transformer block and a last transformer block with output activations from an individual transformer block being provided as input activations to a next transformer block in the sequence; the number of transformer blocks include a first number of neurons; the second machine learning model includes an additional transformer block having input activations that correspond to output activations of a next to last transformer block of the sequence; and the additional transformer block includes a second number of neurons that is different from the first number of neurons.
22. The method of claim 21, wherein the first number of neurons corresponds to a number of tokens produced based on sequencing data obtained from one or more subjects and the second number of neurons corresponds to the one or more classifications.
23. The method of any one of claims 1-22, comprising: after performing the second training process: obtaining, by the computing system, test sequencing data derived from a test sample obtained from a test subject; determining, by the computing system and based on the test sequencing data, quantitative measures for the plurality of genomic regions, individual quantitative measures indicating a number of nucleic acid molecules included in the test sample that (i) correspond to an individual genomic region of the plurality of genomic regions and (ii) satisfy the methylation criteria; determining, by the computing system and based on the quantitative measures, tokens for the test sample, individual tokens indicating a partition of a plurality of partitions for an individual genomic region of the plurality of genomic regions; providing, by the computing system, the tokens as input to the second machine learning model; and determining, by the computing system and based on implementing the second machine learning model with respect to the tokens, a classification of the one or more classifications for the test subject.
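By way of illustration only, the normalization and tokenization steps recited in claims 5-8 and 23 could be sketched in Python as follows. The sketch is not the claimed implementation and rests on stated assumptions: the "transformation" of claim 5 is taken to be a simple ratio of region counts to control-region counts, the partitions of claim 7 are taken to be equal-frequency (quantile) bins fitted on the training samples, and a token is taken to be the index of the partition. All function names are hypothetical.

```python
from bisect import bisect_right


def normalize(region_count, control_count):
    """Assumed transformation: region molecule count relative to control count."""
    return region_count / control_count if control_count else 0.0


def quantile_edges(values, n_partitions):
    """Interior bin edges so each partition holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_partitions] for k in range(1, n_partitions)]


def fit_partitions(training_measures, n_partitions):
    """Fit per-region edges from {region: [measure for each training sample]}."""
    return {
        region: quantile_edges(values, n_partitions)
        for region, values in training_measures.items()
    }


def tokenize(sample_measures, edges_by_region):
    """Map each region's measure in one sample to the index of its partition."""
    return {
        region: bisect_right(edges_by_region[region], value)
        for region, value in sample_measures.items()
    }


# Toy data: one region measured in eight training samples, four partitions.
training = {"region_1": [0, 1, 2, 3, 4, 5, 6, 7]}
edges = fit_partitions(training, n_partitions=4)
test_tokens = tokenize({"region_1": 4.5}, edges)
```

With this toy data each of the four partitions receives two of the eight training values, and a test sample is tokenized with the edges fitted on the training samples, so that training and test tokens share the same partitions.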