Epigenetic profiling method of nucleotide residues in cell-free DNA
The enzymatic 'unmethylome' profiling method enables simultaneous and unbiased profiling of nucleotide residues and genetic mutations in polynucleotides, overcoming limitations of existing technologies by using a one-pot process that avoids PCR amplification and sequence bias, facilitating efficient analysis of low DNA concentrations for personalized medicine.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- TAGOMICS LTD
- Filing Date
- 2025-12-12
- Publication Date
- 2026-06-18
AI Technical Summary
Current methods for epigenetic profiling of nucleotide residues and genetic mutations in polynucleotides, particularly in low quantities of circulating cell-free DNA, are limited by DNA degradation, low conversion efficiencies, biased representation of genomic loci, and the inability to simultaneously profile mutations and residues without sequence amplification, which hinders personalized medicine applications.
An enzymatic 'unmethylome' profiling approach that derivatizes unmodified nucleotides for unbiased isolation and sequencing, allowing simultaneous profiling of nucleotide residues and genetic mutations in a one-pot process compatible with standard sequencing platforms, without the need for PCR amplification.
The method provides a non-destructive, reproducible, and unbiased profiling of nucleotide residues and genetic mutations, enabling efficient analysis of low DNA concentrations, suitable for single-cell analysis and diagnosis of diseases, with minimal sample loss and no sequence bias.
Abstract
Description
[0001] Profiling Method
[0002] Field
[0003] The present application relates to methods of determining the status of nucleotide residues and the presence of genetic mutations in polynucleotides.
[0004] Introduction
[0005] Epigenetic modifications of polynucleotides, such as methylation and hydroxymethylation of cytosine, play an important role in determining the activity of a gene or a much more extended region of the genome. For example, methylation of DNA is critical in embryogenesis and early development, and is known to change predictably in correlation with biological ageing of an organism. On the other hand, aberrant modification of DNA can be an important driver of tumourigenesis, and the broader dysregulation of genes is likely to play a key role in many diseases.
[0006] Despite the critical role of nucleotide, and in particular cytosine, modification in the regulation of gene expression, current methods for studying epigenetic modifications fundamentally limit the scope of current studies of the epigenome. Methods comprising bisulfite conversion of cytosine to uracil may be used for epigenetic analysis, but the treatment of DNA with bisulfite can lead to DNA degradation. This limits the application of bisulfite in samples where the DNA quantity is low, as is typical for circulating cell-free DNA (cfDNA) in blood. Bisulfite-free approaches have been demonstrated that employ pyridine borane base conversion or enzymatic deamination of unmethylated cytosine. However, such approaches can suffer from low conversion efficiencies, relative to bisulfite conversion, and are inherently focussed on the analysis of individual cytosine bases, necessitating (comparative) whole-genome sequencing for biomarker discovery. This invariably leads to reduction of the test to a panel of genomic loci for application in the clinic, which effectively limits the diagnosis to a population that is similar to that profiled during the biomarker discovery phase of the test development.
[0007] Enrichment-based approaches for epigenetic profiling allow cost-effective, whole genome profiling that is more suited to the challenges of delivering personalised medicine. Until recently, the amount of available cfDNA in a blood sample limited the application of either antibodies (“methyl-DNA Immunoprecipitation”, also referred to as “MeDIP-Seq”) or methyl-binding domain protein (referred to as “MBD-Seq”) in the analysis of cfDNA. A further limitation of approaches involving MeDIP-Seq or MBD-Seq is that these proteins preferentially bind heavily methylated regions of the genome, which leads to an over-representation of these loci in the sequencing experiment.
[0008] Moreover, current methods of epigenetic profiling do not permit the simultaneous profiling of genetic mutations in small samples of the polynucleotide of interest. This is particularly challenging for those approaches employing base conversion, which render the unambiguous identification of mutations against a background of converted bases challenging. Knowing the rates of genetic mutations associated with epigenetically modified regions of the genome may be advantageous, for example, for the diagnosis of disease. For example, it is broadly understood that the genomes of many cancers are both genetically mutated and hypomethylated, relative to a healthy genome.
[0009] There is, therefore, a need for improved profiling methods, that are capable of providing a profile of both the status of nucleotide residues and genetic mutations, in small quantities of the polynucleotide of interest, such as may be obtained from peripheral blood samples, that are capable of application to large portions of the polynucleotide sample without bias arising from differences in sequence, such as CpG density, or from sequence amplification such as by PCR-based methods, and that may be applied cost effectively at large scales for use as diagnostic tools and to inform personalised medicine approaches. The inventors have developed an epigenetic profiling technique which meets these requirements and overcomes various limitations of existing methods, including those discussed above.
[0010] The disclosed method is based on an enzymatic “unmethylome” profiling approach, in which unmodified nucleotides, such as unmodified CpG dinucleotides, in a DNA sample are derivatised in such a way that they can then be isolated without bias and subsequently sequenced. A focused or global profile of genetic mutations in the polynucleotide sample may also be obtained in parallel from the same sample. The disclosed method may be used, for example, to provide in parallel both a profile of the status of nucleotide residues and a genetic mutation profile in a target region of the genome or across the whole genome. The disclosed technique is an advantageously simple process in which the sample preparation steps (i.e. the steps prior to fractionation) can be performed in a one-pot approach, that can be integrated readily with standard high throughput sequencing platforms, and may be used with lower quantities of input DNA than has previously been possible, to generate genome-wide epigenetic and genetic mutation profiles.
[0011] The inventors have found that the disclosed method provides significant advantages over previous methods of determining the status of nucleotide residues and epigenetic analysis. The non-destructive nature of the disclosed method means that the approach may be used in parallel with other analytical approaches.
[0012] Moreover, the inventors have surprisingly found that amplification of the input sample, for example using a PCR-based method, is not required. Thus, the disclosed method preferably comprises an amplification-free approach, and in particular, preferably does not comprise the use of PCR to amplify the input sample. Since an amplification process is not a requirement of the disclosed method, problems arising from polynucleotide amplification, including for example, the introduction of sequence errors, GC bias, reduced library quality, etc., may advantageously be avoided.
[0013] The disclosed method combines a procedure of DNA library preparation for next generation sequencing and a method for labelling unmodified, such as epigenetically unmodified, nucleotides. Using the label, the DNA library is subsequently fractionated into modified and unmodified fractions. The unmodified fraction is then subdivided into first and second subsets. An epigenetic profile of the polynucleotide sample may be obtained by sequencing the first subset. A genetic mutation profile of the polynucleotide sample may be obtained in parallel by sequencing a pooled fraction formed by combining at least a portion of the second subset with at least a portion of the second (unlabelled i.e. modified) fraction. The pooled fraction formed in this way, by combining portions of the second subset and second fraction, comprises the complete nucleotide sequence of the polynucleotide sample, and thus when sequenced may provide a genetic mutation profile of the polynucleotide sample as a whole. Any of the subsets and fractions (such as the first subset of the first fraction, the second fraction, and / or the pooled fraction) may be further adapted for subsequent analysis, such as by target enrichment for one or more genome regions of interest. The method advantageously minimises the number of DNA purification steps required and is highly efficient. As a result, the disclosed enzymatic platform can be applied at DNA concentrations that are compatible with single-cell analysis (picogram inputs). The input sample does not require amplification, and the approach has been found to be highly reproducible and unbiased, and may be used as a platform for the diagnosis of disease and the identification of tissue of origin in a sample. The underlying chemistry requires no a priori assumptions to be made about the sample, making the platform ideally suited for the discovery of novel biomarkers of disease.
[0014] Central to the method is the production of a fractionated sequencing library comprising first (labelled i.e. epigenetically unmodified) and second (unlabelled i.e. epigenetically modified) fractions, wherein the first fraction is subdivided into first and second subsets enabling the subsets to be separately further analysed. Most significantly, this approach enables at least a portion of the second subset of the first fraction to be combined with at least a portion of the second fraction to form a pooled fraction that has been found to provide surprising advantages for subsequent sequencing operations.
[0015] Thus, in a first aspect, there is provided a fractionated sequencing library produced from a polynucleotide sample, wherein the sequencing library comprises:
[0016] (i) separate first and second subsets of polynucleotides that are derived from a first fraction of the polynucleotide sample, wherein the first fraction is enriched for polynucleotides comprising an affinity label, wherein the affinity label is bound to a tag that is bound site-specifically to a nucleotide residue in each polynucleotide; and (ii) a second fraction of polynucleotides that is enriched for polynucleotides lacking an affinity label.
[0017] As used herein, unless otherwise indicated, a “sequencing library” refers to a plurality of polynucleotides, each comprising a sequencing adaptor, such as a sequencing adaptor arranged for use in next generation sequencing.
[0018] In some embodiments, the polynucleotide sample has not been amplified, and thus the fractionated sequencing library may be an unamplified fractionated sequencing library. In some embodiments, the tag may have been applied by a methyltransferase enzyme using a methyltransferase cofactor analogue. The methyltransferase enzyme may be an enzyme that has been configured to modify a nucleotide residue in a target position to apply the tag to each unmodified nucleotide residue in a polynucleotide, wherein each unmodified nucleotide residue is unmodified in the target position. In some embodiments, the first subset of the first fraction and / or the second fraction may have undergone target enrichment for one or more genome regions of interest. Target enrichment may have been performed after fractionation. Any suitable method of target enrichment may be used, including, for example:
[0019] (a) in-solution hybridization using oligonucleotide “bait” probes specific for genomic regions of interest; and / or
[0020] (b) a PCR-based enrichment method.
[0021] In some embodiments, the fractionated sequencing library may have been amplified.
[0022] Thus, in some embodiments, the fractionated sequencing library may be an amplified fractionated sequencing library. Preferably, however, the fractionated sequencing library has not been amplified.
[0023] In some embodiments, the polynucleotides of the sequencing library may comprise an indexing barcode. In some embodiments, the indexing barcode may comprise a unique dual nucleotide index, and / or a unique molecular identifier.
[0024] In some embodiments, the polynucleotide sample may comprise or substantially consist of, and / or may have been fragmented to comprise or substantially consist of, polynucleotides having a length between 10 and 500 bp. In some embodiments, the polynucleotides may substantially or predominantly have a length of 50-475 nucleotides, such as 100-450 nucleotides, 125-425 nucleotides, or 150-400 nucleotides.
[0025] In some embodiments, the polynucleotide sample may comprise or substantially consist of, and / or may have been fragmented to comprise or substantially consist of, polynucleotides having a length corresponding to the DNA sequencing read length. In such embodiments, the polynucleotides may substantially or predominantly have a length of 100-250 nucleotides, preferably 150-180 nucleotides.
[0026] As used herein, unless otherwise stated, the terms “fractionated”, “fractionating”, “fractionation”, and similar terms, in relation to the sequencing library refer to the separation of the polynucleotides of the sequencing library into different groups or “fractions” on the basis of the presence or absence of one or more specific features. In particular, polynucleotides may be fractionated based on the presence or absence of an associated affinity label, thereby forming a first fraction that is enriched for polynucleotides comprising an affinity label and a second fraction that is enriched for polynucleotides that do not comprise an affinity label. As such, the fractionation of the polynucleotides may also be described and / or referred to as “enriching” and / or “enrichment” of the sequencing library for polynucleotides having or not having the one or more specific features. For example, a fraction of polynucleotides may be referred to as being “enriched” for labelled or unlabelled polynucleotides, as appropriate. Selection of polynucleotides having a specific feature from a mixture of polynucleotides results in the removed fraction being “enriched” for the specific feature, and the remainder being “enriched” for polynucleotides lacking the specific feature.
[0027] The terms “enriched”, “enriching”, and “enrichment” of polynucleotides as used herein, unless otherwise stated, refer to a polynucleotide concentration and / or proportion that is greater than the corresponding polynucleotide concentration and / or proportion in the initial (unenriched) sample. References to “enriching the labelled polynucleotide library” and similar terms refer to fractionating the sequencing library / polynucleotide sample into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label. Likewise, references to “enriching the labelled sequencing library” or “enriching the labelled DNA library” and similar terms refer to fractionating the sequencing library / polynucleotide / DNA sample into first and second fractions, wherein the first fraction is enriched for polynucleotides / DNA molecules comprising an affinity label, and wherein the second fraction is enriched for polynucleotides / DNA molecules lacking an affinity label.
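The definition of enrichment given above amounts to a comparison of proportions, which the following minimal sketch illustrates. The function name and all counts are hypothetical and chosen only for illustration; the source does not prescribe any particular calculation.

```python
def fold_enrichment(labelled_in_fraction, total_in_fraction,
                    labelled_in_sample, total_in_sample):
    """Ratio of the labelled proportion in a fraction to that in the initial sample.

    A value greater than 1 means the fraction is "enriched" for labelled
    polynucleotides relative to the initial (unenriched) sample.
    """
    if total_in_fraction <= 0 or total_in_sample <= 0:
        raise ValueError("totals must be positive")
    if labelled_in_sample <= 0:
        raise ValueError("initial sample must contain labelled polynucleotides")
    fraction_proportion = labelled_in_fraction / total_in_fraction
    sample_proportion = labelled_in_sample / total_in_sample
    return fraction_proportion / sample_proportion


# Hypothetical counts: 300 of 1000 fragments are labelled in the initial sample,
# and 270 of the 300 fragments recovered in the first fraction are labelled.
ratio = fold_enrichment(270, 300, 300, 1000)
is_enriched = ratio > 1
```

The same comparison applies to the second fraction if unlabelled counts are passed in place of labelled counts.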
[0028] As used herein, unless otherwise stated, the terms “label”, “labelled”, “labelling” and similar terms refer to the targeted binding (covalent or otherwise, either directly or indirectly) of a compound that facilitates selective enrichment of the targeted polynucleotides.
[0029] In some embodiments, the tag is not a fluorescent tag. In some embodiments, the label is not a fluorescent label. In some embodiments, neither the tag nor the label is fluorescent. Thus, in some embodiments, the label and / or tag does not consist of or comprise a fluorophore or fluorophore derivative that is capable of emitting light when excited, such as re-emitting light upon light excitation.
[0030] In some embodiments, the label may be suitable for use for enriching the labelled polynucleotides from a mixture comprising labelled and unlabelled polynucleotides.
[0031] In some embodiments, the label may be suitable for use directly for enriching the labelled polynucleotides. In other embodiments, a secondary compound that specifically binds the label, preferably with a high affinity, may be used for enrichment of the labelled polynucleotides.
[0032] In some embodiments, the label may comprise a tag from a cofactor analogue. The tag may be bound to a nucleotide residue that is unmodified in a target position. Preferably the process of preparing a sequencing library does not comprise combining the polynucleotides together to form an extended ligated polynucleotide for use, for example, in a sequencing method comprising nanopore technology.
[0033] As used herein, unless otherwise stated, the terms “site-specific”, “site-specifically”, and similar terms, refer to the application of a label to a target atom within a nucleotide residue having a particular configuration. In some embodiments, the particular configuration may be the absence of a specific modification, such as a cytosine that is unmethylated in the C5 position, and the label may bind to an atom that would otherwise have been bound by the modification.
[0034] As used herein, unless otherwise stated, the term “subset”, and similar terms, refers to a part of a sample that has a composition that is representative of the composition of the sample as a whole. A subset may be divided or otherwise non-selectively separated from the remainder of the sample. Thus, two or more subsets of a sample may or may not be equal or substantially equal in size or quantity, but the proportions of constituent components within each subset, and / or the composition of each subset, are preferably substantially identical. References to a “portion” of a feature encompass both part of the feature and the whole feature. Any suitable method may be used to divide, separate, or otherwise produce first and second subsets from or of the first fraction. In some embodiments, the polynucleotide sample may be a DNA sample. In some embodiments, the polynucleotide sample may be a cfDNA sample. In a second aspect, there is provided a method for making a fractionated sequencing library from a polynucleotide sample, the method comprising:
[0035] (i) using a methyltransferase enzyme configured to modify a nucleotide residue in a target position to apply a tag to each unmodified nucleotide residue in a polynucleotide of the sample, wherein each unmodified nucleotide residue is unmodified in the target position;
[0036] (ii) inactivating the methyltransferase;
[0037] (iii) preparing the polynucleotide sample into a sequencing library;
[0038] (iv) binding an affinity label to each tag;
[0039] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label; and
[0040] (vi) producing separate first and second subsets of the first fraction.
[0041] The fractionated sequencing library produced by the method of the second aspect may be a fractionated sequencing library of the first aspect.
[0042] In some embodiments, the method may further comprise amplifying the polynucleotides, to form an amplified fractionated sequencing library. Preferably, however, the method does not comprise amplifying the polynucleotides, and the fractionated sequencing library is preferably an unamplified fractionated sequencing library.
[0043] In some embodiments, the method may further comprise:
[0044] (vii) forming a pooled fraction comprising: (a) at least a portion of the second subset of the first fraction, but not the first subset; and
[0045] (b) at least a portion of the second fraction.
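The bookkeeping of steps (v) to (vii) can be sketched as a toy model. This is an illustration under stated assumptions, not the inventors' implementation: the `Fragment` class, the function names, the perfect label-based separation, and the 50/50 split are all hypothetical, and a real capture step would enrich rather than separate perfectly.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    fragment_id: int
    has_affinity_label: bool  # True if a tag with a bound affinity label is present


def fractionate(library):
    """Step (v): split a library into labelled (first) and unlabelled (second) fractions."""
    first = [f for f in library if f.has_affinity_label]
    second = [f for f in library if not f.has_affinity_label]
    return first, second


def split_into_subsets(first_fraction, share_for_first_subset, rng):
    """Step (vi): divide the first fraction non-selectively into two subsets."""
    shuffled = list(first_fraction)
    rng.shuffle(shuffled)  # random order, so neither subset is selected by content
    cut = round(len(shuffled) * share_for_first_subset)
    return shuffled[:cut], shuffled[cut:]


def pool(second_subset, second_fraction):
    """Step (vii): combine the second subset with the second fraction."""
    return list(second_subset) + list(second_fraction)


rng = random.Random(0)  # fixed seed so the toy run is repeatable
library = [Fragment(i, has_affinity_label=(i % 3 == 0)) for i in range(30)]
first_fraction, second_fraction = fractionate(library)
first_subset, second_subset = split_into_subsets(first_fraction, 0.5, rng)
pooled = pool(second_subset, second_fraction)
```

The model tracks only which fragments end up where. It does not represent genomic coverage, so it says nothing about whether a pooled fraction spans the full sequence of a real sample.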
[0046] In some embodiments, the method may further comprise target enrichment for one or more genome regions in:
[0047] (i) the first subset of the first fraction; (ii) the second fraction; and / or
[0048] (iii) the pooled fraction.
[0049] Any suitable method of target enrichment may be used, including, for example:
[0050] (a) in-solution hybridization using oligonucleotide “bait” probes specific for genomic regions of interest; and / or
[0051] (b) a PCR-based enrichment method.
[0052] In preferred embodiments, the method does not comprise amplifying the polynucleotides before fractionation. In some embodiments, the polynucleotides may be amplified:
[0053] (a) after step (v) and before step (vi);
[0054] (b) after step (vi); and / or
[0055] (c) after step (vii). In some embodiments, the method may comprise the amplification of the polynucleotides after inactivation of the methyltransferase (i.e. after step (ii) or after step (iii)). Preferably, however, the method does not comprise the amplification of the polynucleotides after inactivation of the methyltransferase (i.e. after step (ii) or after step (iii)).
[0056] In some embodiments, the method may additionally or alternatively comprise the amplification of the polynucleotides of the sequencing library after binding of the affinity label (i.e. after step (iv)). Preferably, however, the method does not comprise the amplification of the polynucleotides of the sequencing library after binding of the affinity label (i.e. after step (iv)).
[0057] In some embodiments, the method may comprise the amplification of the polynucleotides of the first and / or second fraction (i.e. after step (v)). In some embodiments, the method may comprise the amplification of the polynucleotides of the separate first and / or second subsets of the first fraction (i.e. after step (vi)).
[0058] The library preparation step (step (iii)) is performed before the fractionation step (step (v)). In some embodiments, steps (i)-(v) of the method are performed in the numerical sequence in ascending order (i.e. in sequence from step (i) to step (v)). In some embodiments, the library preparation step (step (iii)) may be performed before the affinity labelling and fractionation steps (i.e. before steps (iv) and (v)). In other embodiments, step (iii) may be performed after step (iv), and before step (v). In some embodiments, one or more aspects of the library preparation step, such as end repair, A-tailing, and / or adapter ligation, may be performed before step (i). In such embodiments, the remaining step(s) of library preparation may be performed subsequently, such as after the inactivation of the methyltransferase (step (ii)), or after affinity labelling (step (iv)).
[0059] In some embodiments, the polynucleotide sample may be prepared into an affinity labelled sequencing library (steps (i)-(iv)) prior to fractionation.
[0060] In some embodiments, the sequencing library is fractionated (step (v)) prior to producing separate first and second subsets of the first fraction (step (vi)).
[0061] In some embodiments, the method may be a “one pot” method, wherein all of the steps prior to fractionation (i.e. steps (i)-(iv)) are performed in a single container, thereby providing significant efficiencies in terms of time and reagents, and advantages in terms of automation. This approach has been found to maximise sensitivity and the overall yield of the polynucleotide enrichment, while minimising the time and reagents required.
[0062] In some embodiments, the method may comprise at most one sample purification step prior to fractionation (i.e. prior to step (v)).
[0063] In some embodiments, binding an affinity label to each tag may comprise adding an affinity label precursor directly into the sequencing library preparation mixture, without a washing step. In some embodiments, using a methyltransferase enzyme to apply a tag to each unmodified nucleotide residue may comprise the use of a methyltransferase cofactor analogue.
[0064] In some embodiments, the polynucleotide sample may comprise DNA, and the target position of the methyltransferase enzyme may be selected from the group consisting of the cytosine C5 position, the cytosine N4 position, and the adenine N6 position. In some embodiments, the affinity label may comprise biotin, and fractionating the sequencing library into first and second fractions may comprise fractionation using a capture agent comprising a biotin-binding protein.
[0065] In some embodiments, amplifying the polynucleotides may further comprise the incorporation of an indexing barcode into the amplified polynucleotides. In some embodiments, the indexing barcode may comprise a unique dual nucleotide index, and / or a unique molecular identifier.
[0066] In a third aspect, there is provided a fractionated sequencing library, wherein the fractionated sequencing library is obtained or obtainable from a polynucleotide sample by a method of the second aspect. The fractionated sequencing library may be an unamplified fractionated sequencing library.
[0067] Any of the statements of invention in relation to the first and / or second aspects may be applied to the fractionated sequencing library of the third aspect. Indeed, any of the statements of invention in relation to any aspect of the invention may be applied to any of the other aspects of the invention as appropriate.
[0068] In a fourth aspect, there is provided a method for determining the presence of a genetic mutation in a polynucleotide sample, the method comprising: (i) obtaining a fractionated sequencing library of the first or third aspects, or preparing a fractionated sequencing library using a method of the second aspect;
[0069] (ii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the polynucleotide sample; and (iii) sequencing the polynucleotides of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the polynucleotide sample.
[0070] As used herein, unless otherwise stated, the terms “genetic mutation” and “genetic sequence mutation” refer to any change in the nucleotide sequence relative to a control, native, or wild-type sequence. The control sequence may be, for example, a healthy genome or polynucleotide obtained from the same individual as the test sample.
[0071] In some embodiments, the method may comprise amplifying the polynucleotides before forming the pooled fraction. In such embodiments, the method may further comprise determining the proportion of the initial polynucleotide sample that is present in the first fraction and adding a corresponding quantity of the amplified second subset to at least a portion of the second fraction to form a pooled fraction that has a polynucleotide content that is representative of the initial polynucleotide sample.
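The proportioning step described above can be read as simple arithmetic. The sketch below assumes one possible interpretation: if a proportion p of the initial sample was captured in the first fraction, the pooled fraction should contain first-fraction-derived and second-fraction material in the ratio p : (1 - p). The function name, the units (ng), and the numbers are hypothetical and are not taken from the source.

```python
def second_subset_quantity_to_add(first_fraction_proportion, second_fraction_quantity):
    """Quantity of amplified second subset to pool with a given quantity of second fraction.

    Assumes the pooled fraction should restore the p : (1 - p) ratio between
    first-fraction and second-fraction material, where p is the proportion of
    the initial sample found in the first fraction. Both quantities share a unit.
    """
    p = first_fraction_proportion
    if not 0 <= p < 1:
        raise ValueError("proportion must satisfy 0 <= p < 1")
    if second_fraction_quantity < 0:
        raise ValueError("quantity must be non-negative")
    return second_fraction_quantity * p / (1 - p)


# Hypothetical numbers: 20% of the initial sample was captured in the first
# fraction, and 8 ng of the second fraction is being pooled.
to_add_ng = second_subset_quantity_to_add(0.2, 8.0)
pooled_share_from_first = to_add_ng / (to_add_ng + 8.0)
```

Under these assumptions the share of first-fraction material in the pool equals p again, which is the sense in which the pool would be representative of the initial sample.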
[0072] In some embodiments, the method may further comprise determining the modification status of nucleotide residues in the polynucleotide sample. In such embodiments, the method may comprise:
[0073] (iv) sequencing the polynucleotides of the first subset of the first fraction, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample.
[0074] In some embodiments, the method further comprises enriching:
[0075] (i) the first subset of the first fraction; (ii) the second fraction; and / or
[0076] (iii) the pooled fraction, for one or more genome regions of interest prior to sequencing. Any suitable method for target enrichment may be used to selectively isolate and / or amplify one or more specific genome regions of interest for sequencing analysis.
[0077] In some embodiments, the method further comprises amplification of the polynucleotides of the pooled fraction after target enrichment.
[0078] In some embodiments, target enrichment of the pooled fraction comprises: (a) in-solution hybridization using oligonucleotide “bait” probes specific for genomic regions of interest; and / or
[0079] (b) a PCR-based enrichment method.
[0080] In some embodiments in which the polynucleotide sample comprises DNA, the target position of the methyltransferase enzyme may consist of the cytosine C5 position. In such embodiments, the method may comprise: (i) using a methyltransferase enzyme configured to modify the cytosine C5 position of a CpG dinucleotide to apply a tag to each unmodified cytosine residue of the sample, wherein each unmodified cytosine residue is the cytosine of a CpG dinucleotide that is unmodified in the C5 position; (ii) inactivating the methyltransferase;
[0081] (iii) preparing the DNA sample into a sequencing library;
[0082] (iv) binding an affinity label to each tag;
[0083] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for DNA molecules comprising an affinity label, and wherein the second fraction is enriched for DNA molecules lacking an affinity label;
[0084] (vi) producing separate first and second subsets of the first fraction;
[0085] (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the DNA sample; and (viii) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample.
[0086] In some embodiments, the method may comprise amplifying the DNA molecules:
[0087] (a) after step (v) and before step (vi); and / or (b) after step (vi).
[0088] In some embodiments, the methyltransferase enzyme may be a C5 methyltransferase.
[0089] In such embodiments, the methyltransferase cofactor analogue may be ETA-AdoHcy-N3.
[0091] In some embodiments, the methyltransferase enzyme may consist of or comprise a variant or fragment of the wild type M.MpeI sequence. For example, the methyltransferase enzyme may consist of or comprise a sequence having at least 80% sequence identity to the wild type M.MpeI sequence having the NCBI accession number BAC44284.
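A sequence identity threshold such as the 80% figure above can be checked as follows. This is a minimal sketch assuming the two sequences have already been aligned and that identity is counted as matching columns divided by aligned columns; other conventions use a different denominator, and the source does not state which applies. The function name and the toy peptide strings are hypothetical and are not the wild type methyltransferase sequence.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two pre-aligned sequences of equal length.

    Matches are divided by the number of aligned columns; columns in which
    both sequences carry a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


# Toy, made-up peptide strings with one substitution and one gap.
reference = "MKTAYIAKQR"
variant = "MKTAYLAK-R"
identity = percent_identity(reference, variant)
meets_threshold = identity >= 80.0
```

In practice the alignment itself would come from a dedicated alignment tool, and the result depends on the alignment parameters used.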
[0092] The inventors have found that the disclosed method provides significant advantages over previous methods of epigenetic analysis. The non-destructive nature of the disclosed enrichment method means that multiple analytical approaches may be performed in parallel, such as for example, nucleotide sequencing and epigenetic analysis of the polynucleotide sample. In addition, the inventors have found that the disclosed enrichment method is highly efficient, allowing analysis of very low levels of input sample. Moreover, the inventors have identified that previous “unmethylome” profiling approaches introduce bias in relation to densely modified polynucleotides, and the disclosed method avoids this detrimental bias. As discussed herein, these advantages have been made possible by minimising the loss of sample, by performing various operations in specific sequences and combinations.
[0093] In a fifth aspect, there is provided a method for determining the modification status of nucleotide residues in a polynucleotide sample, the method comprising:
[0094] (i) obtaining a fractionated sequencing library of the first or third aspects, or preparing a fractionated sequencing library using a method of the second aspect; and
[0095] (ii) sequencing the polynucleotides of the first subset, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample.
[0096] Thus, in some embodiments, the method may be a method for determining the modification status of nucleotide residues in a polynucleotide sample, the method comprising: (i) using a methyltransferase enzyme configured to modify a nucleotide residue in a target position to apply a tag to each unmodified nucleotide residue in a polynucleotide of the sample, wherein each unmodified nucleotide residue is unmodified in the target position;
[0097] (ii) inactivating the methyltransferase; (iii) preparing the polynucleotide sample into a sequencing library;
[0098] (iv) binding an affinity label to each tag;
[0099] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label; (vi) producing separate first and second subsets of the first fraction; and
[0100] (vii) sequencing the polynucleotides of the first subset, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample. In some embodiments, the method may comprise amplifying the polynucleotides:
[0101] (a) after step (v) and before step (vi); and / or (b) after step (vi).
[0102] In a sixth aspect, there is provided a method for determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide sample, the method comprising:
[0103] (i) using a methyltransferase enzyme configured to modify a nucleotide residue in a target position to apply a tag to each unmodified nucleotide residue in a polynucleotide of the sample, wherein each unmodified nucleotide residue is unmodified in the target position; (ii) inactivating the methyltransferase;
[0104] (iii) preparing the polynucleotide sample into a sequencing library;
[0105] (iv) binding an affinity label to each tag;
[0106] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label;
[0107] (vi) producing separate first and second subsets of the first fraction;
[0108] (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the polynucleotide sample; (viii) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the polynucleotide sample; and
[0109] (ix) sequencing the polynucleotides of the first subset, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample.
[0110] In some embodiments, the method may comprise amplifying the polynucleotides:
[0111] (a) after step (v) and before step (vi); and / or
[0112] (b) after step (vi).
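The bookkeeping behind steps (v) to (vii) of the sixth aspect may be sketched in Python as follows. This is an idealised model and not the disclosed laboratory procedure: it assumes every unmodified target site is tagged and labelled, that capture is perfect, and that each fragment is present in several copies so that both subsets of the first fraction contain it. The names `Library`, `fractionate`, `split_into_subsets` and `pool` and the toy fragments are hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple

# Hypothetical toy input: fragment name -> (copy number, unmodified target sites)
Library = Dict[str, Tuple[int, int]]


def fractionate(library: Library) -> Tuple[Counter, Counter]:
    """Step (v): label-bearing fragments go to the first fraction,
    label-free fragments to the second (idealised, perfect capture)."""
    first, second = Counter(), Counter()
    for name, (copies, unmodified_sites) in library.items():
        target = first if unmodified_sites > 0 else second
        target[name] += copies
    return first, second


def split_into_subsets(fraction: Counter, share: float = 0.5) -> Tuple[Counter, Counter]:
    """Step (vi): divide the copies of each fragment between two subsets."""
    first_subset, second_subset = Counter(), Counter()
    for name, copies in fraction.items():
        kept = int(copies * share)
        if kept:
            first_subset[name] = kept
        if copies - kept:
            second_subset[name] = copies - kept
    return first_subset, second_subset


def pool(second_subset: Counter, second_fraction: Counter) -> Counter:
    """Step (vii): combine the second subset with the second fraction."""
    return second_subset + second_fraction


library = {"frag_A": (4, 3), "frag_B": (4, 0), "frag_C": (2, 1)}
first_fraction, second_fraction = fractionate(library)
first_subset, second_subset = split_into_subsets(first_fraction)
pooled = pool(second_subset, second_fraction)
```

In this toy run the first subset (for modification profiling) holds only the tagged fragments, whereas the pooled fraction (for mutation detection) contains every fragment of the input library.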
[0113] As used herein, unless otherwise stated, the terms “method of” and “method for”, such as “method of determining” and “method for determining”, are intended to be interpreted interchangeably, to encompass methods “suitable for” the described purpose. The “nucleotide modification status” and “modification status of nucleotide residues” as used herein, unless otherwise stated, refer to the presence (modified) or absence (unmodified) of any chemical modification in a target position on a nucleotide residue that may be catalysed by a methyltransferase enzyme. Thus, the disclosed method may be used to determine the modification status of any position within a nucleotide that may be chemically modified by a methyltransferase enzyme. Such methyltransferase catalysed modifications include, for example, the modification of cytosine (at the C5 or N4 position) and adenine (at the N6 position). The nucleotide residues may be cytosine residues and / or adenine residues.
[0114] Accordingly, the method may be a method for determining the modification status at target positions of cytosine and / or adenine residues in a polynucleotide.
[0115] The “modification status of cytosine residues” as used herein, unless otherwise stated, refers to the presence (modified) or absence (unmodified) of any methyltransferase catalysed chemical modification of cytosine.
[0116] In particular, the modification may comprise modification at the C5 position of cytosine. A cytosine residue may be understood to have the following structure:
[0117] In an unmodified cytosine residue, R1 may be understood to be H. R1 may also be referred to herein as the “C5 position”.
[0118] In a modified cytosine residue, R1 may be anything other than H. Thus, “modified cytosine” refers to any cytosine residue that has been modified in any way at the C5 position, including, in particular, 5-methylcytosine (5-mC) and its oxidized products 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC). It may therefore be appreciated that in modified cytosine, R1 may be methyl, CH2OH, COH or COOH.
[0119] Alternatively, the modification may comprise modification at the N4 position of cytosine. Accordingly, a modified cytosine may have the following structure: where R1 may be anything other than H. Thus, “modified cytosine” refers to any cytosine residue that has been modified in any way at the N4 position. It may therefore be appreciated that in modified cytosine, R1 may be methyl, CH2OH, COH or COOH.
[0120] In some embodiments in which the polynucleotide sample comprises DNA, the method may further comprise: (a) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample; and / or
[0121] (b) sequencing the DNA of the first subset, and using the sequencing information to determine the modification status at the cytosine C5 position of each CpG dinucleotide in the DNA sample.
[0122] In some embodiments in which the polynucleotide sample comprises DNA, the target position of the methyltransferase enzyme may consist of the cytosine C5 position. In such embodiments, the method may comprise: (i) using a methyltransferase enzyme configured to modify the cytosine C5 position of a CpG dinucleotide to apply a tag to each unmodified cytosine residue of the sample, wherein each unmodified cytosine residue is the cytosine of a CpG dinucleotide that is unmodified in the C5 position;
[0123] (ii) inactivating the methyltransferase; (iii) preparing the DNA sample into a sequencing library;
[0124] (iv) binding an affinity label to each tag;
[0125] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for DNA molecules comprising an affinity label, and wherein the second fraction is enriched for DNA molecules lacking an affinity label; (vi) producing separate first and second subsets of the first fraction;
[0126] (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the DNA sample;
[0127] (viii)(a) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample; and / or (b) sequencing the DNA molecules of the first subset, and using the sequencing information to determine the modification status at the cytosine C5 position of each CpG dinucleotide in the DNA sample.
[0128] In some embodiments, the method may comprise amplifying the polynucleotides: (a) after step (v) and before step (vi); and / or
[0129] (b) after step (vi).
[0130] In some embodiments in which the polynucleotide sample comprises DNA, the method may further comprise: (a) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample; and / or
[0131] (b) sequencing the DNA of the first subset, and using the sequencing information to determine the modification status at the cytosine N4 position of each CpG dinucleotide in the DNA sample.
[0132] In some embodiments in which the polynucleotide sample comprises DNA, the target position of the methyltransferase enzyme may consist of the cytosine N4 position. In such embodiments, the method may comprise: (i) using a methyltransferase enzyme configured to modify the cytosine N4 position of a CpG dinucleotide to apply a tag to each unmodified cytosine residue of the sample, wherein each unmodified cytosine residue is the cytosine of a CpG dinucleotide that is unmodified in the N4 position;
[0133] (ii) inactivating the methyltransferase; (iii) preparing the DNA sample into a sequencing library;
[0134] (iv) binding an affinity label to each tag;
[0135] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for DNA molecules comprising an affinity label, and wherein the second fraction is enriched for DNA molecules lacking an affinity label; (vi) producing separate first and second subsets of the first fraction;
[0136] (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the DNA sample; and
[0137] (viii)(a) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample; and / or
[0138] (b) sequencing the DNA molecules of the first subset, and using the sequencing information to determine the modification status at the cytosine N4 position of each CpG dinucleotide in the DNA sample. In some embodiments, the method may comprise amplifying the polynucleotides:
[0139] (a) after step (v) and before step (vi); and / or
[0140] (b) after step (vi).
[0141] The “modification status of adenine residues” as used herein, unless otherwise stated, refers to the presence (modified) or absence (unmodified) of any chemical modification at the N6 position of adenine. An adenine residue may be understood to have the following structure:
[0142] In an unmodified adenine residue, R1 may be understood to be H.
[0143] In a modified adenine residue, R1 may be anything other than H. Thus, “modified adenine” refers to any adenine residue that has been modified in any way at the N6 position, including, in particular, 6-methyladenine (m6A). It may therefore be appreciated that in modified adenine, R1 may be methyl, CH2OH, COH or COOH.
[0144] In some embodiments in which the polynucleotide sample comprises DNA, the method may further comprise:
[0145] (a) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample; and / or
[0146] (b) sequencing the DNA of the first subset, and using the sequencing information to determine the modification status at the adenine N6 position in the DNA sample. In some embodiments in which the polynucleotide sample comprises DNA, the target position of the methyltransferase enzyme may consist of the adenine N6 position. In such embodiments, the method may comprise: (i) using a methyltransferase enzyme configured to modify the adenine N6 position to apply a tag to each unmodified adenine residue of the sample, wherein each unmodified adenine residue is unmodified in the N6 position;
[0147] (ii) inactivating the methyltransferase;
[0148] (iii) preparing the DNA sample into a sequencing library; (iv) binding an affinity label to each tag;
[0149] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for DNA molecules comprising an affinity label, and wherein the second fraction is enriched for DNA molecules lacking an affinity label;
[0150] (vi) producing separate first and second subsets of the first fraction; (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the DNA sample; and
[0151] (viii)(a) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample; and / or
[0152] (b) sequencing the DNA molecules of the first subset, and using the sequencing information to determine the modification status at the adenine N6 position in the DNA sample. In some embodiments, the method may comprise amplifying the polynucleotides:
[0153] (a) after step (v) and before step (vi); and / or
[0154] (b) after step (vi).
[0155] As used herein, “unmodified” and “unmethylated” refer to all nucleotides that have not been modified in any way (for example, by methylation, hydroxymethylation, carboxylation or acylation).
[0156] In some embodiments, the method comprises the detection of unmodified nucleotide residues in a polynucleotide. In some embodiments, the method comprises the detection of a genetic mutation in the polynucleotide. In some embodiments, the method comprises the detection of unmodified nucleotide residues, and a genetic mutation, in the polynucleotide.
[0157] In embodiments in which the method comprises determining, for a plurality of nucleotides, the presence (modified) or absence (unmodified) of any chemical modification catalysed by a methyltransferase, “modification status” may also be referred to as the “profile”. Thus, the disclosed method may be used to determine a profile of methyltransferase catalysed modifications within a polynucleotide sample. The method may comprise the detection of unmodified cytosine residues in CpG dinucleotides of a DNA sample. The terms “CpG”, “CpG site”, and “CpG dinucleotide”, are used interchangeably herein to refer to a cytosine-phosphate-guanine sequence in a 5’ to 3’ direction in the backbone of a nucleic acid. The terms “CpG modification status” and “modification status” as used interchangeably herein, unless otherwise stated, refer to the presence (modified) or absence (unmodified) of any chemical modification at the C5 position of cytosine within one or a plurality of CpG dinucleotides. In embodiments in which the method comprises determining, for a plurality of CpG dinucleotides, the presence (modified) or absence (unmodified) of any chemical modification at the C5 position of each of the plurality of cytosines, the “CpG modification status” and “modification status” may also be referred to as the “profile”. In embodiments in which the profile corresponds to the entire genome, the profile may be referred to as the “unmethylome profile” or “unmethylome”.
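The definitions of a CpG dinucleotide and of a CpG modification profile can be illustrated with a minimal Python sketch. It is not part of the disclosed method: the function names, the 0-based coordinates and the representation of modified cytosines as a set of positions are assumptions chosen for the example.

```python
from typing import Dict, List, Set


def find_cpg_sites(sequence: str) -> List[int]:
    """Return the 0-based position of the cytosine in each 5'-CG-3'
    dinucleotide of the given strand."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


def cpg_profile(sequence: str, modified_positions: Set[int]) -> Dict[int, str]:
    """Map each CpG cytosine to 'modified' or 'unmodified'.

    Assumption: `modified_positions` lists cytosines known to carry a
    C5 modification; every other CpG cytosine is treated as unmodified.
    """
    return {
        pos: "modified" if pos in modified_positions else "unmodified"
        for pos in find_cpg_sites(sequence)
    }
```

For the toy sequence `ACGTCGGCA`, `find_cpg_sites` reports the cytosines at positions 1 and 4, and a profile built with position 4 marked as modified labels position 1 as unmodified.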
[0158] The terms “genetic profile”, and “genetic mutation profile” as used herein, unless otherwise indicated, may be used to refer to the presence or absence of a genetic mutation at each position within the polynucleotide sequence of interest. Thus, the disclosed method may be used to determine a profile of genetic mutations within a polynucleotide sample.
[0159] A profile may comprise both the modification status of nucleotide residues and the genetic profile of a polynucleotide from the subject. The polynucleotide may be a DNA sample, or may be a mixed sample, comprising DNA and RNA.
[0160] The sample may be a DNA sample comprising an epigenome. The terms “epigenome” and “epigenetic” as used herein, unless otherwise specified, refer to the chemical modification of a polynucleotide or genome in such a way that gene expression is regulated. Thus, the method may be a method for determining the epigenetic profile and / or genetic mutation profile of a genomic DNA sample, the method further comprising determining the epigenetic profile based on the modification status of nucleotide residues in the sample. For example, the method may comprise determining the epigenetic profile based on the modification status of cytosine residues in CpG dinucleotides of the sample, i.e. the CpG modification status.
[0161] The method may be a method for analysing a polynucleotide sample, such as a DNA sample, from a subject.
[0162] Sample
[0163] The sample for use in the disclosed method may be obtained from any type of cell or tissue. For example, the sample may be obtained from tissue, blood, plasma, serum, urine, saliva, stool, cerebrospinal fluid, buccal swab, pleural tap, etc. The sample may be obtained from tissue. The sample may be obtained from blood. The sample may be a DNA sample. The DNA sample may be a cfDNA sample, which may comprise ctDNA. For example, the DNA sample may be a cfDNA sample from peripheral blood. A “cell-free” sample as used herein, refers to nucleic acids not contained within or otherwise bound to a cell, or remaining in a sample following the removal of intact cells. Cell-free nucleic acids can include, for example, all non-encapsulated nucleic acids sourced from a bodily fluid (e.g., blood, plasma, serum, cerebrospinal fluid, etc.) from a subject. The cfDNA may be released into bodily fluid through secretion or a cell death process. The cfDNA may comprise DNA released into bodily fluid from cancer cells, and may be referred to as comprising circulating tumor DNA (ctDNA). The cfDNA may be released from healthy cells.
[0164] Methods of preparing samples for use in the disclosed method, such as DNA samples, comprising, for example, extracting and purifying nucleic acids such as DNA from cells or tissues, will be known to the skilled person. Any method that is suitable for preparing a polynucleotide sample for analysis, such as sequencing, may be used. Surprisingly and advantageously, the inventors have found that due to the sensitivity of the disclosed method, amplification of the input sample, for example using a PCR-based method, is not required. As a result, problems arising from polynucleotide amplification, including for example, the introduction of sequence errors, GC bias, reduced library quality, etc., may advantageously be avoided.
[0165] The sample may comprise fragmented DNA. Fragmentation may be performed using any method used in the analysis of DNA, such as any fragmentation method used in the preparation of a DNA sample for genetic sequencing. For example, the DNA may be fragmented enzymatically, by acoustic shearing, mechanical shearing (for example, French pressure cells), sonicating, hydrodynamic shearing or chemically (for example, heat and divalent metal cation). In some implementations of the disclosed method, fragmentation of the DNA sample is not required. In such embodiments, the method does not include fragmentation of the DNA sample. For example, in embodiments in which the DNA sample is degraded or fragmented, such as when cfDNA is extracted from blood, the DNA sample may be used directly in the disclosed method. Typically, cfDNA extracted from blood substantially comprises DNA fragment sizes of 50-300 base pairs (bp) in length.
[0166] The method may comprise the selection of polynucleotides, such as DNA fragments, of a desired length. Thus, in some embodiments, the method may comprise, before the use of a methyltransferase (step (i)), a step of fragmenting the polynucleotide and / or selecting polynucleotides of a desired length.
[0167] The method may comprise the use of polynucleotides, such as DNA fragments, substantially or predominantly having a length between 10 and 500 bp, such as between 30 and 400 bp, and preferably between 50 and 300 bp in length. In some embodiments, the method may comprise the use of polynucleotides, such as DNA fragments, substantially or predominantly having a length in the region of between 100 and 250 bp, preferably between about 150 and 180 bp, to match the sample to the DNA sequencing read length. The method may comprise the use of polynucleotides corresponding to an amount of DNA in the range of about 1 fg to about 1 µg, such as about 10 fg to about 100 ng, about 100 fg to about 10 ng, or about 1 pg to about 1 ng. The sample may comprise a quantity of DNA in the picogram range. The sample may comprise less than 1 µg of DNA, such as less than 500 ng, less than 100 ng, or less than 10 ng of DNA. Preferably, the sample comprises between 1 ng and 100 ng of DNA.
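As a simple illustration of selecting polynucleotides of a desired length, the following Python sketch filters a list of fragment lengths against a length window. The function name and the example lengths are hypothetical. The default window of 50 to 300 bp is taken from the preferred range stated above, and in silico filtering is only a stand-in for whatever physical size selection is used in practice.

```python
from typing import Iterable, List


def select_by_length(
    fragment_lengths: Iterable[int], min_bp: int = 50, max_bp: int = 300
) -> List[int]:
    """Keep fragment lengths inside the inclusive window [min_bp, max_bp]."""
    if min_bp > max_bp:
        raise ValueError("min_bp must not exceed max_bp")
    return [n for n in fragment_lengths if min_bp <= n <= max_bp]
```

Calling `select_by_length([35, 50, 167, 300, 450])` keeps the 50, 167 and 300 bp fragments. A narrower window, such as `min_bp=150, max_bp=180`, could be passed to match a particular sequencing read length.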
[0168] Derivatisation of DNA - Tag
[0169] The polynucleotide may be derivatised using a methyltransferase enzyme to apply a tag to unmodified nucleotide residues. For example, the polynucleotide may be derivatised using an appropriate methyltransferase enzyme to apply a tag to specific nucleotide residues that are unmodified in target positions. Thus, to determine the modification status of a specific nucleotide residue at a specific target position, a methyltransferase enzyme may be used that is configured to apply a tag to the target nucleotide residue in the target position.
[0170] In some embodiments, the same tag may be used to derivatise different unmodified nucleotides and / or different target positions. In other embodiments, different tags may be used to derivatise different unmodified nucleotides and / or different target positions.
[0171] In some embodiments, the polynucleotide may be derivatised with different tags (e.g. on different unmodified nucleotides) sequentially, for example, with the inactivation of the first methyltransferase before the addition of a second, different methyltransferase.
[0172] The use of a plurality of different tags advantageously allows the tags to be independently functionalised, for example, to provide selective enrichment / fractionation.
[0173] As used herein, unless otherwise indicated, references to the “target position” in which an unmodified nucleotide residue is unmodified refer to a specific position within the chemical structure of the nucleotide. The tag may be applied in the cytosine C5 position. Accordingly, the fragmented DNA sample will comprise tagged residues. A tagged cytosine residue may be understood to have the following structure: wherein R2 is the tag.
[0174] The tag may be applied in the N4 position in cytosine. Accordingly, the fragmented DNA sample will comprise tagged residues. A tagged cytosine residue may be understood to have the following structure: wherein R2 is the tag.
[0175] Similarly, a tag may be added at the N6 position of adenine, the N2 or N7 position of guanine or at the 2’-OH position of ribose. In some embodiments, a tag at the 2’-OH position of ribose is a tag at the 2’-OH position of a terminal ribose.
[0176] Accordingly, a tagged adenine residue may be understood to have the following structure: wherein R2 is the tag.
[0177] The term “tag”, which may also be referred to as a “linker”, a “functional linker” or “DNA tag”, as used herein, unless otherwise specified, refers to a reactive moiety that is applied site-specifically to the polynucleotide, such as fragmented DNA. Polynucleotides that have been tagged in this way may be referred to as “derivatised”. The disclosure may comprise the use of a methyltransferase cofactor analogue, such as a synthetic methyltransferase cofactor analogue, comprising the tag and a methyltransferase-binding moiety. Thus, the method may comprise the use of a methyltransferase enzyme to catalyse the transfer of the tag from the methyltransferase cofactor analogue to an unmodified nucleotide residue, such as to the C5 position of a cytosine base of an unmodified CpG dinucleotide, in a polynucleotide sample. The presence of a modification, such as a methyl group or other chemical modification of the nucleotide residue, such as in the C5 position within a CpG dinucleotide, prevents the transfer of the tag. Thus, in the disclosed method, only nucleotides, such as CpG dinucleotides, that are unmodified (such as unmethylated) in this position may be labelled with a tag.
[0178] The methyltransferase cofactor may be an ion of formula (I): wherein, X is S or Se;
[0179] L1 is -CH2- or -CH2CH2-; R2 is the tag;
[0180] R3 and R4 are independently H or an optionally substituted C1-6 alkyl, an optionally substituted C2-6 alkenyl or an optionally substituted C2-6 alkynyl; or R3 and R4, together with the nitrogen to which they are attached, form an optionally substituted 5- or 6-membered heterocyclyl ring; and R5 is NH2, NHBOC or H; or a salt, solvate or tautomer thereof.
[0181] The ion of formula (I) may be provided together with a counterion. The counterion may be an organic or inorganic anion carrying one or more negative charges. The counterion may be formate or acetate. R2 may be -CH2-U-[L3]m-[HM]n-[L2]p-[R6]q, wherein: m, n, p and q are each independently selected from 0 and 1;
[0182] L2 is a linker;
[0183] HM is a hydrolysable moiety; L3 is a linker;
[0184] U is an unsaturated group selected from an alkene, an alkyne, an aromatic group (e.g. aryl), a carbonyl group, SO and SO2;
[0185] R6 is a heavy atom or a heavy atom cluster suitable for phasing of X-ray diffraction data, a radioactive or stable rare isotope, a fluorophore, a fluorescence quencher, an affinity tag, a crosslinking agent, a nucleic acid cleaving reagent, a spin label, a chromophore, a protein, peptide or amino acid which may optionally be modified, a nucleotide, nucleoside or nucleic acid which may optionally be modified, a carbohydrate, a lipid, a transfection reagent, an intercalating agent, a nanoparticle or bead, or a functional group, wherein the functional group is selected from the group consisting of: an amino group (including a protected amino), a thiol group, a 1,2-diol group, a hydrazino group, a hydroxyamino group, a haloacetamide group, a maleimide group, a cyanide group, a cyclic hydrocarbon (such as a bridged cyclic hydrocarbon (e.g. norbornene) or a cycloalkyl group (e.g. a C3-6 cycloalkyl)), a halo group (e.g. -F, -Cl, -Br, -I), an aldehyde group, a ketone group, a 1,2-aminothiol group, an azido group, an isothiocyanate or thiocyanate group, an alkene group, such as a terminal alkene, an alkyne group, such as a terminal alkyne group, a 1,3-diene function, a dienophilic function (e.g. an activated carbon-carbon double bond), an arylhalide group, an arylboronic acid group, a terminal haloalkyne group, a terminal silylalkyne group, -N=C=O, -N=C=S, -O-C(O)NH2, a protected amino, a group comprising a sterically strained alkyne or alkene (such as norbornene or DBCO), a nitrone, a tetrazine, a tetrazole, and 1,2-aminothiol group.
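Because each of the indices m, n, p and q in the R2 formula takes the value 0 or 1, the formula spans sixteen skeletons that differ in which optional segments are present. The Python sketch below enumerates them purely as text labels; it is a notational aid under that reading of the formula, not a chemical structure generator, and the names `OPTIONAL_SEGMENTS` and `r2_skeletons` are hypothetical.

```python
from itertools import product
from typing import List

# Optional segments in the order they appear in the R2 formula,
# governed by the indices m, n, p and q respectively.
OPTIONAL_SEGMENTS = ("L3", "HM", "L2", "R6")


def r2_skeletons() -> List[str]:
    """List every -CH2-U-[L3]m-[HM]n-[L2]p-[R6]q skeleton as a text label."""
    skeletons = []
    for flags in product((0, 1), repeat=len(OPTIONAL_SEGMENTS)):
        parts = ["CH2", "U"]
        parts += [seg for seg, present in zip(OPTIONAL_SEGMENTS, flags) if present]
        skeletons.append("-" + "-".join(parts))
    return skeletons
```

The shortest label returned is `-CH2-U` (all indices 0) and the longest is `-CH2-U-L3-HM-L2-R6` (all indices 1).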
[0186] In embodiments where R4 is an optionally substituted C1-4 alkyl, an optionally substituted C2-4 alkenyl or an optionally substituted C2-4 alkynyl, the alkyl, alkenyl or alkynyl may be unsubstituted or substituted with one or more substituents selected from the group consisting of: -NR7R8; -OH; -SH; -CN; -C(O)OR7; -C(O)R7; -C(O)NR7R8; -N3; and halo, wherein R7 and R8 are independently H or a C1-4 alkyl. Halo may be F, Cl, Br or I.
[0187] Similarly, in embodiments where R3 and R4, together with the nitrogen to which they are attached, form an optionally substituted 5- or 6-membered heterocyclyl ring, the 5- or 6-membered heterocyclyl ring may be unsubstituted, or substituted with one or more substituents selected from the group consisting of: -NR7R8; -OH; -SH; -CN; -C(O)OR7; -C(O)R7; -C(O)NR7R8; -N3; and halo, wherein R7 and R8 are independently H or a C1-4 alkyl. Halo may be F, Cl, Br or I.
[0188] Synthetic methyltransferase cofactors are described in more detail in PCT/GB2022/052438, EP3186266B1 and US8008007B2. It may be appreciated that preferred embodiments of the X, L1, R2, R3 and R4 groups in the compound of formula (I) may be as defined for the equivalent groups in these applications.
[0189] X may be S.
[0190] L1 may be -CH2CH2-.
[0191] R3 may be H. Alternatively, R3 may be an optionally substituted C1-4 alkyl, an optionally substituted C2-4 alkenyl or an optionally substituted C2-4 alkynyl, more preferably an optionally substituted methyl or an optionally substituted ethyl. The alkyl, alkenyl or alkynyl may be unsubstituted or substituted with an OH. Accordingly, R3 may be -CH2CH2OH. R4 may be H.
[0192] R5 may be NH2.
[0193] In some embodiments, q is 1. In some embodiments, R6 is -N3. p may be 1.
[0194] L2 may be a linker comprising a backbone of between 1 and 50 atoms, between 2 and 40 atoms, between 3 and 30 atoms, between 4 and 20 atoms or between 5 and 15 atoms. The backbone may be made up of carbon, oxygen and / or nitrogen atoms. In embodiments where the linker comprises a cyclic group, the backbone may be understood to consist of the atoms which define the shortest possible route between the two ends of the linker group. In some embodiments, L2 comprises between 1 and 5 groups selected from an optionally substituted hydrocarbon, an optionally substituted polyether chain, an arylene moiety and a (C=O)NH group. The hydrocarbon may be an optionally substituted alkylene, preferably an optionally substituted C1-10 alkylene and more preferably a C1-5 alkylene. The optionally substituted polyether chain may be an optionally substituted polyethylene glycol chain. The polyethylene glycol chain may comprise up to 15 monomers, up to 10 monomers or up to 5 monomers of ethylene glycol. In some embodiments, the polyethylene glycol chain consists of between 1 and 5 or between 2 and 3 monomers of ethylene glycol. The arylene moiety may be a C6H4 phenylene ring.
[0195] Accordingly, in some embodiments, L2 may be: wherein w is an integer from between 1 and 15, e.g. between 2 and 10 or between 3 and 5. In some embodiments, w is 2 or 3.
[0196] Alternatively, in some embodiments, p is 0. In some embodiments, n is 1. , wherein Rx is hydrogen, deuterium or a C1-4 alkyl. The C1-4 alkyl may be methyl.
[0197] The hydrolysable moiety may be a Schiff base, for example, an imine moiety, an oxime moiety and / or a hydrazone moiety.
[0198] In some embodiments, the hydrolysable moiety comprises a disulphide (S-S) bond.
[0199] In some embodiments, the hydrolysable moiety is
[0200] In some embodiments, n is 0.
[0201] In some embodiments, m is 1. L3 may be a linker comprising a linear chain of from 1 to 20, from 2 to 15, from 3 to 10 or from 4 to 9 atoms. The atoms may be carbon, oxygen and / or nitrogen atoms. The linker may be substituted or unsubstituted. In some embodiments, L3 comprises an optionally substituted hydrocarbon (e.g. an alkyl) chain. In some embodiments, L3 comprises an optionally substituted linear C1-10 alkyl chain, e.g. an optionally substituted C2-8 or an optionally substituted C4-6 alkyl chain. In some embodiments the alkyl chain is unsubstituted. In some embodiments the alkyl chain is substituted. In some embodiments, L3 is a linear, unsubstituted C2, C3 or C4 alkyl chain.
[0202] In some embodiments,
[0203] Accordingly, in some embodiments, R2 is . In alternative embodiments,
[0204] In some embodiments, the synthetic methyltransferase cofactor may be:
[0205] , where R is H.
[0206] The above compound may be called ETA-AdoHcy-N3.
[0207] Derivatisation of DNA - Methyltransferase
[0208] The methyltransferase may be any methyltransferase that is capable of using S-adenosyl methionine as a cofactor. Thus, the methyltransferase may be an S-adenosylmethionine-dependent methyltransferase, such as an S-adenosyl-L-methionine-dependent methyltransferase.
[0209] In some embodiments, the methyltransferase may be a cytosine-5 (C5) methyltransferase, such as a bacterial cytosine C5 methyltransferase. In some embodiments, the methyltransferase may be an adenine methyltransferase, such as a bacterial adenine methyltransferase. For example, the methyltransferase may be M.TaqI, which is a DNA adenine methyltransferase.
[0210] In some embodiments, the methyltransferase may be a methyltransferase from Mycoplasma.
[0211] In some embodiments, the methyltransferase may be a constitutively active methyltransferase. In some embodiments, the methyltransferase may be one of the enzymes described in US 2017/0283453.
[0212] In some embodiments, the methyltransferase may be M.MpeI, M.HhaI, M.SssI, M.AccII, M.MspI or M.TaqI. The methyltransferase may be an active mutant, variant, and / or fragment of M.MpeI, M.HhaI, M.SssI, M.AccII, M.MspI or M.TaqI. M.MpeI has been found to be particularly advantageous for use in the disclosed method, in part due to being particularly non-selective in terms of target locus. Thus, the methyltransferase may be M.MpeI or an active mutant, variant, and / or fragment thereof.
[0213] Although methyltransferase enzymes share a relatively low level of sequence similarity, they do share a highly conserved structural fold. This conserved fold is known as the Rossmann fold and comprises a series of beta strand and alpha helical segments, in which the beta strands are hydrogen bonded to form a beta-sheet.
[0214] The cofactor binding pocket of the methyltransferase enzyme may be modified within the Rossmann fold, for example, by substitution of one or more amino acids, to improve the suitability of the enzyme for use in the disclosed method, such as, for example, by improving cofactor compatibility. For example, one or more amino acids within the Rossmann fold of the methyltransferase enzyme may be substituted, for example, to reduce or relieve potential steric interaction with the cofactor analogue.
[0215] In some embodiments, the methyltransferase may be modified such that an amino acid having a relatively large side chain, such as, for example, glutamine or asparagine, is replaced with an amino acid comprising a shorter side chain, such as, for example, alanine. The use of a methyltransferase enzyme that has been modified in this way may be particularly desirable when larger cofactor analogues are used, such as cofactor analogues comprising transferrable groups with longer alkyl chains.
[0216] In some embodiments, the methyltransferase may be any bacterial cytosine C5 methyltransferase enzyme comprising one or more, such as 2, 3, 4, 5, 6, or 7, amino acid substitutions in the Rossmann fold. In some embodiments, the methyltransferase may be any bacterial cytosine C5 methyltransferase enzyme comprising an amino acid substitution in the position of the amino acid residue of the Rossmann fold corresponding to the residue that is Gln82, Tyr254, and / or Asn304 in the wild type sequence of the M.HhaI methyltransferase (i.e. the sequence having the NCBI accession number P05102). In some embodiments, the methyltransferase may be any bacterial cytosine C5 methyltransferase enzyme comprising an alanine residue in the position of the amino acid residue of the Rossmann fold corresponding to the residue that is Gln82, Tyr254, and / or Asn304 in the wild type sequence of the M.HhaI methyltransferase (i.e. the sequence having the NCBI accession number P05102).
[0217] In some embodiments, the methyltransferase may be an M.MpeI methyltransferase.
[0218] The M.MpeI methyltransferase enzyme has been found to be particularly advantageous for use in the disclosed method due to non-selectively targeting any and all CpG dinucleotides for modification.
[0219] In some embodiments, the methyltransferase may be, or may comprise, a variant, and / or fragment of the wild type M.MpeI sequence, which is defined as the sequence having the NCBI accession number BAC44284.
[0220] In some embodiments, the methyltransferase may be, or may comprise, a variant, and / or fragment of the wild type M.MpeI sequence, comprising at least 80% sequence identity, such as at least 85%, 90%, or 95% sequence identity to the wild type M.MpeI sequence.
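As an illustration only, the sequence identity thresholds recited above could be checked along the following lines. This is a minimal sketch: it assumes the variant and wild type sequences have already been aligned (gaps written as "-"), and it assumes identity is counted over all aligned columns, which is one common convention among several. The function names are hypothetical and no particular alignment tool is implied by the source.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: columns where both sequences carry a gap are skipped, and
    every other column (including single-gap columns) counts towards the
    denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no residue in either sequence
        compared += 1
        if a == b:
            matches += 1  # identical residues (neither can be a gap here)
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


def meets_identity_threshold(aligned_variant: str,
                             aligned_wild_type: str,
                             threshold: float = 80.0) -> bool:
    """True if the variant reaches the chosen identity threshold (80% by default)."""
    return percent_identity(aligned_variant, aligned_wild_type) >= threshold
```

The `threshold` argument can be raised to 85, 90 or 95 to reflect the narrower ranges mentioned in the text.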
[0221] In some embodiments, the methyltransferase may comprise one or more, such as 2, 3, 4, 5, 6 or 7 amino acid substitutions relative to the wild type M.MpeI sequence having the NCBI accession number BAC44284. In some embodiments, the use of the methyltransferase enzyme to apply the tag to unmodified nucleotides, such as unmodified CpG dinucleotides, may be carried out under conditions which enable the methyltransferase to transfer the tag from the methyltransferase cofactor analogue to the target DNA. In some embodiments, the reaction mixture may be incubated at a temperature of from 10 to 60°C, from 20 to 50°C, or from 30 to 40°C. Preferably, the reaction mixture may be incubated at a temperature of about 37°C.
[0222] In some embodiments, the incubation may be performed for a time sufficient to enable transfer of the tag to all of the available unmodified nucleotides, such as unmodified CpG dinucleotides, in the fragmented DNA sample.
[0223] The incubation may be performed for a period of 5 minutes to 5 hours, 10 minutes to 4 hours, 15 minutes to 3 hours, 30 minutes to 2 hours, or 40 to 90 minutes. Preferably the incubation is performed for a period of about 1 hour. In some embodiments, the incubation may be performed in a suitable buffer at a pH that is selected based on the methyltransferase that is being used. For example, the pH may be between 7.5 and 8.5, such as between 7.8 and 8.2, or about pH 8.
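The nested ranges recited for incubation temperature (in the preceding paragraph), time and pH can be captured in a small lookup. The sketch below is illustrative only: the constant and function names are hypothetical, and treating the range endpoints as inclusive is an assumption not stated in the source.

```python
# Ranges copied from the text, listed from broadest to narrowest.
TEMPERATURE_C = [(10, 60), (20, 50), (30, 40)]
INCUBATION_MIN = [(5, 300), (10, 240), (15, 180), (30, 120), (40, 90)]
PH = [(7.5, 8.5), (7.8, 8.2)]


def narrowest_range(value, ranges):
    """Return the narrowest recited range containing value, or None.

    Assumption: endpoints are treated as inclusive.
    """
    containing = [r for r in ranges if r[0] <= value <= r[1]]
    if not containing:
        return None
    return min(containing, key=lambda r: r[1] - r[0])
```

For example, the preferred conditions of about 37°C, about 1 hour and about pH 8 each fall within the narrowest recited range, whereas a 70°C incubation falls outside every recited labelling range.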
[0224] Methyltransferase inactivation
After an appropriate incubation to label the unmodified nucleotides, such as unmodified CpG dinucleotides, in the sample with a tag, the presence of methyltransferase in the subsequent processing of the sample has been found to reduce the efficiency of the method. Thus, the method comprises the inactivation of the methyltransferase enzyme.
[0225] In some embodiments, methyltransferase enzymes have been found to bind tightly to DNA, thereby inhibiting downstream processing of the DNA. Methods comprising removal of the methyltransferase or purification of the sample have been found to reduce the efficiency of the process due to the additional time and reagents required and due to the loss of sample.
[0226] Thus, in some embodiments, the methyltransferase enzyme may be inactivated in the reaction mixture. The inactivation of the methyltransferase in this way has surprisingly been found to provide significant processing efficiencies in the labelling method.
[0227] The terms “inactivated” and “inactivation” as used herein, unless otherwise specified, refer to any alteration in the structure and / or function of the methyltransferase that prevents further activity of the methyltransferase on the target polynucleotide. Thus, the terms “inactive” and “inactivated” as used herein, unless otherwise specified, refer to an enzyme that has less than 10%, such as less than 5%, less than 2%, or preferably less than 1% of its maximum activity.
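The definition above amounts to a threshold on residual activity relative to maximum activity. A minimal sketch of that comparison follows; the function names are hypothetical, and how activity is actually measured (assay, units) is not specified here and is left to the caller.

```python
def residual_fraction(measured_activity: float, maximum_activity: float) -> float:
    """Residual activity as a fraction of the enzyme's maximum activity."""
    if maximum_activity <= 0:
        raise ValueError("maximum activity must be positive")
    if measured_activity < 0:
        raise ValueError("measured activity cannot be negative")
    return measured_activity / maximum_activity


def is_inactivated(measured_activity: float,
                   maximum_activity: float,
                   threshold: float = 0.10) -> bool:
    """True if residual activity is strictly below the threshold.

    The default of 0.10 mirrors the broadest definition in the text
    (less than 10%); 0.05, 0.02 or 0.01 give the narrower definitions.
    """
    return residual_fraction(measured_activity, maximum_activity) < threshold
```

Both activities must be expressed in the same units, since only their ratio is used.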
[0228] Alterations in the structure and / or function of the methyltransferase that prevent further activity of the methyltransferase on the target polynucleotide may include, for example, denaturation, modification, inhibition, and / or fragmentation of the methyltransferase. Thus, in some embodiments, inactivation of the methyltransferase may comprise denaturation of the methyltransferase. In some embodiments, inactivation of the methyltransferase may comprise modification of the methyltransferase. In some embodiments, inactivation of the methyltransferase may comprise inhibition of the methyltransferase. In some embodiments, inactivation of the methyltransferase may comprise fragmentation of the methyltransferase.
[0229] The methyltransferase may be inactivated by any suitable method. Suitable methods include changing the environmental conditions of the methyltransferase, and targeted inactivation of the methyltransferase.
[0230] Changing the environmental conditions may consist of or comprise, for example, changing the temperature and / or pH of the reaction mixture.
[0231] Thus, in some embodiments, inactivation of the methyltransferase may comprise incubation of the reaction mixture at a temperature of from 55 to 85°C, from 60 to 80°C, or from 65 to 75°C. In some embodiments, inactivation of the methyltransferase may comprise incubation of the reaction mixture at a temperature of from 55 to 65°C, such as at or about 60°C, or from 75 to 85°C, such as at or about 80°C. In some embodiments, the inactivation of the methyltransferase may comprise incubation at an elevated temperature for a period of 5 minutes to 1 hour, or 10-30 minutes. Preferably inactivation of the methyltransferase may comprise incubation at an elevated temperature for about 15 minutes.
[0232] Targeted inactivation of the methyltransferase may comprise the addition of an agent to alter the structure and / or function of the methyltransferase. Such an agent may comprise, for example, a methyltransferase inhibitor. Any suitable methyltransferase inhibitor may be used, including, for example, 5-azacitidine, decitabine, clofarabine, arsenic trioxide, guadecitabine, RX-3117, 5-fluoro-2’-deoxycytidine, 5,6-dihydro-5-azacytidine, cladribine, fludarabine, fazarabine, procaine, EGCG, hydralazine, genistein, equol, curcumin, disulfiram, resveratrol, and / or caffeic acid. The methyltransferase inhibitor may be an S-adenosyl-L-methionine (SAM) analogue, such as sinefungin or S-adenosyl-L-homocysteine (SAH). In some embodiments, the methyltransferase may be inactivated in the reaction mixture after an appropriate incubation to label the unmodified nucleotides, such as unmodified CpG dinucleotides, with a tag, thereby terminating the derivatisation reaction. Advantageously, the presence of the inactivated methyltransferase has not been found to be detrimental to subsequent processing. On the contrary, the inactivation of the methyltransferase enzyme in the reaction mixture in this way, rather than by removal or dilution, has been found to provide increased efficiencies and significantly improved yields in subsequent steps of the process.
[0233] The inactivation of the methyltransferase may, therefore, provide significant advantages by removing the requirement for purification of the polynucleotide at this stage, and permitting the efficient combination of the methyltransferase and library preparation processes in a single reaction mixture. These efficiency advantages are shown in the Examples.
[0234] Library preparation
After inactivation of the methyltransferase, the polynucleotide sample may be modified into a form that is compatible with high throughput sequencing. This process may be referred to as “preparing a sequencing library” or “library preparation”.
[0235] As used herein, unless otherwise indicated, a “sequencing library” refers to a plurality of polynucleotides, each comprising a sequencing adaptor, such as a sequencing adaptor arranged for use in next generation sequencing. Accordingly, “preparing a polynucleotide sample into a sequencing library”, as used herein, unless otherwise indicated, refers to the addition of one or more sequencing adaptors to the polynucleotides of the sample. Thus, the process of preparing a polynucleotide sample into a sequencing library may comprise the ligation of one or more sequencing adapters to the polynucleotides.
[0236] In addition to adaptor ligation, the process of preparing a polynucleotide sample into a sequencing library may comprise end repair and / or A-tailing of the polynucleotides.
[0237] Preferably the process of preparing a sequencing library does not comprise combining the polynucleotides together to form an extended ligated polynucleotide for use, for example, in a sequencing method comprising nanopore technology. The reaction mixture comprises a buffer mixture such as the labelling buffer, together with inactive methyltransferase, excess cofactor analogue, and the polynucleotide sample. In general, for sequencing applications, library preparation is typically conducted with purified DNA. It has surprisingly been found by the inventors, however, that the library preparation process may be performed directly in the reaction mixture following methyltransferase inactivation and that the efficiency of the library preparation process is not compromised by the use of a different buffer, or the presence of unpurified sample, such as DNA, and / or residual enzyme and cofactor components in the mixture. This finding provides a significant advantage over previous methods, offering significant efficiencies in terms of savings of time and reagents. In particular, the finding that any washing procedure may be avoided significantly preserves the level of polynucleotide sample present in the reaction mixture.
[0238] Thus, in such embodiments, the preparation of a sequencing library is preferably performed in a one-pot approach. Thus, in some embodiments, the method comprises the inactivation of the methyltransferase followed by library preparation without any intervening steps or clean-up process, for example, comprising removal of inactivated enzymes, exchange of reaction buffer, or isolation or purification of the polynucleotide sample. Performing library preparation at this stage, for example, prior to any enrichment process, and without the requirement for any washing steps or clean-up of the sample, surprisingly provides significant processing efficiencies, including significantly reducing any loss of polynucleotide sample.
[0239] In some embodiments, the sample may be subjected to a library preparation process comprising end repair of the polynucleotide sample.
[0240] In some embodiments, the end repair process may comprise removal of 3' overhangs, for example using a Klenow fragment-based enzyme. The end repair process may also comprise modifying 3’ ends as necessary to comprise a hydroxyl group.
[0241] In some embodiments, the end repair process may additionally or alternatively fill 5' overhangs, for example, using a T4 DNA polymerase. The end repair process may also comprise phosphorylation of 5' ends where necessary, for example, using a T4 polynucleotide kinase (PNK). In some embodiments, the polynucleotide sample may be subjected to a library preparation process comprising A-tailing.
[0242] In some embodiments, the A-tailing process may comprise the addition of an adenosine residue to the 3' ends of the polynucleotide sample. This process may reduce the possibility of the polynucleotides in the sample ligating to each other. The A-tailing process may also increase the rate of adapter ligation, particularly in embodiments in which the adapters comprise a thymine overhang. The A-tailing process may comprise the use of an “exo-Klenow” enzyme.
[0243] In some embodiments, the polynucleotide sample may be subjected to end repair and A-tailing processes simultaneously. For example, an end repair and A-tailing buffer comprising end repair and A-tailing enzymes may be used. In some embodiments, the polynucleotide sample may be subjected to a library preparation process comprising one or more “adapter ligation” processes comprising the ligation of sequencing adapters to the polynucleotides in the sample.
[0244] The term “adapter” as used herein, unless otherwise specified, refers to a short nucleic acid (such as less than about 500, less than 100, or less than about 50 nucleotides in length) that is typically at least partially double-stranded and is attached to either or both ends of a nucleic acid molecule. The adapters may include a primer binding site for amplification of the sample. The adapters may include a primer binding site for sequencing applications, such as next-generation sequencing (NGS) applications. The adapters may include a binding site for capture probes, such as an oligonucleotide attached to a flow cell support. A plurality of adapters of the same or different sequences may be attached to the polynucleotides in the sample.
[0245] In some embodiments, the ligated adapters may include a nucleic acid tag. The nucleic acid tag may be positioned relative to an amplification primer and / or sequencing primer binding site, such that the tag sequence is included in subsequent amplicons and sequence reads. In some embodiments, a plurality of adapters having the same sequence apart from different nucleic acid tags may be attached to the polynucleotides in the sample. In some embodiments, the ligated adapters may include a barcode that can be introduced at one or both ends of the sample DNA molecule. A “barcode”, “indexing barcode”, or “molecular barcode” as used herein, unless otherwise stated, refers to a nucleic acid molecule comprising a sequence that can serve as a molecular identifier. A barcode may be a type of nucleic acid tag. For example, individual "barcode" sequences may be added to the polynucleotides in the sample for use in next-generation sequencing (NGS) so that the sequencing read can be identified and sorted before the final data analysis. In some embodiments, the adapter ligation process may comprise the ligation of sequencing adapters to the polynucleotides in the sample. The adapter ligation process may comprise the use of a T4 DNA ligase. Advantageously, any sequencing adapters may be used. Sequencing adapters that have been found to be particularly suitable for use in the disclosed method include, for example, any sequencing adapters suitable for use with high throughput sequencing methods, such as sequencing applications on the Illumina platform. The method may comprise the use of a double-stranded indexing and unique dual indexing (UDI) adapter that enables efficient ligation and identification of PCR amplification replicates in the sequencing dataset. Other adapters may also be used, such as hairpin adapters.
[0246] In some embodiments, the adapter ligation processes may comprise the ligation of adaptors that do not comprise indexing barcodes, and such adaptors may be referred to herein as “stubby adapters”. Barcodes may be applied to one or both ends of the polynucleotides as part of the library preparation process. In addition, or alternatively, barcodes may be added to one or both ends of the polynucleotides in a separate amplification step. In preferred embodiments, the method does not comprise an amplification step as part of the library preparation process.
[0247] In some embodiments, the adapter ligation processes may comprise the ligation of Y adapters.
[0248] As used herein, unless otherwise stated, the term “Y adapter” refers to a polynucleotide sequence that may be annealed to the 5' and / or 3' end of a polynucleotide in a sequencing library. When annealed to both the 5’ and 3’ ends of the polynucleotide, the Y adapter allows different, noncomplementary, sequences to be added to the 5' and 3' ends of the library. The arms of the Y adapter comprise different sequences that may be non-complementary, and the stem of the Y adapter, that is arranged to be ligated to the polynucleotide of interest, comprises double-stranded (i.e., complementary) DNA.
[0249] Labelling of derivatised DNA
[0250] A methyltransferase is used to bind a tag to unmodified target nucleotides in the polynucleotide sample. Tags on the polynucleotides in the sample may be modified by the addition of an affinity label. The affinity label may be referred to as an “affinity label” or “label” when bound to the tag and an “affinity label precursor” beforehand.
[0251] It has been found that an affinity label may be added to the tag by the addition of the affinity label precursor to the reaction mixture. The finding that an affinity label may be applied to the tag in this technically simple and efficient manner is advantageous in view of the fact that the reaction mixture comprises various components including inactive methyltransferase, excess cofactor analogue, and the reagents and enzymes required for library preparation. The finding that the affinity label may be added to the tag in this way provides a significant advantage over previous methods, by avoiding the requirement for a washing step, thereby providing efficiency savings in terms of time and reagents and avoiding any loss of sample.
[0252] Thus, in some embodiments, the method comprises the addition of an affinity label to the tag after library preparation, without any intervening steps or clean-up process, for example, comprising removal of peptides or enzymes, exchange of reaction buffer, or isolation or purification of the sample.
[0253] In some embodiments, the affinity label may comprise biotin.
[0254] In some embodiments, the affinity label precursor may be a compound of formula (II): R9-L4-R10 (II)
[0255] wherein: R9 is a reactive moiety configured to react with a group in the tag and to thereby form a bond therebetween; L4 is a linker; and
[0256] R10 comprises or consists of biotin. R9 may be an optionally substituted 5 to 30 membered heterocyclyl, an optionally substituted 5 to 30 membered heteroaryl, an optionally substituted C6-30 membered aryl or an optionally substituted C3-30 cycloalkyl.
[0257] A multicyclic group may be understood to be a group comprising two or more fused rings. Accordingly, a multicyclic group may have 2 or 3 fused rings.
[0258] As used herein, a “heterocyclyl”, “heterocyclic” or “heterocycle” group includes non-aromatic saturated or partially saturated mono and multicyclic groups. A heterocyclic ring contains 1 or more heteroatoms in the ring, which may be independently selected from nitrogen, oxygen or sulfur. A multicyclic group may be understood to be a multicyclic heterocyclyl group if it contains at least one heteroatom and at least one ring which is a non-aromatic saturated or partially saturated ring.
[0259] As used herein, a “cycloalkyl” group includes non-aromatic saturated or partially saturated mono and multicyclic groups. A multicyclic group may be understood to be a multicyclic cycloalkyl group if it only contains carbon atoms in the rings and it contains at least one ring which is a non-aromatic saturated or partially saturated ring.
[0260] As used herein, a “heteroaryl” group includes aromatic mono and multicyclic groups. A heteroaryl ring contains 1 or more heteroatoms in the ring, which may be independently selected from nitrogen, oxygen or sulfur. A multicyclic group may be understood to be a multicyclic heteroaryl group if it contains at least one heteroatom and every ring is aromatic.
[0261] Preferably, R9 contains a triple bond.
[0262] Preferably, R9 is an optionally substituted 10 to 20 membered multicyclic heterocyclyl, an optionally substituted 10 to 20 membered multicyclic heteroaryl or an optionally substituted C10-20 multicyclic cycloalkyl. R9 may be a 14 to 18 membered multicyclic heterocyclyl, an optionally substituted 14 to 18 membered multicyclic heteroaryl or an optionally substituted C13-18 multicyclic cycloalkyl.
[0263] In a preferred embodiment, , wherein X2 is N or CH. Preferably, X2 is N.
[0264] L4 may comprise between 1 and 12 groups, each group selected from an optionally substituted hydrocarbon, an optionally substituted polyether chain, NH, O, S or S-S.
[0265] The hydrocarbon may be an optionally substituted alkylene, preferably an optionally substituted C1-10 alkylene and more preferably a C1-5 alkylene. The alkylene may be substituted with an OH or oxo group. Preferably, the alkylene is substituted with an oxo group.
[0266] The optionally substituted polyether chain may be an optionally substituted polyethylene glycol chain. The polyethylene glycol chain may comprise up to 15 monomers, up to 10 monomers or up to 5 monomers of ethylene glycol.
[0267] Accordingly, L4 may have the structure
[0268] -L5-L6-L7-L8-*, wherein L5 to L8 are each independently absent or an optionally substituted hydrocarbon, an optionally substituted polyether chain, an NH, O, S or S-S; and an asterisk indicates a point of bonding to R10.
[0269] In some embodiments, L5 is an optionally substituted hydrocarbon. Accordingly, L5 may be COCH2CH2. In some embodiments, L6 is NH.
[0270] In some embodiments, L7 is an optionally substituted hydrocarbon. Accordingly, L7 may be COCH2CH2. In some embodiments, L8 is an optionally substituted polyether chain. The optionally substituted polyether chain may be an optionally substituted polyethylene glycol chain. The polyethylene glycol chain may comprise up to 15 monomers, up to 10 monomers or up to 5 monomers of ethylene glycol. Accordingly, L8 may be (OCH2CH2)r, where r is an integer between 1 and 15, more preferably between 2 and 10 or between 3 and 5. In some embodiments, r is 4. Accordingly, L4 may
[0271] L4 may have no charge.
[0272] Negatively charged linkers have been found to react poorly with the tag. Preferably, L4 is not negatively charged.
[0273] R10 may have the following formula:
[0274] R11-(CH2)s-L9- wherein R11 is biotin; s is an integer between 1 and 8; and
[0275] L9 is absent or is COO or CONH.
[0276] Sulfonated linkers have been found to react particularly poorly with the tag. Preferably, L4 and R10 are not sulfonated.
[0277] Preferably, the affinity label precursor is not DBCO-SS-biotin. Preferably, the affinity label precursor is not NHS-SS-biotin. A modified tagged cytosine residue, which comprises the affinity label, may be understood to have the following structure: wherein L4 and R10 are as defined above; and L10 is a linker. L10 may be understood to be -CH2-U-[L3]m-[HM]n-[L2]p-L11-, wherein U, L2, L3, HM, m, n and p are as defined above and L11 is a linker formed due to a reaction between the R6 and R9 groups.
[0278] Accordingly, L11 may asterisk indicates a point of bonding to L4 and X2 is as defined above.
[0279] The present inventors have surprisingly found that in previous methods, such as that described by Kriukiene et al. (Nature Communications 2013, 4:2190), DNA fragments having a significant density of CpG sites, such as, for example, 5 or more CpG sites per 100 bp, may be underrepresented in the sequencing reads, thereby introducing bias to the results.
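To make the density figure concrete, the number of CpG sites per 100 bp of a fragment can be computed as in the sketch below. This is illustrative only: the function names are hypothetical, counting is done on the given strand of a plain nucleotide string, and the 5 per 100 bp cut-off is simply the example value quoted above.

```python
def cpg_sites_per_100bp(fragment: str) -> float:
    """CpG dinucleotides per 100 bp of the given fragment sequence."""
    seq = fragment.upper()
    if not seq:
        raise ValueError("fragment must not be empty")
    # "CG" cannot overlap itself, so str.count gives the number of CpG sites.
    return 100.0 * seq.count("CG") / len(seq)


def is_cpg_dense(fragment: str, threshold: float = 5.0) -> bool:
    """True if the fragment has at least `threshold` CpG sites per 100 bp."""
    return cpg_sites_per_100bp(fragment) >= threshold
```

A fragment flagged by `is_cpg_dense` would be one of those described above as prone to under-representation in the earlier methods.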
[0280] An advantage of the disclosed method is that if a purification process, such as DNA isolation, is performed at this point, it is the only clean-up step for the entire process, and this has been found to dramatically improve the efficiency and sensitivity of the process. This is made possible, firstly, by the inactivation of the methyltransferase and, secondly, the surprising finding that the enzymes used for library preparation exhibit high levels of activity in the resulting buffers, which are significantly different to the buffer mixtures designed for use in library preparation.
[0281] Thus, in some embodiments, the method may comprise, after the affinity labelling step (step (iv)), and before the fractionation step (step (v)), a step of purifying the polynucleotide. Preferably the method involves no more than one step of purifying the polynucleotide.
[0282] Any suitable method for purifying the polynucleotide may be used. For example, DNA may be purified using a DNA purification kit. The DNA may be washed, for example, using ethanol, such as 80% ethanol, or other DNA washing buffer. After washing, the DNA may be eluted, for example, using a suitable elution buffer, such as phosphate buffer.
Fractionation
[0283] The method comprises fractionating the derivatised polynucleotides of the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising a label, and the second fraction is enriched for polynucleotides lacking a label.
[0284] Fractionation is performed after the labelling of derivatised polynucleotides (step iv).
[0285] The labelled polynucleotides may be isolated and separated from the unlabelled polynucleotides with high efficiency.
[0286] In some embodiments, fractionation may comprise the use of the affinity label, such that polynucleotides comprising an affinity label are separated, using the affinity label, from polynucleotides lacking an affinity label. As a result, the first fraction is enriched for polynucleotides comprising an affinity label, in the sense that substantially or entirely all of the polynucleotides in the first fraction comprise an affinity label.
[0287] Similarly, the second fraction is enriched for polynucleotides lacking an affinity label, in the sense that substantially or entirely all of the polynucleotides in the second fraction lack an affinity label. The fractionation may comprise selectively isolating the labelled polynucleotides using the affinity label. For example, fractionation of the polynucleotides may comprise binding of the affinity label to a capture probe that specifically binds to the affinity label. In embodiments in which the affinity label comprises biotin, fractionation of the polynucleotides may comprise selectively isolating the labelled polynucleotides using a biotin-binding protein.
[0288] The biotin-binding protein may comprise, for example, streptavidin, avidin, and / or a biotin-specific antibody.
[0289] The biotin-binding protein may comprise streptavidin, or a functional analogue or derivative of streptavidin. The biotin-binding protein may comprise a separation medium or substrate. For example, the biotin-binding protein may be conjugated to a surface. The surface may comprise a plurality of microbeads, such as paramagnetic microbeads. In embodiments in which the separation medium or substrate comprises a plurality of microbeads coated with biotin-binding protein, the method may comprise binding the labelled polynucleotides to the biotin-binding protein on the coated microbeads and then isolating the coated microbeads. Isolation of the microbeads may be performed by centrifugation. In embodiments comprising the use of paramagnetic microbeads, isolation of the microbeads may comprise the application of a magnetic field to the reaction mixture to separate the beads from the remainder of the suspension.
[0290] For example, in some embodiments, the biotin-binding protein may comprise streptavidin conjugated to the surface of microbeads. Preferably the streptavidin-coated microbeads may be streptavidin-coated paramagnetic microbeads.
[0291] After an appropriate incubation to bind the labelled polynucleotides to the capture probe, the probe may be washed to remove unbound and non-specifically bound polynucleotides. In embodiments in which the affinity label comprises biotin, the biotin-binding protein may be washed by any suitable method to remove unbound and non-specifically bound polynucleotides.
[0292] After the selective isolation of the labelled polynucleotides, the polynucleotides are separated from the capture probe.
[0293] In previous methods, such as that described by Kriukiene et al. (Nature Communications 2013, 4:2190), DNA fragments are released from a streptavidin capture agent using oxidative cleavage of a disulfide bond within the affinity label. However, this method has been found by the present inventors to be inconsistently reproducible and to have poor efficiency.
[0294] In the disclosed method, the polynucleotides are preferably not separated from the capture probe by a method comprising oxidative cleavage of the tag or affinity label. Thus, in some embodiments, the method does not comprise the separation of the polynucleotides from the capture probe by cleavage, such as oxidative cleavage or hydrolysis, of the tag or affinity label. In some embodiments, the method may comprise the denaturation of the capture probe. For example, in embodiments in which the capture probe comprises a biotin-binding protein, the method may comprise the denaturation of the biotin-binding protein. This method has been found to be particularly advantageous due to the consistent release of DNA fragments regardless of the number of the attached affinity labels.
[0295] The ability of streptavidin to bind to biotin is dependent on both a sterically defined binding pocket and the highly polar residues within it. Any agent that induces a conformational change of streptavidin may, therefore, be used to release the labelled polynucleotides. The inventors have found that, in embodiments in which the biotin-binding protein comprises streptavidin, the labelled polynucleotides may be released from the streptavidin by any method that denatures streptavidin without damaging the polynucleotides. In some embodiments, the labelled polynucleotides may be released from the streptavidin by incubation in pure water at a temperature of about 70°C.
[0296] In some embodiments, the labelled polynucleotides may be released from the streptavidin by incubation in 12-15% (v / v) phenol at room temperature.
[0297] In some embodiments, streptavidin may be denatured using a denaturing reagent, such as 1% sodium dodecyl sulphate, and heating the sample to 90°C.
[0298] Advantageously, because the affinity label is not damaged by this method comprising the denaturation of streptavidin, the first fraction may be further enriched for polynucleotides comprising an affinity label by repeating the selective isolation (affinity purification) step in one or more further cycles.
[0299] Amplification of derivatised DNA
The inventors have surprisingly found that, due to the efficiency of the disclosed method, amplification of the input sample is not required. Nevertheless, the inventors have surprisingly found that standard DNA polymerases are able to amplify polynucleotides that have been tagged and, in some embodiments, affinity labelled using the disclosed approach. Indeed, it has advantageously been found that DNA polymerases are able to amplify DNA comprising a plurality of affinity labels at a high density, without bias. The derivatised polynucleotides may, therefore, be amplified and sequenced without further modification. Moreover, amplification has advantageously been found to be possible under the conditions employed in the preceding steps of the labelling method. This negates the need for DNA purification prior to amplification and provides the possibility of performing the method as a one-pot approach, and automating the method. Thus, in some embodiments, the polynucleotides may be amplified:
[0300] (a) after step (iv);
[0301] (b) after step (v) and before step (vi);
[0302] (c) after step (vi) ; and / or
[0303] (d) after step (vii).
[0304] Amplification may be performed by any suitable method. For example, in some embodiments, the adapters that are ligated to the polynucleotides during library preparation may comprise a primer binding site that may be used for the binding of an amplification primer. The polynucleotides may be amplified by PCR or qPCR, for example, using primers designed to anneal within the ligated sequencing adapters, and an appropriate PCR program.
[0305] Amplification may be used to simultaneously introduce indexing barcodes to the polynucleotides. Indexing barcodes may be added to the polynucleotides of the first and second fractions for use in next-generation sequencing (NGS) so that the sequencing reads for each fraction can be identified and sorted before the final data analysis.
[0306] The use of indexing in Next Generation Sequencing, such as Illumina sequencing, allows DNA to be tracked at the single molecule level. This can be particularly advantageous when working at low DNA concentrations, when using a targeted sequencing approach, such as a liquid panel, and when using PCR-based amplification of the genome, which can lead to the over-representation of individual DNA molecules in the final sequencing experiment. Indexes allow reads to be properly quantified and replicated reads to be removed or combined bioinformatically. The indexing barcodes may comprise unique dual indexing (UDI) barcodes. The use of UDIs advantageously enables efficient ligation and identification of PCR amplification replicates in the sequencing dataset. Dual-indexed libraries comprise distinct, unrelated indexing barcodes at each end of the polynucleotides, thereby providing uniquely tagged libraries, and thus reducing demultiplexing noise resulting, for example, from index hopping.
[0307] In some embodiments, the indexing barcodes may alternatively, or preferably additionally, comprise unique molecular identifiers (UMIs). UMIs comprise a unique barcode comprising, for example, a random, short (e.g. 6-15) nucleotide sequence. One or more UMIs may be applied to each molecule within the sequencing library, to provide error correction and increased accuracy during sequencing. By incorporating individual barcodes on each original DNA fragment, variant alleles present in the original sample (true variants) can be distinguished from errors introduced, for example, by PCR methods, during library preparation, target enrichment, or sequencing.
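The way UMIs separate true variants from amplification or sequencing errors can be illustrated with a minimal sketch. It assumes that reads have already been aligned and that each read is available as a (position, UMI, sequence) tuple; the function name `collapse_by_umi`, the tuple layout, and the simple majority-vote consensus are illustrative assumptions and are not taken from the disclosed method.

```python
from collections import Counter, defaultdict


def collapse_by_umi(reads):
    """Group reads by (position, UMI) and return one consensus sequence per group.

    `reads` is an iterable of (position, umi, sequence) tuples. All sequences
    within one group are assumed to have the same length (hypothetical input format).
    """
    groups = defaultdict(list)
    for position, umi, sequence in reads:
        groups[(position, umi)].append(sequence)

    consensus = {}
    for key, sequences in groups.items():
        # Majority vote at each base position across the reads of one UMI family.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


reads = [
    (100, "ACGTAC", "TTGCA"),
    (100, "ACGTAC", "TTGCA"),
    (100, "ACGTAC", "TTGAA"),  # one read carries an error at the fourth base
    (100, "GGATCC", "TTGTA"),  # different UMI, so a distinct original molecule
]
families = collapse_by_umi(reads)
```

In this toy input, the three reads sharing one UMI collapse to a single consensus in which the isolated mismatch is voted out, whereas the read with a different UMI is retained as a separate molecule whose differing base is a candidate true variant.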
[0308] Advantageously, the use of UMIs can reduce the rate of false-positive variant calls and increase the sensitivity of variant detection. The proportion of different variants in the sample, reflecting, for example, the mutation rate, can thus be accurately quantified. Since each nucleic acid in the starting material is tagged with a unique molecular barcode, bioinformatics software can filter out duplicate reads and PCR errors with a high level of accuracy and report unique reads, removing the identified errors before final data analysis. The use of UMIs has been found to be particularly advantageous in embodiments in which target enrichment is performed on the pooled fraction, as this allows identification of an individual polynucleotide in the output of the sequencing experiment. This provides reliable quantification of reads and the prevention / removal of spurious results by allowing identification of PCR duplicates in the sequencing reads.
First (labelled) fraction
[0309] First and second subsets are formed from the first fraction. Any suitable method may be used to divide, separate, or otherwise produce first and second subsets from or of the first fraction.
[0310] The composition of each of the first and second subsets is substantially identical and representative of the composition of the first fraction. Thus, the proportions of constituent components within each subset are preferably substantially identical and the same as those of the first fraction.
[0311] In some embodiments, the first and second subsets may be equal or substantially equal in size. In some embodiments, the first subset may consist of or comprise about 20-80% of the first fraction, such as about 30-70% or 40-60%, of the first fraction. In some embodiments, the first subset may consist of or comprise about 50% of the first fraction. In some embodiments, the second subset may consist of or comprise about 20-80% of the first fraction, such as about 30-70% or 40-60%, of the first fraction. In some embodiments, the second subset may consist of or comprise about 50% of the first fraction. In some embodiments, the first and second subsets may be unequal in size. Indeed, in some embodiments, the first and / or second subset may consist of or comprise a small sample that is removed from the first fraction. In some embodiments, the first subset may consist of or comprise less than 20% of the first fraction, such as less than 10%, less than 5%, or less than 1% of the first fraction. Likewise, in some embodiments, the second subset may consist of or comprise less than 20% of the first fraction, such as less than 10%, less than 5%, or less than 1% of the first fraction.
[0312] In some embodiments, the first subset may be sequenced and the sequencing information used to determine the modification status of nucleotide residues in the polynucleotide sample. The first subset is representative of the first fraction and as such substantially or entirely all of the polynucleotides in the first subset comprise, or are amplified from polynucleotides that comprised, an affinity label. The sequence data obtained from the first subset may be used to determine the modification status of nucleotide residues in the polynucleotide sample. In particular, the sequence data obtained from the first subset may be used to generate an unmethylome profile of the polynucleotide sample.
[0313] Using the disclosed method, and as demonstrated in the Examples, unmodified nucleotides, such as unmodified CpG dinucleotides, in the polynucleotide sample are derivatised in such a way that they may be isolated and subsequently sequenced without bias. Thus, the first subset may be used with extremely low sample quantities to provide a highly accurate epigenetic profile of the polynucleotide sample.
[0314] Moreover, due to the efficiency of the disclosed method, amplification of the input sample is advantageously not required, thereby avoiding the possibility of introducing artefacts and errors in the polynucleotide sequence as a result of amplification, and consequently improving the detection of true variant alleles present in the original sample.
[0315] The inventors have also surprisingly found that the presence of the disclosed affinity label on the polynucleotides advantageously does not interfere with the action of DNA polymerases. The derivatised polynucleotides may, therefore, be sequenced without further modification.
[0316] Based on this surprising finding, and the non-destructive nature of the disclosed epigenetic profiling method, the inventors have developed the disclosed method which provides simultaneous epigenetic and genetic mutation profiling of a polynucleotide sample.
[0317] In some embodiments, the first subset may be subjected to target enrichment to selectively isolate specific genomic regions of interest for subsequent analysis.
[0318] Second (unlabelled) fraction
[0319] At least a portion of the second fraction may be combined with at least a portion of the second subset of the first fraction to form a pooled fraction.
[0320] The inventors have identified that, as a result of the second fraction being enriched for polynucleotides lacking an affinity label, when the polynucleotides of the second fraction are sequenced, poor coverage of the polynucleotide sample is obtained in regions comprising unmodified nucleotides. The second subset of the first fraction is representative of the first fraction, and as such, substantially or entirely all of the polynucleotides in the second subset comprise an affinity label. The pooled fraction, that is produced by combining portions of the second fraction and second subset, comprises polynucleotides from both the labelled and unlabelled fractions, and is representative of the initial sample as a whole. The production and sequencing of a pooled fraction has been found to allow the generation of sequencing information providing excellent and improved coverage of the polynucleotide sample as a whole.
[0321] In some embodiments, the method may comprise amplifying the polynucleotides of the second subset before forming the pooled fraction. In such embodiments, the method may further comprise determining the proportion of the initial polynucleotide sample that is present in the first fraction and adding a corresponding quantity of the amplified second subset to at least a portion of the second fraction to form a pooled fraction that has a polynucleotide content that is representative of the initial polynucleotide sample.
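The proportional pooling described above reduces to simple arithmetic, sketched below. The sketch assumes that quantities are expressed as DNA mass in nanograms and that the proportion of the initial sample recovered in the first fraction has already been measured; the function name `pooled_masses` and its parameters are hypothetical and chosen only for illustration.

```python
def pooled_masses(labelled_share, second_fraction_ng):
    """Return (second_subset_ng, total_ng) for a representative pooled fraction.

    labelled_share: proportion of the initial sample found in the first
        (labelled) fraction, as a value in [0, 1).
    second_fraction_ng: mass of the second (unlabelled) fraction added to the pool.
    Both the units and the parameter names are assumptions for this sketch.
    """
    if not 0 <= labelled_share < 1:
        raise ValueError("labelled_share must be in [0, 1)")
    # Keep labelled : unlabelled at labelled_share : (1 - labelled_share),
    # matching the composition of the initial sample.
    second_subset_ng = second_fraction_ng * labelled_share / (1 - labelled_share)
    return second_subset_ng, second_subset_ng + second_fraction_ng


subset_ng, total_ng = pooled_masses(labelled_share=0.2, second_fraction_ng=8.0)
```

For example, if 20% of the initial sample was recovered in the first fraction, roughly 2 ng of amplified second subset would be combined with 8 ng of the second fraction, so that labelled material again makes up about one fifth of the pool.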
[0322] In some embodiments, the pooled fraction does not comprise a single pooled polynucleotide sample but is a conceptual fraction representing a sequencing operation in which the second fraction and second subset are sequenced in parallel. In such embodiments, the second fraction and second subset may be sequenced in the same or in separate sequencing experiments. In embodiments in which target enrichment is performed on the pooled fraction, however, the pooled fraction preferably comprises a single polynucleotide sample formed by combining at least a portion of the second fraction and at least a portion of the second subset of the first fraction. The sequence data obtained from the pooled fraction encompasses the initial polynucleotide sample as a whole and thus may be used to provide a genetic mutation profile of the modified and unmodified genomic fractions.
[0323] In some embodiments, indexing barcodes are applied to the polynucleotides and may be used to distinguish the polynucleotides derived from the second fraction and second subset. In such embodiments, the sequence data obtained from the pooled fraction may be used to provide separate genetic mutation profiles for the modified and unmodified genomic fractions. Thus, the disclosed method advantageously may be used to identify genetic mutations that are associated with epigenetic changes. For example, mutations of cytosine to thymine may be more frequent at methylated cytosines, which can deaminate spontaneously. Knowing the rates of genetic mutations associated with epigenetically modified regions of the genome may be advantageous for the diagnosis of disease.
[0324] Target enrichment
Particular diseases and conditions may be associated with specific genomic regions. For this and other reasons, in some embodiments, target enrichment of the pooled fraction may be performed. Target enrichment may also be performed to amplify promoter or other regulatory regions that are subject to epigenetic modification, and thus, in some embodiments, target enrichment of the first subset of the first fraction may be performed.
[0325] Target enrichment is used to describe a variety of strategies to selectively isolate specific genomic regions of interest for sequencing analysis. Any suitable method for target enrichment may be used in the disclosed method. The most suitable approach may depend on the specific aim of the study. For example, embodiments in which the aim is to enrich genomic regions that have clinical relevance may require a more focused enrichment strategy. On the other hand, embodiments in which the aim is to discover novel variants that may be associated with a given phenotype may require an enrichment strategy that provides a balance between sequencing costs and target coverage. For example, the diagnosis of specific cancers may require the identification of somatic variants present at extremely low abundance in cfDNA or in mixtures of malignant and stromal cells, which may necessitate an increased depth of sequencing coverage rather than a broader genomic approach which may be economically impractical.
[0326] Any suitable method for target enrichment may be used.
[0327] In some embodiments, target enrichment may comprise an in-solution hybridization-based approach. For example, in some embodiments, target enrichment may comprise the use of biotinylated oligonucleotide “bait” probes to capture genomic regions of interest, for example, using streptavidin-coated magnetic beads. Suitably, bait probes may comprise 50-150 nucleotides. Thus, in some embodiments, target enrichment may comprise in-solution hybridization using oligonucleotide “bait” probes specific for genomic regions of interest. In some embodiments, the target enrichment may comprise a PCR-based enrichment method. For example, in some embodiments, target enrichment may comprise the use of specifically designed primers to amplify in parallel up to 250 target regions using PCR. Other target enrichment strategies that may suitably be used include multiplex extension ligation, molecular inversion probes (MIPs) / padlock probes, nested patch PCR, and selector probes.
[0328] Sequencing
The polynucleotides of the first subset, second fraction, and / or pooled fraction may be sequenced. Each fraction may be sequenced separately.
[0329] Using the disclosed method, and as demonstrated in the Examples, target unmodified nucleotides, such as unmodified CpG dinucleotides, in the polynucleotide sample are derivatised in such a way that they may be isolated and subsequently sequenced without bias. Thus, the first subset may be used with extremely low sample quantities to provide a highly accurate epigenetic profile of the polynucleotide sample.
[0330] The inventors have surprisingly found that the presence of the disclosed affinity label on the polynucleotides advantageously does not interfere with the action of DNA polymerases. The derivatised polynucleotides may, therefore, be amplified and sequenced without further modification. Based on this surprising finding, and the non-destructive nature of the disclosed method, the inventors have developed the disclosed method that may be used for determining both the status of nucleotide residues and the presence of a genetic mutation in a polynucleotide sample.
[0331] The pooled fraction is representative of the initial polynucleotide sample as a whole (e.g. the whole genome). Thus, sequencing the polynucleotides of the pooled fraction has been found to provide an accurate representation of mutation rates of the polynucleotide sample. The sequence data obtained from the pooled fraction may be used to produce an unbiased genetic mutation profile of the polynucleotide sample. In some embodiments, the polynucleotides of the first subset, second fraction, and / or pooled fraction may be purified prior to sequencing by any suitable method for cleaning up PCR products for use in a sequencing platform.
[0332] In some embodiments, the polynucleotides of the first subset, second fraction, and / or pooled fraction may be used directly for sequencing without further purification.
[0333] The term “sequencing” as used herein, unless otherwise indicated, refers to any method that may be used to determine the sequence (i.e. the order of nucleotides) in a nucleic acid such as DNA or RNA.
[0334] Any type of sequencing platform may be used to determine the sequences of the polynucleotides, in combination with the appropriately ligated sequencing adapter. Thus, sequencing approaches that may be suitable for use in the disclosed method include, but are not limited to, Sanger sequencing, high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore-based sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), next generation sequencing (NGS), Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Ion Torrent, Oxford Nanopore, Roche Genia, Maxam-Gilbert sequencing, primer walking, sequencing using Singular Genomics, Ultima Genomics, Element Biosciences, PacBio, SOLiD, Ion Torrent, or Nanopore platforms. Sequencing reactions can be performed in a variety of sample processing units, which may include multiple lanes, multiple channels, multiple wells, or other means of processing multiple sample sets substantially simultaneously.
[0335] The method may comprise a high throughput sequencing method. The terms “next generation sequencing”, “NGS”, and “high throughput sequencing” as used herein, refer to sequencing technologies having increased throughput as compared to traditional Sanger- and capillary electrophoresis-based approaches. The high throughput sequencing method may be capable of generating hundreds of thousands of sequence reads in parallel. The method may comprise a multiplex sequencing technique. The high throughput sequencing methods that may be used include, but are not limited to, sequencing by synthesis, sequencing by ligation, and sequencing by hybridization. The sequencing method may be capable of sequencing single molecules.
[0336] In some embodiments, the disclosed method comprises sequencing the polynucleotides using the Illumina platform.
[0337] In some embodiments, the disclosed method may comprise comparing the sequencing reads of the first subset and / or pooled fraction to a reference sequence. In some embodiments, the disclosed method may comprise comparing the sequencing reads of the first subset and / or pooled fraction to a reference genome to determine the genomic location of the sequencing reads.
[0338] In some embodiments, the reference sequence may be a human genome, or may comprise one or more portions thereof, such as one or more chromosomes and / or chromosomal regions.
[0339] In some embodiments, the disclosed method may comprise aligning the sequencing reads of the first subset and / or pooled fraction to the reference sequence. Any suitable alignment method compatible with high throughput sequencing data may be used, for example, using the Burrows Wheeler Alignment algorithm.
[0340] Aligned reads may be normalised. Any method for normalisation used in the art may be used. For example, normalising the sequencing reads may comprise reporting the number of reads in an aligned region (bin) as a fraction of reads per million reads of the sequencing output.
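The reads-per-million normalisation mentioned above can be sketched as follows. The example assumes that each aligned read is represented by a (chromosome, start) pair, that fixed-width bins are used, and that the total number of aligned reads supplied stands in for the sequencing output; the function name `reads_per_million` and the default bin size are illustrative assumptions.

```python
from collections import Counter


def reads_per_million(aligned_positions, bin_size=1000):
    """Count aligned reads per bin and scale the counts to reads per million (RPM).

    aligned_positions: iterable of (chromosome, start) pairs, one per aligned read.
    bin_size: width of each genomic bin in bases (an assumed, arbitrary default).
    """
    counts = Counter(
        (chrom, start // bin_size) for chrom, start in aligned_positions
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    # Scale each bin count by the total number of reads supplied.
    return {key: count * 1_000_000 / total for key, count in counts.items()}


positions = [("chr1", 150), ("chr1", 900), ("chr1", 1500), ("chr2", 20)]
rpm = reads_per_million(positions)
```

Because each bin is expressed relative to the total read count, values computed in this way can be compared between sequencing experiments of different depth.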
[0341] The disclosed method may comprise comparing the normalised read counts from two or more sequencing experiments. Such an approach may be advantageous in methods further comprising the step of diagnosing a disease based on the modification status of the nucleotide residues in the sample.
[0342] In a seventh aspect there is provided a method for preparing a profile of the modification status of nucleotide residues and / or any genetic mutations in the polynucleotide sample. In such embodiments, the method may further comprise comparing the sequences of the sequencing reads obtained using any of the fourth, fifth, or sixth aspects to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified nucleotide residues and / or genetic mutations at specific locations within the reference sequence, thereby generating a profile of the modification status of nucleotide residues and / or any genetic mutations in the polynucleotide sample.
[0343] Thus, in some embodiments, the method may further comprise comparing the sequences of the sequencing reads of the pooled fraction and / or the first subset to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified nucleotide residues and / or genetic mutations at specific locations within the reference sequence, thereby generating a profile of the modification status of nucleotide residues and / or any genetic mutations in the polynucleotide sample. In some embodiments, the method may further comprise comparing the sequences of the sequencing reads of the pooled fraction to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of genetic mutations at specific locations within the reference sequence, thereby generating a profile of any genetic mutations in the polynucleotide sample.
[0344] In some embodiments, the method may further comprise comparing the sequences of the sequencing reads of the first subset to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified nucleotide residues within the reference sequence, thereby generating a profile of the modification status of nucleotide residues in the polynucleotide sample.
[0345] In some embodiments, the method may further comprise comparing the sequences of the sequencing reads of the pooled fraction and the first subset to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of both unmodified nucleotide residues and genetic mutations at specific locations within the reference sequence, thereby generating a profile of the modification status of nucleotide residues and any genetic mutations in the polynucleotide sample. In some embodiments, the method may comprise:
[0346] (i) using a methyltransferase enzyme configured to modify a nucleotide residue in a target position to apply a tag to each unmodified nucleotide residue in a polynucleotide of the sample, wherein each unmodified nucleotide residue is unmodified in the target position;
[0347] (ii) inactivating the methyltransferase;
[0348] (iii) preparing the polynucleotide sample into a sequencing library;
[0349] (iv) binding an affinity label to each tag;
[0350] (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label;
[0351] (vi) producing separate first and second subsets of the first fraction;
[0352] (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the polynucleotide sample;
[0353] (viii) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the polynucleotide sample;
[0354] (ix) sequencing the polynucleotides of the first subset, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample; and
[0355] (x) comparing the sequences of the sequencing reads to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the status of specific nucleotides and / or genetic mutations at specific locations within the reference sequence.
[0356] In some embodiments, the method may be an in vitro method for diagnosing disease in a subject, the method comprising diagnosing the disease based on a profile obtained by the disclosed method of the seventh aspect, using a polynucleotide sample obtained from the subject.
[0357] In an eighth aspect, there is provided a profile of a region of a polynucleotide sample, comprising both the status of nucleotide residues and any genetic mutations in the region of the polynucleotide sample, wherein the profile is obtained or obtainable by the method of the seventh aspect. In some embodiments, the region may consist of or comprise one or more specific portions of the polynucleotide sample.
[0358] Methods comprising making a determination based on the modification status of nucleotide residues and the presence of any genetic mutations in a polynucleotide may comprise the production of a profile. For example, the profile may reflect the position of modified and / or unmodified residues and genetic mutations within a polynucleotide such as a portion of a genome or an entire genome. Accordingly, methods comprising making a determination based on the CpG modification status of a plurality of CpG dinucleotides in a polynucleotide and the presence of any genetic mutations may comprise the production of a profile, wherein the profile may reflect the position of modified and / or unmodified CpG dinucleotides and any genetic mutations within a polynucleotide such as a portion of a genome or an entire genome. Thus, in some embodiments, comparing the modification status of nucleotide residues and the presence of any genetic mutations in a polynucleotide from the subject to the modification status and genetic sequence of the corresponding residues in a reference sample may comprise comparing the profile obtained from the sample with the profile of a reference sample. For example, comparing the modification status of cytosine residues in CpG dinucleotides of a polynucleotide and the presence of any genetic mutations from the subject to the CpG modification status and genetic sequence of the corresponding residues in a reference sample may comprise comparing the profile obtained from the sample with the profile of a reference sample. In some embodiments, the profile from the reference sample may comprise a profile that is representative of a healthy individual. In other embodiments, the profile from the reference sample may comprise a profile that is obtained from, or indicative of a particular disease, such as, for example, a cancer. The method may comprise comparing the profile obtained from the sample with a database of profiles. 
The database of profiles may comprise a plurality of profiles relating to a single disease, wherein the disease may be diagnosed in the subject from which the test sample was derived based on similarities between the profile of the test sample and the database of profiles.
[0359] In other embodiments, the database of profiles may comprise a plurality of profiles representative of different diseases, wherein a disease may be diagnosed in the subject from which the test sample was derived based on similarities between the profile of the test sample and one or more of the profiles within the database.
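One simple way of scoring the similarity between a test profile and the profiles in a database is sketched below. The sketch assumes that each profile is reduced to a set of (chromosome, position) pairs marking unmodified CpG sites, and it uses the Jaccard index as the similarity measure; both the data layout and the choice of measure are assumptions made purely for illustration, and any suitable comparison method may be substituted. The labels and positions are invented toy values.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of unmodified CpG positions."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def closest_profile(test_profile, database):
    """Return (label, similarity) of the database profile most similar to the test profile."""
    return max(
        ((label, jaccard(test_profile, profile)) for label, profile in database.items()),
        key=lambda item: item[1],
    )


# Hypothetical reference profiles keyed by an illustrative label.
database = {
    "healthy": {("chr1", 100), ("chr1", 250), ("chr2", 40)},
    "disease_A": {("chr1", 100), ("chr3", 75)},
}
test_profile = {("chr1", 100), ("chr1", 250), ("chr2", 41)}
label, score = closest_profile(test_profile, database)
```

A similarity score of this kind would only be a starting point; in practice the comparison may instead rely on a statistical test or a trained classifier.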
[0360] A comparison between profiles may be made using any suitable method. For example, a comparison may be made statistically, using an appropriate metric, for example, a p-value. A comparison may also be made using a machine-learning platform. In embodiments in which the method comprises determining, for a plurality of nucleotides, the presence (modified) or absence (unmodified) of any chemical modification catalysed by a methyltransferase, “modification status” may also be referred to as the “profile”. Thus, the disclosed method may be used to determine a profile of methyltransferase catalysed modifications within a polynucleotide sample.
[0361] In some embodiments, the method may comprise the detection of unmodified cytosine residues in CpG dinucleotides of a DNA sample. The terms “CpG”, “CpG site”, and “CpG dinucleotide”, are used interchangeably herein to refer to a cytosine-phosphate-guanine sequence in a 5’ to 3’ direction in the backbone of a nucleic acid.
[0362] The terms “CpG modification status” and “modification status” as used interchangeably herein, unless otherwise stated, refer to the presence (modified) or absence (unmodified) of any chemical modification at the C5 position of cytosine within one or a plurality of CpG dinucleotides.
[0363] In some embodiments, the method may be a method for preparing both a profile of the modification status at the cytosine C5 position of CpG dinucleotides and a genetic mutation profile, wherein the method further comprises comparing the sequences of the sequencing reads of the prepared polynucleotide sample to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified cytosine residues in specific CpG dinucleotides and / or genetic mutations within the reference sequence.
[0364] In embodiments in which the method comprises determining, for a plurality of CpG dinucleotides, the presence (modified) or absence (unmodified) of any chemical modification at the C5 position of each of the plurality of cytosines, the “CpG modification status” and “modification status” may also be referred to as the “profile”.
[0365] In embodiments in which the profile corresponds to the entire genome, the profile may be referred to as the “unmethylome profile” or “unmethylome”.
[0366] The terms “genetic profile”, and “genetic mutation profile” as used herein, unless otherwise indicated, may be used to refer to the presence or absence of a genetic mutation at each position within the polynucleotide sequence of interest. Thus, the disclosed method may be used to determine a profile of genetic mutations within a polynucleotide sample.
[0367] A profile may comprise both the modification status of nucleotide residues and the genetic profile of a polynucleotide from the subject.
[0368] In some embodiments, the polynucleotide may be a DNA sample, or may be a mixed sample, comprising DNA and RNA. In some embodiments, the sample may be a DNA sample comprising an epigenome. The terms “epigenome” and “epigenetic” as used herein, unless otherwise specified, refer to the chemical modification of a polynucleotide or genome in such a way that gene expression is regulated.
[0369] Thus, in some embodiments, the method may be a method for determining both the epigenetic profile and genetic mutation profile of a genomic DNA sample, the method further comprising determining the epigenetic profile based on the modification status of nucleotide residues in the sample. For example, the method may comprise determining the epigenetic profile based on the modification status of cytosine residues in CpG dinucleotides of the sample, i.e. the CpG modification status.
[0370] In some embodiments, the method may be a method for analysing a polynucleotide sample, such as a DNA sample, from a subject.
[0371] The method preferably does not comprise the use of bisulfite, such as bisulfite conversion of cytosine to uracil. The method preferably does not comprise pyridine borane base conversion or enzymatic deamination of unmethylated cytosine.
[0372] The method preferably does not comprise the use of methyl-binding antibodies or methyl-DNA Immunoprecipitation (also referred to as “MeDIP-Seq”).
[0373] The method preferably does not comprise the use of methyl-binding domain protein (referred to as “MBD-Seq”). The sequencing library may be suitable for use with high throughput sequencing methods. Sequencing the polynucleotides preferably comprises the use of next generation sequencing, such as sequencing applications on the Illumina platform. The method preferably does not comprise the use of a nanopore-based sequencing method.
[0374] In some embodiments, the method may be a method for determining both the modification status of one or more specific nucleotides and any genetic mutations in a polynucleotide sample from a subject. In some embodiments, the method may be a method for determining both the modification status of the cytosine residue in one or more specific CpG dinucleotides and any genetic mutations and / or any specific genotype in one or more biomarkers in a sample from a subject. In some embodiments, the method may be a method for determining both the modification status of the cytosine residues in one or more CpG dinucleotides and any genetic mutations in a plurality of genomic regions in a sample from a subject.
[0375] In some embodiments, the method may be a method for determining both the modification status of one or more specific adenine nucleotides and any genetic mutations and / or any specific genotype in one or more biomarkers in a sample from a subject.
[0376] In some embodiments, the method may be a method for determining both the modification status of one or more adenine residues and any genetic mutations in a plurality of genomic regions in a sample from a subject. In some embodiments, the method may be an in vitro method performed on a polynucleotide sample that has previously been obtained from a subject. The term “subject”, as used herein, may refer to any type of organism, including for example, a mammalian species (such as a human or domesticated animal), other animal species, a plant such as a crop, or other type of organism, including single-celled organisms, and viruses. The subject may be a developing organism, such as an embryo or foetus. The subject may be a healthy individual. The subject may be an individual that has, or is suspected of having, a disease or predisposition to a disease. The subject may be an individual in need of therapy or suspected of needing therapy.
[0377] In some embodiments, the method may be a method for determining the disease status of a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, and determining the disease status based on the modification status and any genetic mutations present. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, and determining the disease status based on the CpG modification status and any genetic mutations present.
[0378] In some embodiments, the method may be a method for diagnosing a disease in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, and diagnosing the disease based on the modification status and any genetic mutations present. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, and diagnosing the disease based on the CpG modification status and any genetic mutations present.
[0379] In some embodiments, the method may be a method for making a disease prognosis in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, and making a disease prognosis based on the modification status and any genetic mutations present. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, and making a disease prognosis based on the CpG modification status and any genetic mutations present.
[0380] Methods comprising making a determination based on the nucleotide modification status may comprise comparing the modification status of specific nucleotide residues in a polynucleotide from the subject to the modification status of the corresponding residues in a reference sample. Accordingly, methods comprising making a determination based on the CpG modification status may comprise comparing the modification status of cytosine residues in CpG dinucleotides of a polynucleotide from the subject to the CpG modification status of the corresponding residues in a reference sample.
[0381] Likewise, methods comprising making a determination based on the presence of a genetic mutation in a polynucleotide in a sample may comprise comparing a nucleotide sequence from a polynucleotide in the sample to the nucleotide sequence of the corresponding residues in a reference sample.
[0382] In some embodiments, the reference sample may comprise a polynucleotide from a healthy subject. The reference sample may comprise a polynucleotide from a diseased subject. The reference sample may comprise a polynucleotide from the same subject as the test sample, taken at a different time point and / or from a different location in the body. Differences in the nucleotide modification status between the test and reference samples, and / or the presence of a genetic mutation, may be indicative of the presence or absence of a particular phenotype or clinical feature.
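The comparison of a test sample with a reference sample described above can be illustrated with a minimal Python sketch. It is not part of the disclosed method: the function name `compare_to_reference`, the dictionary representation of modification status, and the assumption that the test and reference sequences are already aligned and of equal length are all hypothetical choices made for illustration.

```python
def compare_to_reference(test_status, ref_status, test_seq, ref_seq):
    """Compare a test sample with a reference sample (illustrative sketch only).

    test_status / ref_status: hypothetical dicts mapping a residue position
        to a modification status label such as "modified" or "unmodified".
    test_seq / ref_seq: nucleotide sequences assumed to be pre-aligned and
        of equal length, so that index i refers to the same residue in both.
    """
    # Positions profiled in both samples whose modification status differs.
    differing_sites = sorted(
        pos
        for pos in test_status.keys() & ref_status.keys()
        if test_status[pos] != ref_status[pos]
    )
    # Single-base differences, reported as (position, reference base, test base).
    mutations = [
        (i, ref_base, test_base)
        for i, (ref_base, test_base) in enumerate(zip(ref_seq, test_seq))
        if ref_base != test_base
    ]
    return differing_sites, mutations
```

In this sketch, positions present in only one of the two status dictionaries are ignored, and insertions or deletions are out of scope; a practical analysis would need to handle both.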
[0383] In some embodiments, the method may be a method for treating a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, diagnosing a disease based on the nucleotide modification status and any genetic mutations present, and providing a therapeutic composition to the subject to treat the disease based on the diagnosis. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, diagnosing a disease based on the CpG modification status and any genetic mutations present, and providing a therapeutic composition to the subject to treat the disease based on the diagnosis.
[0384] In some embodiments, the method may be a method for determining a personalised or precision method for treatment for a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, determining the disease status of the subject based on the nucleotide modification status and any genetic mutations present, and determining a personalised medical treatment for the subject based on the disease status. For example, the method may comprise determining both the modification status of specific cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, determining the disease status of the subject based on the CpG modification status and any genetic mutations present, and determining a personalised medical treatment for the subject based on the disease status.
[0385] In some embodiments, the method may be a personalised or precision method for treating a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, determining the disease status of the subject based on the nucleotide modification status and any genetic mutations present, and providing a personalised medical treatment to the subject based on the disease status. For example, the method may comprise determining both the modification status of specific cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, determining the disease status of the subject based on the CpG modification status and any genetic mutations present, and providing a personalised medical treatment to the subject based on the disease status. In some embodiments, the subject may be an individual that has been diagnosed with having a disease. The subject may be an individual that has been identified as being predisposed to, or at risk of having, a disease. The subject may be an individual that has not been diagnosed with having a disease.
[0386] In some embodiments, the subject may be an individual that has been diagnosed with cancer. The subject may be pending or undergoing treatment such as a cancer therapy.
[0387] The subject can be in remission of a cancer.
[0388] Cancer can be identified on the basis of epigenetic variations. Cancer may be associated with both DNA hypomethylation and hypermethylation, but these two types of epigenetic abnormalities may affect different DNA sequences, and occur at different stages of cancer progression. For example, genomic hypermethylation in cancer may be seen in CpG islands in gene regions, whereas hypomethylation may be observed in repeated DNA sequences in cancer, including heterochromatic DNA repeats, retrotransposons, and endogenous retroviral elements. In addition, unique sequences, such as transcription control sequences, are often subject to cancer-associated hypomethylation. These epigenetic changes may be detected using the disclosed method.
[0389] Cancer is also associated with the presence of genetic mutations. Hundreds of different genetic mutations, including sequence alterations, insertions, and deletions, have been found to be associated with different cancers. Genetic mutations associated with certain cancers may occur in specific genomic regions, for example, spontaneously from exposure to a carcinogen, or as an inherited genetic variant. Cancer-associated genetic mutations are known to frequently occur in the transcriptional regulatory regions of genes, including epigenetically regulated regions, and it is broadly understood that the cancer genome is hypomethylated, relative to a healthy genome. The disclosed method advantageously allows profiling of the mutation rates in epigenetically modified and unmodified regions of the genome, providing an improved method for the diagnosis of disease. In this context, references to “genetic mutations” encompass tumour-associated genotypes.
[0390] Thus, the method may be a method for diagnosing cancer in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, and making a cancer diagnosis based on the nucleotide modification status and any genetic mutations present. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, and making a cancer diagnosis based on the CpG modification status and any genetic mutations present.
[0391] The frequency of cancer-linked DNA hypomethylation, the nature of the affected sequences, and the absence of associations with DNA hypermethylation are believed to suggest a role for DNA hypomethylation early in carcinogenesis and cancer formation, but can also be associated with tumour progression.
[0392] Thus, in some embodiments, the method may be a method for detecting cancer in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, and determining the presence or absence of cancer based on the modification status of the nucleotide residues and any genetic mutations present. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, and determining the presence or absence of cancer based on the modification status of the cytosine residues and any genetic mutations present. The method may comprise determining both the modification status of adenine residues in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, and determining the presence or absence of cancer based on the modification status of the adenine residues and any genetic mutations present.
[0393] In some embodiments, the method may comprise the analysis of a plurality of genomic regions, and detecting the presence or absence of cancer from the modification status of nucleotide residues and the presence of any genetic mutations in the plurality of genomic regions.
[0394] In some embodiments, the method may be a method for detecting any type of cancer. Different types of cancer may be preferentially detected and / or analysed using different sampling approaches based on the disclosed method. In some embodiments, the method may be a method for treating cancer in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, diagnosing a cancer based on the nucleotide modification status and any genetic mutations present, and providing a therapeutic composition to the subject to treat the cancer based on the diagnosis. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, diagnosing a cancer based on the CpG modification status and any genetic mutations present, and providing a therapeutic composition to the subject to treat the cancer based on the diagnosis.
[0395] In some embodiments, the method may be a method for determining a personalised or precision method for cancer treatment for a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, determining the genetic profile of the cancer based on the nucleotide modification status and any genetic mutations present, and determining a personalised medical treatment for the subject based on the genetic profile of the cancer. For example, the method may comprise determining both the modification status of specific cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, determining the genetic profile of the cancer based on the CpG modification status and any genetic mutations present, and determining a personalised medical treatment for the subject based on the genetic profile of the cancer.
[0396] In some embodiments, the method may be a personalised or precision method for treating cancer in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, determining the genetic profile of the cancer based on the nucleotide modification status and any genetic mutations present, and providing a personalised medical treatment to the subject based on the genetic profile of the cancer. For example, the method may comprise determining both the modification status of specific cytosine residues in CpG dinucleotides of a polynucleotide in a sample from the subject and the presence of a genetic mutation in the sample using the disclosed method, determining the genetic profile of the cancer based on the CpG modification status and any genetic mutations present, and providing a personalised medical treatment to the subject based on the genetic profile of the cancer.
[0397] In some embodiments, the method may be an in vitro method performed on a DNA sample that has previously been obtained from a tissue biopsy. Biopsy is a diagnostic procedure for cancers and other diseases. For example, tissue biopsy may provide material for cancer genotyping, which may assist in the design of targeted therapeutic approaches.
[0398] In some embodiments, the method may be a method for genotyping a cancerous or otherwise diseased tissue. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide from a biopsy of the tissue using the disclosed method. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a polynucleotide and the presence of a genetic mutation in the polynucleotide from a biopsy of the tissue using the disclosed method. The method may further comprise designing a targeted therapeutic approach based on the nucleotide modification status and any genetic mutations present.
[0399] In some embodiments, the method may be performed on a DNA sample that has previously been obtained from a biopsy of any type of tissue from a subject. Existing tissue biopsy-based cancer diagnostic procedures may have limitations in relation to the analysis of the development and progression of certain types of cancers, due to tumour heterogeneity and evolution.
[0400] Liquid biopsy, which has the advantage of minimal invasiveness, has shown potential in detecting cancers, including early stage cancers and pre-cancerous lesions. Cell-free DNA (“cfDNA” or “circulating cfDNA”), which refers to DNA present at very low concentration in various bodily fluids, comprises extracellular nucleic acid fragments, for example, released by damaged cells during apoptosis, necrosis, or secretion. cfDNA has been found to exhibit the genetic and epigenetic alterations of cancers, including mutations, copy number alterations, chromosomal rearrangements, hypermethylation, and hypomethylation. The analysis of cfDNA has the potential to revolutionise the detection of early stage cancers and other diseases. In samples from cancer patients, cfDNA may comprise circulating tumour DNA (“ctDNA”), which is cell-free, tumour-derived fragmented DNA in a bodily fluid. As demonstrated in the enclosed examples, the disclosed method has advantageously been found to be capable of providing high quality and consistently reproducible results from the very low concentrations of nucleic acid that are typically present in liquid biopsy (such as circulating cfDNA) samples. Thus, the sample for use in the disclosed method may be a cfDNA sample. The sample may consist of or comprise ctDNA.
[0401] Various liquid biopsy samples may be used for the analysis of cfDNA, including blood, plasma, urine, and spinal fluid. Preferably the liquid biopsy sample may be a blood or plasma sample. Thus, in some embodiments, the method may be an in vitro method for diagnosing disease in a cfDNA sample from a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a cfDNA sample from the subject, and diagnosing the disease based on the nucleotide modification status and any genetic mutations present. For example, the method may comprise determining both the modification status of cytosine residues in CpG dinucleotides of a cfDNA sample from the subject and the presence of a genetic mutation in the sample, and diagnosing the disease based on the CpG modification status and any genetic mutations present. In some embodiments, the method may also be used to determine the tissue of origin of the nucleic acid present in a liquid biopsy sample.
[0402] Thus, in some embodiments, the method may be a method for identifying the cellular origin of cfDNA in a sample from a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide in a sample from the subject using the disclosed method, and identifying the cellular origin of the DNA based on the modification status of nucleotide residues and any genetic mutations present in the sample. In some embodiments, the method may be a method for diagnosing the recurrence of cancer in a subject. Accordingly, the method may comprise determining both the modification status of nucleotide residues and the presence of a genetic mutation in a cfDNA sample from the subject, comparing the nucleotide modification status and genetic mutation profile to the nucleotide modification status and genetic mutation profile of a tumour sample from the subject that has previously been determined using the disclosed method, and diagnosing the recurrence of the cancer in the subject based on the comparison. The cfDNA sample from the subject may be a blood sample. The tumour sample from the subject may be a sample of a solid tumour, for example previously obtained from the subject in a surgical procedure. In some embodiments, the method may be a method for sequencing polynucleotides, the method comprising sequencing the polynucleotides of the first and / or pooled fraction to generate a plurality of sequencing reads. In some embodiments, the method may further comprise enriching the pooled fraction for one or more genome regions of interest prior to sequencing. The method may further comprise comparing the sequences of the sequencing reads to a reference sequence or genome to determine the genomic location of the sequencing reads.
[0403] In some embodiments, the method may be a method for preparing a profile of the modification status of nucleotide residues and any genetic mutations in a reference sequence, such as one or more regions of a genome. Accordingly, the method may comprise separately sequencing the polynucleotides of the first and pooled fractions to generate a plurality of sequencing reads, comparing the sequences of the sequencing reads to the reference sequence, such as a reference genome or genomic region, to determine the location, such as the genomic location, of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified nucleotide residues and any genetic mutations at specific locations within the reference sequence, such as the one or more regions of the genome.
[0404] For example, the method may be a method for preparing a profile of the modification status at the cytosine C5 position of each CpG dinucleotide and any genetic mutations in a reference sequence, such as one or more regions of a genome. Accordingly, the method may comprise separately sequencing the polynucleotides of the first and pooled fractions, to generate a plurality of sequencing reads, comparing the sequences of the sequencing reads to the reference sequence, such as a reference genome or genomic region, to determine the location, such as the genomic location, of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified cytosine residues in specific CpG dinucleotides and any genetic mutations within the reference sequence, such as across the genome or across one or more regions of the genome. In another example, the method may be a method for preparing a profile of the modification status of cytosine residues at the N4 position and any genetic mutations in a reference sequence, such as one or more regions of a genome. Accordingly, the method may comprise separately sequencing the polynucleotides of the first and pooled fractions to generate a plurality of sequencing reads, comparing the sequences of the sequencing reads to the reference sequence, such as a reference genome or genomic region, to determine the location, such as the genomic location, of the sequencing reads within the reference sequence and thereby the presence or otherwise of cytosine residues unmodified at the N4 position, and any genetic mutations, at specific locations within the reference sequence, such as across the genome or across one or more regions of the genome.
[0405] In another example, the method may be a method for preparing a profile of the modification status of adenine nucleotides at the N6 position and any genetic mutations in a reference sequence, such as one or more regions of a genome. Accordingly, the method may comprise separately sequencing the polynucleotides of the first and pooled fractions to generate a plurality of sequencing reads, comparing the sequences of the sequencing reads to the reference sequence, such as a reference genome or genomic region, to determine the location, such as the genomic location, of the sequencing reads within the reference sequence and thereby the presence or otherwise of adenine residues unmodified at the N6 position, and any genetic mutations, at specific locations within the reference sequence, such as across the genome or across one or more regions of the genome.
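The profiling steps described above, in which reads from the first (labelled) fraction and the pooled fraction are located within a reference sequence, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the function names are hypothetical, reads are assumed to be already aligned to the forward strand without gaps and supplied as `(start, sequence)` tuples, and only CpG sites and single-base mismatches are considered. A labelled read indicates that at least one residue within it was unmodified, so the per-site counts are candidate evidence rather than a definitive call for every CpG the read spans.

```python
from collections import Counter


def cpg_positions(reference):
    """Return the 0-based position of the C in every CpG dinucleotide."""
    return [i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"]


def profile_fractions(reference, labelled_reads, pooled_reads):
    """Summarise two separately sequenced fractions (illustrative sketch only).

    labelled_reads: (start, sequence) tuples from the first (labelled) fraction.
    pooled_reads:   (start, sequence) tuples from the pooled fraction.
    Both are assumed to be pre-aligned to `reference` without gaps.
    """
    cpgs = cpg_positions(reference)

    # Count labelled-fraction reads overlapping each CpG site; these counts
    # are candidate evidence of an unmodified cytosine at that site.
    labelled_coverage = Counter()
    for start, seq in labelled_reads:
        end = start + len(seq)
        for pos in cpgs:
            if start <= pos < end:
                labelled_coverage[pos] += 1

    # Count single-base mismatches against the reference in the pooled
    # fraction, keyed as (position, reference base, read base).
    mismatches = Counter()
    for start, seq in pooled_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if pos < len(reference) and base != reference[pos]:
                mismatches[(pos, reference[pos], base)] += 1

    return dict(labelled_coverage), dict(mismatches)
```

A practical pipeline would instead rely on an established read aligner and variant caller and would account for sequencing error, strand, and read depth; the sketch only shows how read location links the two fractions to positions in the reference.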
[0406] In a ninth aspect there is provided a kit for determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide sample, the kit comprising:
[0407] (i) a methyltransferase enzyme configured to modify a nucleotide residue in a target position to apply a tag from a cofactor analogue to each unmodified nucleotide residue in a polynucleotide of the sample, wherein each unmodified nucleotide residue is unmodified in the target position; (ii) an affinity label precursor suitable for binding an affinity label to the tags, wherein the affinity label comprises biotin; and
[0408] (iii) a capture agent comprising a biotin-binding protein for fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label.
[0409] In some embodiments, the kit may further comprise a cofactor analogue suitable for use with the methyltransferase enzyme.
[0410] In some embodiments, the kit may further comprise sequencing adaptors for preparing the polynucleotide into a sequencing library. The sequencing adaptors for preparing the polynucleotide into a sequencing library may further comprise enzymes and reagents for reverse transcription, end repair, A-tailing, and / or adapter ligation, as described in accordance with the first aspect.
[0411] In some embodiments, the kit may or may not comprise a capture agent comprising a biotin-binding protein for fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label.
[0412] In some embodiments, the kit may further comprise a releasing agent for releasing the affinity label from the capture agent by denaturation of the biotin-binding protein.
[0413] In some embodiments, any of the disclosed kits may further comprise a biotinylated oligonucleotide “bait” probe to capture a genomic region of interest. The biotinylated oligonucleotide “bait” probe to capture a genomic region of interest may be as described herein.
[0414] In some embodiments, the kit may further comprise specifically designed primers to amplify a target region of the genome. The kit may further comprise reagents to perform PCR on the target region of the genome using the primers. The specifically designed primers to amplify a target region of the genome may be as described herein.
[0415] The kit may be suitable for use in or as a one pot method as described herein. The methyltransferase enzyme may be a methyltransferase as described herein.
[0416] The cofactor analogue may be a cofactor analogue as described herein. The cofactor analogue may be a synthetic methyltransferase cofactor analogue. The cofactor analogue may comprise a compound of formula (I).
[0417] The affinity label precursor may be an affinity label precursor as described herein. The affinity label precursor may comprise a compound of formula (II).
[0418] The capture agent comprising a biotin-binding protein for fractionating the sequencing library may be as described herein.
[0419] The releasing agent for releasing the affinity label from the capture agent by denaturation of the biotin-binding protein may be as described herein.
[0420] All features described herein (including any accompanying claims and drawings), and / or all of the steps of any method or process so disclosed, may be combined with any of the above aspects in any combination, except combinations where at least some of such features and / or steps are mutually exclusive.
[0421] The invention will now be illustrated by reference to specific Examples showing how embodiments may be carried into effect, which are not intended to be limiting. Data from the Examples is presented in the Figures, in which:
[0422] Figure 1A is a flow chart showing an overview of a method for producing the disclosed fractionated sequencing library.
[0423] Figure 1B is a diagram providing an overview of the production of a profile using the disclosed method. Purified fragmented DNA (cfDNA or fragmented genomic DNA) is derivatised by treatment with a CpG-targeting methyltransferase (in this case M.MpeI) and a synthetic cofactor analogue (in this case ETA-AdoHcy-N3) that results in the addition of tags at unmodified CpG sites. The methyltransferase is inactivated and the DNA fragments are then end-repaired and ligated to sequencing adapters. Fragments comprising a tag are subsequently labelled by attaching an affinity label (for example, biotin) to the tag, and isolated (for example, using streptavidin-coated magnetic beads). Tagged fragments (unmodified CpG sites) and untagged fragments (predominantly 5mCpG and 5hmCpG sites) can be fractionated to form the first (labelled) and second (unlabelled) fractions.
[0424] Figure 1C is a flow chart of an embodiment of the disclosed method comprising the formation of first and second subsets of the first (labelled i.e. unmodified) fraction, and subsequent pooled fraction comprising a subset of the first fraction and the second (unlabelled i.e. modified) fraction, for example, for determining both the modification status of nucleotide residues and the presence of a genetic mutation in a polynucleotide sample.
[0425] Figure 2A is a graph showing sequencing adapter ligation efficiencies using unlabelled DNA (control; left-hand cluster), methyltransferase-labelled DNA without methyltransferase inactivation (centre cluster), and methyltransferase-labelled DNA with methyltransferase inactivation (right-hand cluster), prior to adapter ligation.
[0426] Figure 2B shows the raw data for this experiment.
[0427] Figure 3A is a bar chart showing efficiencies for capture (light bars) and capture / release (dark bars) of target DNA from solution, as a function of target CpG site density.
[0428] Figures 3B and 3C are reproduced from Kriukiene et al. (Nature Communications 20134:2190). Figure 3D shows the enrichment efficiency of the present method for a (target) DNA molecule with a high density of CpG sites (10 sites per -150 bp). The target DNA is mixed with 24 ng of non-target DNA and is selectively purified with high efficiency at a range of concentrations. Final DNA concentration was quantified using spectrophotometry (Qubit).
[0429] Figure 3E shows enrichment of unmethylated DNA as a function of CpG density using three different enrichment chemistries. The current approach, using a single-pot reaction and one purification step (dark grey bars), shows an improvement of over three times in the retention of DNA throughout the Tag-Seq enrichment process, compared to the approach described by Kriukiene et al. (light grey) and an approach using two DNA purification steps (grey). Density of CpG sites is shown as number of sites per 300 bp genomic window.
[0430] Figure 4 is a graph of threshold cycle versus target DNA concentration (ng), showing a linear response from 1.25 ng target DNA down to 1.25 pg target in a background of 24 ng (over 19,000X excess) of DNA containing no target sites.
[0431] Figure 5 is a bar chart showing a comparison of sequencing coverage at CpG sites in the enriched (unmodified CpG) fraction (light grey) of a DNA sample, and the unenriched (modified CpG) fraction (dark grey) of the sample. Note that 1.4% and 44.7% of reads did not contain a CpG site for the enriched and unenriched fractions, respectively.
[0433] Figures 6A and 6B are bar charts showing sequencing coverage of enriched (light grey) and unenriched (dark grey) fractions at unmodified CpG sites (Figure 6A) and methylated CpG sites (Figure 6B). Note the different y-axis scales in the two plots.
[0434] Figure 7A is a bar chart showing enrichment using the disclosed method (NRPM) across a range of unmodified CpG densities (75 bp window) (groups 1-4, light grey bars) compared to similar enrichment using the known MeDIP-Seq method (group 5, hashed bars), with baseline for whole-genome sequencing provided for context (group 6, white bars).
[0435] Figure 7B is a bar chart corresponding to Figure 7A showing the corresponding enrichment profiles for methylated CpG sites.
[0436] Figure 8A shows example profiles obtained using the disclosed method (dark blue) compared to MeDIP-Seq (green) and WGBS (yellow) profiles for (top) the KRAS gene and (bottom) a megabase-scale region of chromosome 1.
[0437] Figure 8B shows the enrichment of read counts for two technical replicates of the profile obtained using the disclosed method (blue), MeDIP-Seq (yellow) and shallow whole genome sequencing (red) at gene transcription start sites (TSS).
Figure 8C shows that the profile obtained using the disclosed method (blue) correlates inversely with the MeDIP-Seq profile (purple) and with chromatin domain organisation identified in Hi-C experiments (red) on the megabase scale.
Figure 9A shows enrichment of genomic DNA at regions corresponding to H3K4 monomethylation. Comparison of two experiments using the disclosed method (technical repeats, different users) (light and dark blue), MeDIP-Seq (green) and shallow whole genome sequencing (control, no enrichment) (orange).
Figure 9B shows enrichment of genomic DNA at regions corresponding to H3K4 trimethylation. Comparison of two experiments using the disclosed method (technical repeats, different users) (light and dark blue), MeDIP-Seq (green) and shallow whole genome sequencing (control, no enrichment) (orange).
Figure 9C shows enrichment of genomic DNA at regions corresponding to H3K27 acetylation. Comparison of two experiments using the disclosed method (technical repeats, different users) (light and dark blue), MeDIP-Seq (green) and shallow whole genome sequencing (control, no enrichment) (orange).
Figure 10 shows a Spearman correlation analysis to assess the similarity of the profiles obtained using the disclosed method across six different cell lines and for three technical repeats of each sample. Dark blue indicates a high degree of similarity of the profiles. Notably, DNA from each cell line has a clearly distinct profile when compared to the sample technical repeats, consistent with the known utility of methylation profiles for the identification of tissues.
[0438] Figure 11 shows volcano plots comparing tumour and normal adjacent tissue profiles obtained using the disclosed method, across the genome for six patients with a range of different cancers. Differentially methylated windows are defined as those with an adjusted p-value of less than 5% and a log2-fold change in signal of greater than 0.58 (1.5X). Red lines indicate the locations of these thresholds in the volcano plots. Blue markers are windows that show hypermethylation in cancer; red markers are windows that are hypomethylated in cancer.
Figures 12A and 12B show Agilent TapeStation (Cell-free DNA ScreenTape) traces showing profiles for the cell-free DNA (cfDNA) that was input for the disclosed profiling method (top) and the output from the enriched, amplified libraries (bottom) of the disclosed method, for a healthy patient sample (Figure 12A) and for a sample from a patient with Stage 1 non-small cell lung cancer (Figure 12B). The mono- and dinucleosomal pattern of the input cfDNA is maintained in the final libraries, the size of which corresponds to the duplicated original strand plus the Illumina P5 / P7 sequencing adaptors.
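The thresholds quoted for the volcano plots of Figure 11 can be sketched as a simple window classifier. This is a minimal illustration, not the inventors' analysis pipeline: the function name and the toy inputs are hypothetical, the fold change is assumed to be a base-2 logarithm (which is what makes 0.58 correspond to about 1.5X), and the direction of the call assumes that the signal reports unmodified CpG sites, so that a loss of signal in tumour is read as hypermethylation.

```python
import math

# Thresholds quoted in the text: adjusted p-value < 5 % and |log2 fold change| > 0.58.
ADJ_P_MAX = 0.05
LOG2_FC_MIN = 0.58


def classify_window(log2_fc, adj_p):
    """Label one genomic window from its tumour-vs-normal log2 fold change in signal."""
    if adj_p >= ADJ_P_MAX or abs(log2_fc) <= LOG2_FC_MIN:
        return "not significant"
    # Assumption: the signal tracks unmodified CpGs, so less signal in tumour
    # is interpreted as hypermethylation and more signal as hypomethylation.
    return "hypermethylated" if log2_fc < 0 else "hypomethylated"


# A 1.5-fold change corresponds to a log2 fold change of about 0.585.
print(round(math.log2(1.5), 3))

# Hypothetical windows: (log2 fold change, adjusted p-value).
for window in [(-1.2, 0.01), (0.9, 0.001), (1.0, 0.2), (0.3, 0.001)]:
    print(window, classify_window(*window))
```

Both conditions must hold for a window to be called, which is why the third and fourth toy windows are reported as not significant.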
[0439] Figure 13 shows example profiles using the disclosed method for a genomic region (SHOX2 gene, a known methylation biomarker for lung cancer) in the healthy (blue) and lung cancer (grey) patients, compared to genomic DNA, extracted from the healthy patient’s buffy coat (yellow). Black traces show the duplicate profiles in both cases, dark blue tick marks show the known CpG site density across the gene. Traces are based on normalised read counts for all profiles and the cfDNA samples are displayed on the same scale for direct comparison.
[0440] Figure 14A-C shows a summary of data derived from triplicate repeat experiments using DNA isolated from FFPE (formalin-fixed, paraffin-embedded) samples. A) Enrichment (normalised read count) as a function of CpG site density (75 bp windows) showing steadily increasing levels of enrichment with increasing CpG density. B) Plots showing normalised read counts across the APC gene transcription start site.
[0441] Consistent with the enrichment profiles in (A), enrichment of DNA at the CpG-dense gene promoter is more marked for Patient A (pink) than Patient B (blue) than Patient C (green). Profile of the HT-29 cell line (colorectal cancer) shown in grey for comparison.
[0442] C) PCA plot showing excellent consistency of the technical replicates of these samples.
[0443] Figure 15 shows an example profile for a genomic region (TP53 gene) showing epigenetic profile from the first subset of the first (labelled i.e. unmodified) fraction (A - top row); profile of DNA of the same first subset of the first (labelled i.e. unmodified) fraction shown in the top row, that has additionally been enriched using a targeted enrichment / sequencing panel (B - middle row); and profile of DNA of the second (unlabelled i.e. modified) fraction, enriched using the same targeted enrichment / sequencing panel (C - bottom row).
[0444] Examples
The disclosed method may be used to provide a uniquely straightforward and robust approach to epigenetic profiling that provides concurrent or simultaneous readout of genetic features of the genome of interest. The method is an enzymatic technology that enriches for unmodified nucleotide residues such as, in particular, CpG sites, across the polynucleotide sample, such as across the whole genome. The method requires no de novo knowledge of the polynucleotide sequence for its application in epigenetic profiling. It does not damage the nucleic acid sample nor rely on base conversion and therefore allows concurrent analysis of other genetic features, such as mutations. The method may also provide a profile whose signal correlates with markers of active genomic regions. The enzymatic chemistry enables unbiased fractionation of modified and unmodified nucleic acids from a sample for subsequent analysis.
[0445] The Examples below show that the epigenetic aspect of the disclosed method is a uniquely sensitive approach, relative to other available methods for epigenetic analysis. The method can be applied at polynucleotide (e.g. DNA) concentrations that are compatible with single-cell analysis (picogram inputs). The workflow also enables concurrent readout of a genome’s genetic and epigenetic features.
[0446] As a result of the simplicity of the approach, comprising a single enzymatic step, followed by fractionation using a capture probe, technical repeats of the experiment show excellent consistency. Enrichment of unmodified nucleic acids leads to an epigenetic profile that achieves saturation at around 70M, 150 bp reads. Studies in cell lines and in patient tissue samples demonstrate the potential of the disclosed method as a platform for the diagnosis of disease and the identification of tissue of origin in a sample, for example. The underlying chemistry requires no a priori assumptions to be made about the sample, making the platform ideally suited as a research tool, for example, for the discovery of novel biomarkers of disease.
[0447] A flow chart providing an overview of the disclosed method is shown in Figure 1A and an overview of an embodiment of the disclosed method is shown schematically in Figure 1B. A flow chart showing the disclosed method is shown in Figure 1C. The inventors have identified that, as a result of the second fraction being enriched for polynucleotides lacking an affinity label, when the polynucleotides of the second fraction are sequenced, poor coverage of the polynucleotide sample is obtained in regions comprising unmodified nucleotides, which may be epigenetically significant regions of the sample.
[0448] As indicated in Figure 1C, by pooling the second (unlabelled i.e. modified) fraction with a subset of the first (labelled i.e. unmodified) fraction, a pooled fraction may be produced that is representative of the initial polynucleotide sample. The production and sequencing of a pooled fraction in this way has been found to allow the generation of sequencing information providing excellent and unbiased coverage of the polynucleotide sample as a whole. The pooled fraction may be used for genetic analysis either by whole genome sequencing or by targeted enrichment using a panel of bait oligonucleotides to enrich e.g. the exome. In the embodiment shown in Figure 1B, a bacterial DNA methyltransferase enzyme (M.MpeI) is used to target unmodified CpG sites for modification with an unnatural cofactor analogue of S-adenosyl-L-methionine referred to herein as ETA-AdoHcy-N3.
[0449] Incubation of the methyltransferase, DNA and cofactor for one hour results in complete modification of the target DNA, which is functionalised with azide-terminating tags. These tags can be further modified (for example, biotinylated) to enable fractionation of modified and unmodified DNA (where ‘unmodified’ refers to all genomic DNA fragments containing one or more CpG dinucleotides that are unmodified (for example, not methylated, hydroxymethylated, carboxylated, or acylated) at the C5-position).
[0450] The inventors have developed a modified library preparation that integrates this labelling step, thereby minimising handling and purification steps, and as a result, improving robustness and maximising sensitivity.
[0451] Example 1
[0452] A significant advantage of the disclosed method is that it employs a single clean-up step for the entire process, dramatically improving the efficiency of the fractionation. This is made possible by the use of, firstly, a step to inactivate the methyltransferase and, secondly, the surprising activity of the enzymes for library preparation in the resulting buffer mixtures.
[0453] Figure 2 shows that the efficiency of adapter ligation is significantly inhibited in the absence of inactivation of the methyltransferase after labelling and before adapter ligation. This surprising result is due to the high binding affinity of the methyltransferase enzyme to the DNA molecule, which has been found to limit the activity of DNA-targeting enzymes in subsequent steps of the procedure. This activity can be recovered by inactivating the methyltransferase enzyme.
[0454] Example 2
Figure 3 shows the results of example experiments investigating the recovery of DNA samples with different affinity labels and comprising different CpG densities.
[0455] In the experiment of Figure 3A, a mixture containing 100 ng of DNA carrying a known number of CpG sites (0, 1, 2, 4 or 10) was incubated with M.MpeI (0.0274 µg/µL) and ETA-AdoHcy-N3 (100 µM). The reaction was incubated at 37 °C for 1 hour. The DNA was purified using AMPure beads (Beckman Coulter), followed by conjugation of an affinity label comprising biotin using click chemistry. Finally, DNA was purified using a standard PCR clean-up kit (Zymo Clean and Concentrate). Purified DNA was fractionated using DynaBeads MyOne Streptavidin-coated beads.
[0456] The beads were then washed twice with 150 µL of PBST. Finally, captured DNA was released. As shown in Figure 3A, capture efficiency is improved by the current method (grey bars), relative to the method reported by Kriukiene et al. (Nature Communications 2013, 4:2190), Figure 3C.
[0457] Figure 3A (blue bars) shows the release efficiency of DNA in the current workflow (Active-Seq).
[0458] The disclosed method provides a significant improvement in capture efficiency relative to the method described in Kriukiene et al. (Nature Communications 2013, 4:2190). As shown in Figure 3B (reproduced from Figure 2b of Kriukiene et al.), Kriukiene et al. report capture efficiencies in the 20-30% range using a method comprising an azide-DBCO label. This is significantly lower than the capture efficiencies that may be obtained using the method disclosed herein, as shown, for example, in Figure 3A.
[0459] The method described by Kriukiene et al. shows (in Figure 2c, reproduced herein as Figure 3C) capture of around 30-40% of target DNA containing 2 CG sites using an azide-DBCO affinity label and streptavidin-coated magnetic beads. In contrast, while the method disclosed herein is able to isolate DNA at similar input levels to those of the method described by Kriukiene et al., the captured DNA may be recovered from the capture agent in much more significant proportions, at least in part due to the efficient release of the labelled DNA molecules (see Figure 3D). Kriukiene et al. only includes data on the level of DNA capture, and there is no discussion or data in Kriukiene et al. on the efficiency of release of the sample from the magnetic beads. The present inventors have found that, using the method disclosed by Kriukiene et al., the release of enriched DNA fragments is highly inefficient and inconsistently reproducible. The efficient enrichment of DNA that is rich in CpG sites is critical for even representation of the (enriched) genome in the sequencing experiment. CpG-rich regions often lie in important regulatory regions of the genome. Figure 3E shows a plot of mean normalised read count per million reads (NRPM) for samples prepared using the method described in Kriukiene et al. (left-hand bars) or the method disclosed herein (right-hand bars), as a function of the number of CpG sites in a given read. The plot is generated for CpG-rich regions of the genome (CG islands).
[0460] Figure 3E clearly shows higher read densities across CpG rich regions, demonstrating the significantly improved enrichment of CpG-rich DNA using the disclosed method.
[0461] The overall effect of the method disclosed herein is to enable efficient enrichment of unmodified DNA from as little as a few picograms of input DNA. This is particularly critical for samples where the DNA concentration is limited, such as liquid biopsy (blood, urine, saliva, spinal fluid) samples.
[0462] Example 3
[0463] Experiments were performed to establish how the enrichment / fractionation platform performs as a function of DNA concentration, at input amounts consistent with cfDNA and single cell analyses; and the linearity of the enrichment efficiency across DNA molecules with a range of (unmodified) CpG site densities. The inventors have found that this latter issue was a particular limitation of the method described by Kriukiene et al., which resorted to dilution of the methyltransferase enzyme in the labelling reaction to limit the number of (relatively insoluble) DNA modifications introduced to a single DNA molecule. Between 1.25 ng and 1.25 pg (equivalent to between 200 and 1 / 3 of a copy of the human genome) of target DNA (153 bp, containing 10 unmodified CpG sites) were spiked into a background of 24 ng of non-target DNA (142 bp containing no CpG sites). The target DNA was tagged and thereby enriched using streptavidin coated beads for analysis by qPCR, the results of which are shown in Figure 4.
[0464] Enrichment efficiencies in excess of 80% were obtained for all of the spike-in samples, clearly demonstrating the compatibility of the approach for enrichment of DNA at input levels consistent with single-cell analysis. Tagged DNA is compatible with PCR and can be amplified using a standard polymerase, following enrichment.
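The spike-in arithmetic behind this experiment can be checked with a short calculation. The sketch below is illustrative only: the 24 ng background and the 1.25 ng to 1.25 pg range come from the text, while the ten-fold dilution steps and the function name are assumptions made for the example, since the intermediate points of the series are not stated.

```python
BACKGROUND_NG = 24.0  # non-target DNA per reaction, as stated in the text


def background_excess(target_ng):
    """Mass ratio of non-target to target DNA in one spike-in reaction."""
    if target_ng <= 0:
        raise ValueError("target mass must be positive")
    return BACKGROUND_NG / target_ng


# Hypothetical ten-fold dilution series spanning the stated range
# (1.25 ng down to 1.25 pg); the actual steps used are not given.
series_ng = [1.25 / 10**i for i in range(4)]

for ng in series_ng:
    print(f"{ng * 1000:8.2f} pg target -> {background_excess(ng):8.0f}x background")
```

At the lowest input the background is in roughly 19,200-fold mass excess, in line with the "over 19,000X" figure given for Figure 4.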
[0465] As shown in Figure 3A, the initial step of enrichment (capture of labelled DNA, for example, by streptavidin-coated beads) shows only a very minor dependence on the number of CpG sites available on a DNA molecule. Light grey bars show capture efficiencies and dark grey bars show capture / release of target DNA from solution.
[0466] Example 4
[0467] Having demonstrated the performance of the biochemical approach on simple DNA fragments, the utility of the platform disclosed herein on genomic DNA was investigated by generating genome-wide epigenetic profiles from DNA extracted from a range of cell lines. Extracted DNA was fragmented by sonication (~150 bp) and subject to enrichment of the DNA fragments lacking CpG modification. Samples were sequenced using an Illumina NovaSeq platform (Source Biosciences) to approximately 120M reads per sample. For the enriched, unmodified DNA fraction, saturation analysis shows that data reaches 90% saturation between 65M and 90M reads. Initial quality control using MultiQC showed low levels of read duplicates (~15%) and an average enrichment in GC-content of the genome, from 41% in the nascent human genome to an average of ~47% in enriched samples, consistent with enrichment at regions rich in CpG dinucleotides, such as CG islands (lung cancer derived cell lines showed higher average GC content, with a mean of approximately 54%).
[0468] Example 5
Successful enrichment at CpG sites was assessed by comparison of the enriched (unmodified CpG) and unenriched (modified CpG) fractions of the genome by sequencing. This was done by examining the fraction of reads containing a CpG site and the sequencing coverage at each CpG site. In the enriched fraction, 98.6% of the reads contain a CpG site. By contrast, in the unenriched fraction only 55.3% of reads contain a CpG site, indicating effective enrichment at CpG sites. Furthermore, in the enriched fraction, a majority of the CpG sites of the genome (54.0%) are covered by greater than 5 reads, whereas in the unenriched fraction, this figure is just 6.3% of the CpG sites.
[0470] In the human genome, 70-80% of the CpG sites are modified (e.g. methylated) and, hence, in the disclosed method the enriched fraction of the sample might be expected to be focussed only on the remaining 20-30% of the genome’s CpG sites. Despite this, over half of the genomic CpG sites were found to have ‘high’ coverage (> 5-fold) in the enriched (unmodified) fraction (light grey bars), as shown in Figure 5. This is likely due, at least in part, to the high efficiency of the disclosed method. For enriched DNA, typically between 1 and 5% of reads were found to contain no CpG sites. The source of these reads is likely varied but will include non-specifically enriched DNA, as well as reads that do not cover a motif but that originate from a molecule that does (specifically enriched but CpG not sequenced). This read fraction is denoted as the ‘background’ for the enriched sample.
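The 'background' defined above (the share of reads lacking a CpG site) can be computed directly from read sequences. The following is a minimal sketch with a hypothetical function name and toy reads; it is not the inventors' pipeline, and a real analysis would read sequences from a FASTQ or BAM file.

```python
def background_fraction(reads):
    """Fraction of reads that contain no CpG dinucleotide.

    "CG" is its own reverse complement, so checking the sequenced strand
    is enough to decide whether the fragment covers a CpG site.
    """
    if not reads:
        raise ValueError("no reads supplied")
    lacking = sum(1 for read in reads if "CG" not in read.upper())
    return lacking / len(reads)


# Toy reads for illustration only: two contain a CpG, two do not.
toy_reads = ["ACGTTAGC", "TTTTAAAA", "ccggatat", "GATTACA"]
print(background_fraction(toy_reads))
```

Applied to the figures quoted in Example 5, a fraction in which 98.6% of reads contain a CpG site has a background of 1.4%.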
[0471] To further understand the composition of the enriched DNA fraction, sequencing coverage of CpG sites was compared at known modified and unmodified CpG sites (as determined by whole genome bisulfite sequencing (WGBS)). The results are shown in Figures 6A and 6B. As defined herein, an unmodified (e.g. ‘unmethylated’) site is a site having a modification (e.g. methylation) level (β-value) of less than 0.05 by whole genome bisulfite sequencing. A ‘modified’ (e.g. ‘methylated’) site has a modification (e.g. methylation) level (β-value) of greater than 0.95 by whole genome bisulfite sequencing. A total of 430,245 CpG sites met the definition of an ‘unmodified CpG site’ (< 5 % modified by WGBS). Significant enrichment was observed at these sites, as judged by the high coverage of sites in the enriched fraction (70% of sites (~301,000 CpG sites) have greater than 5-fold coverage) as compared to the unenriched fraction of the sample (less than 1% (~1500 sites) have greater than 5-fold coverage). This is in good agreement with the initial validation of the approach, confirming that where CpG sites are unmodified, efficient enrichment is seen using the disclosed method. There are 53,081 CpG sites in the genome defined as ‘modified’ (> 95 % modified by WGBS). Similar coverage of these sites is observed in both enriched and unenriched fractions of the sample, indicating that little enrichment of DNA occurs at these highly modified sites. Hence, where CpG sites are modified, little enrichment of these sites is seen using the disclosed method.
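The site definitions and the coverage comparison described above can be sketched as follows. The thresholds (below 0.05 and above 0.95 modification level by WGBS, and more than 5-fold coverage) are taken from the text; the function names and the toy data are hypothetical and stand in for per-site WGBS levels and read depths.

```python
def classify_site(level):
    """Assign a WGBS modification level (0 to 1) to the classes defined in the text."""
    if level < 0.05:
        return "unmodified"
    if level > 0.95:
        return "modified"
    return "intermediate"


def high_coverage_fraction(sites, label, min_depth=5):
    """Share of sites in class `label` covered by more than `min_depth` reads."""
    depths = [depth for level, depth in sites if classify_site(level) == label]
    if not depths:
        return 0.0
    return sum(1 for d in depths if d > min_depth) / len(depths)


# Toy (modification level, read depth) pairs; illustrative values only.
sites = [(0.01, 12), (0.02, 7), (0.04, 3), (0.50, 4), (0.97, 2), (0.99, 6)]

print(high_coverage_fraction(sites, "unmodified"))
print(high_coverage_fraction(sites, "modified"))
```

As a consistency check on the quoted numbers, 70% of 430,245 sites is about 301,000 sites.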
[0472] Example 6
[0473] As further validation of the disclosed method, the inventors sought to understand the enrichment of genomic DNA, as a function of unmodified (as determined by whole genome bisulfite sequencing) CpG site density. This analysis mirrors the validation experiments using DNA molecules of known sequence and CpG site density (as shown in Figure 4) but for enrichment using genomic DNA. The results are shown in Figures 7A and 7B.
[0474] A remarkably similar enrichment profile was observed for genomic DNA as for the model DNA fragments with known CpG site densities, with efficient enrichment of DNA even where only one or two unmodified CpG sites are available. The enrichment profile for the disclosed method shows consistency across the range of unmodified site densities (Figure 7A). This is in stark contrast to the analogous experiment for MeDIP-Seq for highly modified DNA molecules (Figure 7B). For MeDIP-Seq, no significant enrichment of DNA molecules was observed (relative to the WGS baseline) with a CpG density of less than 3 sites in a 75 bp genomic window, and a significant bias was observed towards enrichment of densely modified regions of the genome (> 6 CpG sites per 75 bp).
[0476] In all, these results are consistent with the initial validation of the disclosed approach, which demonstrates exceptionally high enrichment efficiencies for unmodified CpG sites; near uniform efficiency across a range of CpG densities and no significant off-target enrichment of DNA in the disclosed method.
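The CpG density axis used in this analysis (sites per 75 bp genomic window) can be illustrated with a simple counter. This is a sketch under stated assumptions: the function name is hypothetical, windows are taken as consecutive and non-overlapping, and all CpG sites are counted regardless of modification status, whereas the analysis above uses WGBS data to restrict the count to unmodified sites.

```python
def cpg_counts(sequence, window=75):
    """Count CpG dinucleotides in consecutive, non-overlapping windows.

    A CpG that straddles a window boundary is not counted in either
    window; this simplification is an assumption of the sketch.
    """
    if window < 2:
        raise ValueError("window must be at least 2 bp")
    seq = sequence.upper()
    return [seq[start:start + window].count("CG")
            for start in range(0, len(seq), window)]


# Toy 150 bp sequence: five CpGs in the first window, none in the second.
toy = "CG" * 5 + "A" * 65 + "T" * 75
print(cpg_counts(toy))
```

Windows can then be grouped by their count to compare normalised read depth across density classes, as in Figures 7A and 7B.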
[0477] Example 7
[0478] A number of studies were conducted to compare the disclosed method to other (epi)genomic analyses. An advantage of the disclosed method is provided by the enzymatic targeting of unmodified CpG sites. The approach is well-suited to the enrichment of hypomethylated DNA from tumour cells in the blood. Key genomic features that were hypothesised to be prominent in the profiles produced by the disclosed method are extended regions of unmodified DNA, which are epigenetically stable and conserved in mammals, with consistently low modification levels on length scales of 5-20 kbp. The term ‘non-modified island’ (NMI) is used herein since such unmodified regions are rather more island-like in the profiles produced by the disclosed method.
[0479] Read count peaks in profiles of unmodified DNA produced by the disclosed method (bottom) were found to anticorrelate to those observed in MeDIP-Seq (top) and correlate with regions of low modification, identified in whole genome bisulfite sequencing (yellow), as shown in Figure 8.
[0480] At the gene-level, read counts of unmodified DNA using the disclosed method show peaks centred at CG islands, that span the broader, regulatory regions of genes and are consistent with NMIs, as shown in Figure 8A. NMIs are thought to exist to reduce mutation rates in functionally-important genomic regions.
[0481] Deamination of methylated cytosine, which converts to thymine, has been shown to be the most frequent mutation in human cancers. NMIs play a central role in regulation of gene expression and their methylation levels are regulated by the TET (demethylating) enzymes, via the polycomb protein complex. In agreement with this, genome-wide analysis shows significant enrichment of unmodified DNA around transcription start sites, relative to the analogous MeDIP-Seq experiment, as shown in Figure 8A. On the scale of hundreds of kbp-to-Mbp, anticorrelation of the profile of unmodified DNA produced by the disclosed method (bottom) to both MeDIP-Seq (top) and WGBS (middle) is retained, as shown in Figure 8A. A particularly striking aspect of the profile produced by the disclosed method (bottom) is the presence of clear domains of modified and unmodified genomic regions, that are consistent with the expected correlation between genomic modification (e.g. methylation) levels and genome organisation (Figure 8A).
[0482] Example 8
[0483] For further validation, the disclosed method was compared to established sequencing approaches that are known to correlate with DNA methylation levels, as shown in Figure 8. Clear regions of highly unmethylated DNA are observed, which are also evident in the MeDIP-Seq and WGBS profiles (Figure 8A). Enrichment in the profile produced with the disclosed method at transcription start sites and markers of active chromatin (H3K4me1, H3K4me3 and H3K27ac) anticorrelates with loss of MeDIP-Seq signal at these regions (Figure 8B and Figure 9).
[0485] These regions of low / high methylation are correlated with defined structural domains of the genome identified in Hi-C experiments, as shown in Figure 8C.
[0486] Example 9
Having demonstrated the potential of the disclosed method to generate meaningful genome-wide epigenetic profiles, the approach was applied to nine DNA samples derived from cultured cell lines for a range of cancers (breast (MCF7, HCC1937), colorectal (HT29, SW48, COLO201, RKO), liver (HepG2) and lung (SW1271, NCI-H2170)).
[0487] Genome-wide correlation analysis of this dataset shows excellent correlation of a series of three technical repeats for each of the samples. Each of the cell lines examined forms a distinct cluster of correlated data, with cell lines from similar tissues broadly clustering together, consistent with the expectation that the epigenetic profile can be employed for the identification of tissue of origin for a sample. These distinct cell-line-specific profiles result in part from the robustness of the method and the remarkable consistency of the disclosed method across sequencing runs and operators.
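Figure 10 is described as a Spearman correlation analysis of these profiles. A pairwise comparison of two profiles can be sketched as below; this is an illustrative pure-Python implementation with hypothetical names and toy read counts, not the analysis code used for the figure, and in practice a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length profiles with at least two windows")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (var_x * var_y) ** 0.5


# Toy normalised read counts per window for two hypothetical replicates.
replicate_1 = [5.0, 0.2, 3.1, 0.0, 7.4]
replicate_2 = [4.6, 0.3, 2.9, 0.1, 8.0]
print(spearman(replicate_1, replicate_2))
```

Computing this value for every pair of samples gives the kind of similarity matrix that is clustered and displayed as a heat map.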
[0488] In order to establish the potential of the disclosed method as a method for the diagnosis of cancer, a series of experiments were performed using tumour tissue and normal adjacent tissue from six patients with different cancers. Here, whole genome correlation analysis provides a simple overview that clearly highlights the discriminative ability of the disclosed method for both disease diagnosis and derivation of tissue of origin from a patient sample (Figure 10).
[0489] This approach was extended to better understand the specific regions of the genome that give rise to differences in epigenetic profiles and their link to known biological function. The profiles produced by the disclosed method of tumour and normal adjacent tissue were compared for a patient in 75 bp windows, across the whole genome. This analysis requires no a priori knowledge of a patient’s genomic sequence and makes no assumptions about regions of interest. By doing so, tens or hundreds of thousands of differentially modified (e.g. methylated) regions were identified in the profiles produced by the disclosed method for each patient. These are summarised in a series of volcano plots, shown in Figure 11. The results show the relative statistical significance and, critically, the population of modified (e.g. hypo- and hypermethylated) regions identified in each patient (the disclosed method does not solely focus on discovery of unmodified regions of the genome). Examples of individual profiles produced using the disclosed method for the lung cancer tissue sample are shown at genes that have been implicated in relevant cancer pathways in the literature.
Example 10
[0490] DNA shed from tumour cells can be isolated in the blood of cancer patients. However, the technical challenge associated with its analysis is two-fold: the DNA is typically present in healthy and early-stage cancer patients at less than 10 ng per millilitre of plasma; and cell-free DNA isolated from plasma can contain less than 1% tumour fraction (ctDNA).
[0491] The disclosed method is ideally suited to the analysis of ctDNA because it performs well with input DNA orders of magnitude less than one nanogram and provides genome-wide analysis.
[0492] To demonstrate the suitability of the disclosed method for the analysis of ctDNA, two patient samples were prepared for analysis, one from a healthy patient and one from a patient diagnosed with stage 1 lung cancer (non-small cell lung cancer). DNA was extracted from 3 mL of plasma using an automated platform (Informed Genomics), which returned 47 µL of DNA at a concentration of 0.50 ng/µL and 0.55 ng/µL for the healthy and cancer patient, respectively (Figure 12A and 12B).
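The extraction figures above can be converted into a total yield and a yield per millilitre of plasma. The calculation below is a worked sketch with a hypothetical function name; it assumes the eluate volume is in microlitres and the concentration in nanograms per microlitre.

```python
def cfdna_yield(elution_ul, conc_ng_per_ul, plasma_ml):
    """Return (total ng recovered, ng recovered per mL of plasma)."""
    if plasma_ml <= 0:
        raise ValueError("plasma volume must be positive")
    total_ng = elution_ul * conc_ng_per_ul
    return total_ng, total_ng / plasma_ml


# Values from the text: 47 uL eluate from 3 mL plasma at 0.50 and 0.55 ng/uL.
for label, conc in (("healthy", 0.50), ("cancer", 0.55)):
    total, per_ml = cfdna_yield(47, conc, 3)
    print(f"{label}: {total:.2f} ng total, {per_ml:.2f} ng per mL plasma")
```

On these assumptions both samples yield roughly 8 to 9 ng per millilitre of plasma, consistent with the stated expectation of less than 10 ng per millilitre.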
[0493] The method was performed in duplicate (separate preparations and sequencing runs) with 8.9 ng and 7.2 ng input DNA for the healthy patient and 10 ng input on both occasions for the lung cancer patient. Both the input DNA samples and the output of the enriched library maintain the fragment size distribution that is characteristic of nucleosomal cell-free DNA (Figure 12).
[0494] Duplication rates for the sequencing data were 9.2% and 9% for the larger sequencing run, with coverage of 100M reads for both samples (Figure 13). This represents a significant improvement on duplication rates typical for approaches using base conversion, which can reach 30-40%. The background, defined as the percentage of reads lacking a CpG site in the dataset, was 4.5% and 3.6% for the healthy and cancer samples, respectively.
Example 11
[0495] Formalin-fixed, paraffin-embedded (FFPE) treatment typically leads to extensive damage (depurination, depyrimidination and deamination) of the genome. The ability to generate a meaningful epigenomic profile from DNA preserved in these samples using the disclosed method was investigated.
[0496] Genome-wide profiles were generated using the disclosed method for three FFPE embedded samples in triplicate, sourced from the Welsh Cancer Bank, derived from patients with colorectal cancer. Consistent with other sample types, sequencing reached 90% saturation by 80M (150 bp, paired-end) reads for all samples. The resultant datasets show good overall coverage of the genome and excellent consistency for the technical repeats (Figure 14).
[0497] For two of the three samples, high levels of relative enrichment of CpG-dense regions of the genome were observed (Figure 14). However, comparison of the three FFPE datasets to similar data for the HT-29 cell line shows good consistency of the profile and the observed ‘background’ of the sequencing dataset (reads lacking CpG sites) is consistently below 5% for all samples. Such an increase in the relative enrichment of regions that are dense in CpG sites is consistent with the expected damage of CpG sites by the FFPE treatment. This likely stems from a relative reduction of the concentration of enrichable DNA molecules with few CpG sites in the sample. Conversely, those molecules with many CpG sites retain a few taggable CpG sites, post FFPE treatment.
[0498] Example 12
[0499] The enriched (labelled i.e. unmodified) and unenriched (unlabelled i.e. modified) fractions were subjected to a second round of enrichment, using a targeted genetic panel (IDT xGen Pan-cancer hyb panel), as shown in rows B and C of Figure 15.
[0500] Targeted enrichment of the unmodified fraction gives rise to deep sequencing at promoter regions of the genome, whereas the vast majority of the DNA from the gene body is in the modified fraction of the sample and is captured by the targeted panel for targeted analysis of common cancer mutations, as shown in Figure 15.
Materials and Methods
[0501] A typical workflow is set out below. In DNA samples requiring DNA fragmentation, DNA was sheared to an average of 180 bp.
[0502] A mixture of DNA (<10 ng), M.MpeI and ETA-AdoHCy-N3 was prepared on ice. This solution was incubated at 37 °C for 1 hour. Following incubation, the methyltransferase enzyme was inactivated by heating.
[0503] Without purification, the sample was cooled to 10°C. End Repair & A-Tailing Master Mix was added (Kapa Biosystems). The mixture was mixed thoroughly by pipette aspiration and incubated at 20°C for 30 mins followed by a 65°C incubation for a further 30 mins.
[0504] The sample was cooled to 10°C, and sequencing adapters were ligated.
[0505] Without purification, biotin-PEG4-DBCO (Jena Biosciences) was added and the mixture was incubated at 37 °C for 1 hour with shaking at 500 rpm. The DNA was subsequently purified from the reaction mixture.
[0506] 5 µL Dynabeads MyOne Streptavidin C1 beads (ThermoFisher) were washed with 150 µL of PBST. The DNA was added to the beads and the mixture was further incubated at 23 °C for 15 minutes, with shaking at 1000 rpm. Once completed, the supernatant was removed (as the “second fraction”), and the beads were washed twice with 150 µL of PBST. Finally, the bound DNA was released from the beads (as the “first fraction”) by denaturation of streptavidin.
[0507] Amplified libraries were pooled together with 0.1% PhiX and sequenced on an S4 Flow Cell using an Illumina NovaSeq Sequencer (Source Biosciences). qPCR was performed on the Azure Cielo 6 thermocycler (Azure Biosystems) with the following conditions: initial denaturation at 98°C for 30 seconds, then 40 cycles at 95°C for 10 seconds and 60°C for 60 seconds with fluorescence detection. Analysis of the acquired fluorescence intensity and subsequent quantification of DNA in the samples was performed using Azure Cielo Manager Analysis Software (V1.0.4).
[0508] After sequencing, adaptors were removed from the reads using BBTools, and the reads were then aligned to human reference genome HG38 using BWA-MEM2. Ambiguously aligned reads and those with low mapping scores (MAPQ score < 40) were removed using SamTools. Duplicates were removed with Sambamba (PMID: 25697820) and reads hard-clipped using jvarkit (https://github.com/lindenb/jvarkit). Spearman correlation plots were generated from the processed BAM files with deepTools using a bin size of 1000 bp and RPGC normalisation. Saturation figures and CpG density plots were generated for Chr1-22 using the QSEA and Repitools R packages. To allow direct comparison of enriched and unenriched samples, BAM files were downsampled to the same sequencing depth using SamTools. High confidence methylated and unmethylated CpG sites used for comparison were taken from the consensus of two whole genome shotgun bisulphite sequencing (WGBS) datasets performed on cell line NA12878 by the same lab (www.encodeproject.org/experiments/ENCSR890UQO/), where less than 5% methylation in both datasets was considered to be unmethylated and greater than 95% methylation in both datasets was considered to be methylated.
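Two of the rules in the paragraph above lend themselves to a short illustration: the consensus call for high confidence CpG sites (less than 5% methylation in both WGBS datasets, or greater than 95% in both) and the downsampling of files to a common depth. The following Python sketch is illustrative only. The names `classify_cpg` and `downsample_fractions` are hypothetical, methylation is assumed to be expressed as a percentage per site, and "same sequencing depth" is assumed to mean matching the smallest read count among the samples being compared.

```python
def classify_cpg(meth_a, meth_b, low=5.0, high=95.0):
    """Consensus call for one CpG site from two WGBS datasets.

    meth_a, meth_b: percent methylation (0-100) in each dataset.
    Returns 'unmethylated', 'methylated', or None when the two
    datasets do not both fall beyond the same threshold.
    """
    if meth_a < low and meth_b < low:
        return "unmethylated"
    if meth_a > high and meth_b > high:
        return "methylated"
    return None  # not a high confidence site


def downsample_fractions(read_counts):
    """Fraction of reads to keep per sample to match the smallest one.

    read_counts: dict mapping a sample name to its read count.
    The returned fractions could be passed to a subsampling tool.
    """
    if not read_counts:
        raise ValueError("no samples supplied")
    target = min(read_counts.values())
    return {name: target / count for name, count in read_counts.items()}


if __name__ == "__main__":
    print(classify_cpg(2.0, 4.9))    # both below 5%
    print(classify_cpg(96.0, 99.0))  # both above 95%
    print(classify_cpg(2.0, 50.0))   # datasets disagree
    print(downsample_fractions({"enriched": 100, "unenriched": 50}))
```

Sites for which `classify_cpg` returns `None` would simply be excluded from the high confidence comparison set under these assumptions.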
Claims
1. A fractionated sequencing library produced from a polynucleotide sample, wherein the sequencing library comprises: (i) separate first and second subsets of polynucleotides that are derived from a first fraction of the polynucleotide sample, wherein the first fraction is enriched for polynucleotides comprising an affinity label, wherein the affinity label is bound to a tag that is bound site-specifically to a nucleotide residue in each polynucleotide; and (ii) a second fraction of polynucleotides that is enriched for polynucleotides lacking an affinity label.
2. A fractionated sequencing library as claimed in claim 1, wherein the tag has been applied by a methyltransferase enzyme using a methyltransferase cofactor analogue, wherein the methyltransferase enzyme is configured to modify a nucleotide residue in a target position to apply the tag to each unmodified nucleotide residue in a polynucleotide, wherein each unmodified nucleotide residue is unmodified in the target position.
3. A fractionated sequencing library as claimed in claim 1 or 2, wherein: (i) the first subset of the first fraction; and/or (ii) the second fraction; have undergone target enrichment for one or more genome regions of interest.
4. A fractionated sequencing library as claimed in any of claims 1-3, wherein the polynucleotide sample has not been amplified, and the fractionated sequencing library is an unamplified fractionated sequencing library.
5. A fractionated sequencing library as claimed in any of claims 1-4, wherein the polynucleotides of the sequencing library comprise an indexing barcode.
6. A fractionated sequencing library as claimed in any of claims 1-5, wherein the polynucleotide sample is a cfDNA sample.
7. A method for making a fractionated sequencing library from a polynucleotide sample, the method comprising: (i) using a methyltransferase enzyme configured to modify a nucleotide residue in a target position to apply a tag to each unmodified nucleotide residue in a polynucleotide of the sample, wherein each unmodified nucleotide residue is unmodified in the target position; (ii) inactivating the methyltransferase; (iii) preparing the polynucleotide sample into a sequencing library; (iv) binding an affinity label to each tag; (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for polynucleotides comprising an affinity label, and wherein the second fraction is enriched for polynucleotides lacking an affinity label; and (vi) producing separate first and second subsets of the first fraction.
8. A method as claimed in claim 7, wherein the method further comprises: (vii) forming a pooled fraction comprising: (a) at least a portion of the second subset of the first fraction, but not the first subset; and (b) at least a portion of the second fraction.
9. A method as claimed in claim 7 or 8, wherein the method further comprises target enrichment for one or more genome regions in: (i) the first subset of the first fraction; (ii) the second fraction; and/or (iii) the pooled fraction.
10. A method as claimed in any of claims 7-9, wherein the method does not comprise amplifying the polynucleotides before fractionation.
11. A method as claimed in any of claims 7-10, wherein: (a) the polynucleotide sample is prepared into an affinity labelled sequencing library (steps i-iv) prior to fractionation; and (b) the sequencing library is fractionated (step v) prior to producing separate first and second subsets of the first fraction (step vi).
12. A method as claimed in any of claims 7-11, wherein the preparation of an affinity labelled sequencing library (steps i-iv) is performed in a one-pot approach and involves at most one sample purification step prior to fractionation.
13. A method as claimed in any of claims 7-12, wherein binding an affinity label to each tag comprises adding an affinity label precursor directly into the sequencing library preparation mixture, without a washing step.
14. A method as claimed in any of claims 7-13, wherein using a methyltransferase enzyme to apply a tag to each unmodified nucleotide residue comprises the use of a methyltransferase cofactor analogue.
15. A method as claimed in any of claims 7-14, wherein the polynucleotide sample comprises DNA, and wherein the target position of the methyltransferase enzyme is selected from the group consisting of the cytosine C5 position, the cytosine N4 position, and the adenine N6 position.
16. A method as claimed in any of claims 7-15, wherein the affinity label comprises biotin, and wherein fractionating the sequencing library into first and second fractions comprises fractionation using a capture agent comprising a biotin-binding protein.
17. A method as claimed in any of claims 7-16, wherein preparing the polynucleotide sample into a sequencing library comprises end repair, A-tailing, and adapter ligation of the polynucleotides in the sample.
18. A method as claimed in any of claims 7-17, wherein amplifying the polynucleotides further comprises the incorporation of an indexing barcode into the polynucleotides.
19. A fractionated sequencing library, wherein the fractionated sequencing library is obtained or obtainable from a polynucleotide sample by a method as claimed in any of claims 7-18.
20. A method for determining the presence of a genetic mutation in a polynucleotide sample, the method comprising: (i) obtaining a fractionated sequencing library as claimed in any of claims 1-5 or 19, or preparing a fractionated sequencing library using a method as claimed in any of claims 7-18; (ii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the polynucleotide sample; and (iii) sequencing the polynucleotides of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the polynucleotide sample.
21. A method as claimed in claim 20, wherein the method comprises amplifying the polynucleotides before forming the pooled fraction, and wherein the method further comprises determining the proportion of the initial polynucleotide sample that is present in the first fraction and adding a corresponding quantity of the amplified second subset to at least a portion of the second fraction to form a pooled fraction that has a polynucleotide content that is representative of the initial polynucleotide sample.
22. A method as claimed in claim 20 or 21, further comprising determining the modification status of nucleotide residues in the polynucleotide sample, wherein the method comprises: (iv) sequencing the polynucleotides of the first subset of the first fraction, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample.
23. A method as claimed in any of claims 20-22, wherein the method further comprises target enrichment of the pooled fraction for one or more genome regions of interest in: (i) the first subset of the first fraction; (ii) the second fraction; and/or (iii) the pooled fraction, prior to sequencing.
24. A method as claimed in claim 23, wherein the method further comprises amplification of the polynucleotides of the pooled fraction after target enrichment.
25. A method as claimed in any of claims 20-24, wherein the polynucleotide sample comprises DNA, and wherein the target position of the methyltransferase enzyme consists of the cytosine C5 position, the method comprising: (i) using a methyltransferase enzyme configured to modify the cytosine C5 position of a CpG dinucleotide to apply a tag to each unmodified cytosine residue of the sample, wherein each unmodified cytosine residue is the cytosine of a CpG dinucleotide that is unmodified in the C5 position; (ii) inactivating the methyltransferase; (iii) preparing the DNA sample into a sequencing library; (iv) binding an affinity label to each tag; (v) fractionating the sequencing library into first and second fractions, wherein the first fraction is enriched for DNA molecules comprising an affinity label, and wherein the second fraction is enriched for DNA molecules lacking an affinity label; (vi) producing separate first and second subsets of the first fraction; (vii) combining at least a portion of the second subset with at least a portion of the second fraction to form a pooled fraction that comprises the complete nucleotide sequence of the DNA sample; and (viii) sequencing the DNA of the pooled fraction, and using the sequencing information to determine the presence of a genetic mutation in the DNA sample.
26. A method as claimed in claim 25, wherein the methyltransferase enzyme is a C5 methyltransferase, and wherein the methyltransferase cofactor analogue is ETA-AdoHcy-N3.
27. A method as claimed in claim 26, wherein the methyltransferase enzyme consists of or comprises a variant or fragment of the wild type M.MpeI sequence, comprising at least 80% sequence identity to the wild type M.MpeI sequence having the NCBI accession number BAC44284.
28. A method as claimed in any of claims 25-27, wherein the method further comprises sequencing the DNA of the first subset, and using the sequencing information to determine the modification status at the cytosine C5 position of each CpG dinucleotide in the DNA sample.
29. A method as claimed in claim 28, wherein the method is a method of preparing a profile of the modification status of nucleotide residues and/or any genetic mutations in the polynucleotide sample, wherein the method further comprises comparing the sequences of the sequencing reads of the pooled fraction and/or the first subset to a reference sequence to determine the location of the sequencing reads within the reference sequence and thereby the presence or otherwise of unmodified nucleotide residues and/or genetic mutations at specific locations within the sequence.
30. A method of diagnosing disease in a subject, the method comprising preparing a profile of a polynucleotide sample obtained from the subject by a method as claimed in claim 29, and diagnosing a disease based on the profile of the polynucleotide sample.
31. A method of determining the modification status of nucleotide residues in a polynucleotide sample, the method comprising: (i) obtaining an amplified fractionated sequencing library as claimed in any of claims 1-6 or 19, or preparing an amplified fractionated sequencing library using a method as claimed in any of claims 7-18; and (ii) sequencing the polynucleotides of the first subset, and using the sequencing information to determine the modification status of nucleotide residues in the polynucleotide sample.