Methods and systems for phased genome assembly

The method assembles phased genomes using k-mers from a reference panel, enhancing efficiency and accuracy in detecting linked variants and structural variations, addressing inefficiencies in traditional genomic sequencing.

WO2026136378A1 · PCT designated stage · Publication Date: 2026-06-25 · DOVETAIL GENOMICS LLC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: DOVETAIL GENOMICS LLC
Filing Date: 2025-12-16
Publication Date: 2026-06-25

Application Information

Patent Timeline

16 Dec 2025: Application
25 Jun 2026: Publication

WO2026136378A1

IPC: G16B30/20; G16B20/20; G16B50/00

AI Tagging

Application Domain

Proteomics Genomics

Technology Topics

Haplotype Bioinformatics

AI Technical Summary

Technical Problem

Existing genomic sequencing methods are inefficient and time-consuming for obtaining haplotype-resolved genomes, which are crucial for disease prediction, donor-host matching, and understanding genetic variations.

Method used

A method utilizing k-mers from a reference genome panel to assemble a target genome without mapping to a reference genome, leveraging haplotype-resolved reference genomes and generating ordered lists of k-mers to align and assemble genomes, enabling linkage information and phasing.

Benefits of technology

This approach allows for accurate and efficient assembly of phased genomes, detecting linked variants and structural variations with high sensitivity, reducing reliance on traditional mapping methods and improving disease association studies.

✦ Generated by Eureka AI based on patent content.

Abstract

Provided herein are methods of assembling phased genomes. The methods may comprise using a reference genome panel. The methods may comprise generating a set of k-mers corresponding to nucleic acid variants in a reference genome. Sequencing may be performed on a subject to identify long-range linkage information. Sequence reads may be analyzed for the presence of k-mers, and a genome may be assembled via determination of an order of k-mers.

Description

Attorney Docket No. 45269-750.601

METHODS AND SYSTEMS FOR PHASED GENOME ASSEMBLY

CROSS REFERENCE

[0001] This application claims the benefit of U.S. Provisional Application No. 63/735,706, filed December 18, 2024, which is incorporated herein by reference in its entirety.

BACKGROUND

[0002] Genomic sequences are useful for identifying markers or variants that may be related to diseases or other genetic traits of interest. Obtaining a genome for an individual may provide invaluable information for diagnoses of diseases or recommendations for specific treatment regimens. However, obtaining genomes for a given subject can be time consuming and inefficient.

SUMMARY

[0003] In an aspect, the present disclosure provides methods for obtaining an assembled genome. The genome may be a phased assembled genome comprising haplotype information.

[0004] In an aspect, the present disclosure provides a method of genome assembly comprising: (a) obtaining (i) a plurality of sequencing reads derived from a subject’s genome, and (ii) a set of ordered lists of k-mers derived from a panel of reference genomes, wherein each k-mer comprises a nucleotide sequence comprising a variant; (b) identifying sequence reads comprising a pair of k-mers of one of the set of ordered lists of k-mers; (c) generating an assembled genome by aligning the pair of k-mers of the sequence reads identified in (b) to k-mers in the set of ordered lists of k-mers.

[0005] In some embodiments, (c) comprises identifying an ordered list of k-mers from the set of ordered lists of k-mers that comprises the most k-mers from the plurality of sequence reads. In some embodiments, (c) further comprises identifying a sequence from the panel of reference genomes that corresponds with the ordered list of k-mers that aligns with k-mers of the sequence reads identified in (b), thereby generating an identified reference sequence. In some embodiments, (c) further comprises repeating said aligning and said identifying sequences of said reference genome for additional subsets of said ordered list of k-mers to determine a plurality of identified reference sequences. In some embodiments, (c) further comprises combining said plurality of identified reference sequences to generate said assembled genome. In some embodiments, (c) comprises denoting the pair of k-mers as linked k-mers. In some embodiments, the denoting comprises generating a graph data structure wherein said k-mers of the pair of k-mers are connected nodes. In some embodiments, each ordered list of k-mers comprises a plurality of k-mers of a chromosome.

[0006] In some embodiments, the variants are bi-allelic variants. In some embodiments, each k-mer of said set of ordered lists of k-mers comprises no more than one variant. In some embodiments, each k-mer sequence of said set of k-mer sequences is present no more than one time in each reference genome. In some embodiments, the method further comprises (d) assembling the sequences of the plurality of sequencing reads into contigs; and (e) aligning the contigs to the assembled genome to identify additional variants. In some embodiments, the assembled genome comprises sequences derived from said panel of reference genomes.

[0007] In some embodiments, the method further comprises, prior to (a), sequencing a subject’s genome to generate the plurality of sequencing reads. In some embodiments, the sequencing reads are derived from nucleic acids subjected to a proximity-ligation reaction. In some embodiments, the plurality of sequencing reads is generated by cross-linking a sample derived from said subject, fragmenting nucleic acids in said sample to produce nucleic acid fragments, ligating said nucleic acid fragments to produce ligated nucleic acid fragments, reversing crosslinks, and sequencing said ligated nucleic acid fragments. In some embodiments, the fragmenting comprises contacting the nucleic acids with a restriction endonuclease, a micrococcal endonuclease, a transposase, or DNase I. In some embodiments, fragmenting comprises contacting the nucleic acids with DNase I.

[0008] In an aspect, the present disclosure provides a system for genome assembly, the system comprising: at least one system memory comprising (i) a plurality of sequencing reads derived from a subject’s genome, and (ii) a set of k-mer sequences derived from a reference genome panel, wherein each k-mer comprises a nucleotide sequence comprising a variant; and a computer memory coupled to said at least one system memory and programmed to: (a) identify sequence reads comprising a pair of k-mers of one of the ordered lists of k-mers; (b) generate an assembled genome by aligning the k-mers of the sequence reads identified in (a) to k-mers of the ordered lists of k-mers.

[0009] In another aspect, the present disclosure provides a computer readable medium or mediums for detecting a presence or an absence of cancer in a subject, wherein said computer readable medium or mediums comprising a set of instructions recorded thereon, wherein, when said set of instructions are executed by a processor, the following steps are implemented: (a) obtaining: (i) a plurality of sequencing reads derived from a subject’s genome, and (ii) a set of ordered lists of k-mers derived from a panel of reference genomes, wherein each k-mer comprises a nucleotide sequence comprising a variant; (b) identifying sequence reads comprising a pair of k-mers of one of the ordered lists of k-mers; (c) generating an assembled genome by aligning the k-mers of the sequence reads identified in (b) to k-mers of the ordered lists of k-mers.

INCORPORATION BY REFERENCE

[0010] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

[0011] An understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings of which:

[0012] FIG. 1 shows a computer system that is programmed or otherwise configured to implement methods provided herein.

[0013] FIG. 2 illustrates a schematic of the experimental methods described herein.

[0014] FIG. 3 shows a pie graph of sample types that may be supported by the methods and systems described herein.

[0015] FIG. 4 displays contact maps and quality scores for the fish sample.

[0016] FIG. 5 illustrates structural variation captured by the methods and systems described herein that was missed by WGS and RNA-seq.

[0017] FIG. 6 illustrates an output screen for an assay using the methods described herein.

[0018] FIG. 7A shows an IDT xGen Exome Panel, IDT xGen CNV Panel, or whole exome sequencing.

[0019] FIG. 7B shows capture panels integrated into the data and aligned to detect SNV/INDELs.

[0020] FIG. 8 shows graphs of coverage and clip consensus assembly.

[0021] FIG. 9 displays a graph of genomic coverage around a breakpoint from the matrix visualization method of breakpoint detection.

[0022] FIG. 10 shows example CUSUM plots used for breakpoint detection.

[0023] FIG. 11 shows soft and hard clipped reads, and various breakpoints that have been detected.

[0024] FIG. 12 displays a comparison between the matrix visualization and the CUSUM/clipping methods and also compares these methods to Sanger validation.

[0025] FIG. 13 displays results of an HCC1187 tumor/normal mixing experiment with exome sequencing and a CNV xGen™ panel.

[0026] FIG. 14 shows VAF distributions and limit of detection plots for the exome sequencing/CNV xGen™ experiment.

[0027] FIG. 15 shows that exome capture with the methods and systems described herein detected structural variations down to 20% tumor fraction and below.

[0028] FIG. 16 shows a small signal observable on the matrix plot of the 5% tumor sample.

[0029] FIG. 17 shows a sample with multiple structural alterations of unknown significance.

[0030] FIG. 18 shows a copy number ratio plot for the different chromosomes.

[0031] FIG. 19 shows a copy number ratio plot for part of chromosome 20.

[0032] FIG. 20 shows that the ovarian sample had a series of connected interchromosomal insertions.

[0033] FIG. 21 shows the chromosome 19 insertion sequences contained poor prognostic markers, such as XAB2, CCL25, SPINT2, and YIF1B.

[0034] FIG. 22 shows a schematic of the method used.

[0035] FIG. 23 shows a generic example of a graph.

[0036] FIG. 24 shows an example genomic string graph.

[0037] FIG. 25 shows an example of how a de Bruijn graph is built.

[0038] FIG. 26 shows an example of a variation graph.

[0039] FIG. 27 shows a high-level diagram of an HPRC pangenome graph.

[0040] FIG. 28 shows a schematic of the method herein.

[0041] FIG. 29 shows more information about an unphased haplotype in the ovarian tumor diploid genome (“Hap1”) as compared to a second unphased haplotype (“Hap2”) and the reference genome hg38.

[0042] FIG. 30 shows a schematic of a method used herein.

[0043] FIG. 31 shows an example workflow for a proof-of-concept study of an ovarian tumor.

[0044] FIG. 32A shows differences between the two unphased haplotypes for the largest 23 contigs.

[0045] FIG. 32B shows the total size, number of contigs, and total size of largest 23 contigs for the two unphased haplotypes, the diploid genotype (Unphased Hap 1 + Unphased Hap2), and the reference genome (hg38).

[0046] FIG. 33 shows a schematic of different locations that KRAS Exon 1 could be mapped to in the two different haplotypes.

[0047] FIG. 34 shows an example alignment for a personalized diploid genome for an ovarian tumor.

[0048] FIG. 35 shows the same alignment but with reads of mapping quality = 0 (MQ0).

[0049] FIG. 36 shows reads belonging to homologous regions in both chromosomes that are identical.

[0050] FIG. 37 shows the alignment, in which the regions where reads appear to be “missing” represent regions of the genome that differ between the two copies of a chromosomal region (e.g., heterozygous/hemizygous).

[0051] FIG. 38 shows how putative somatic variants were identified.

[0052] FIG. 39 shows phasing information within the diploid alignments.

DETAILED DESCRIPTION

[0053] Provided herein are methods and systems for assembling phased genomes. A human genome consists of two homologous sets of chromosomes. As such, to understand the true genetic makeup of an individual, the maternal and paternal copies, or haplotypes, must be identified along with assembling the genome. To assemble a given genome for an individual, a full genome needs to be generated for each haplotype. Traditional methods rely on a standard human reference genome: sequencing reads are mapped and aligned to the reference genome, and haplotype phasing methods are then used to assemble a complete genome.

[0054] Obtaining a haplotype in an individual is useful in several ways. First, haplotypes are useful clinically in predicting outcomes for donor-host matching in organ transplantation and are increasingly used as a means to detect disease associations. Second, in genes that show compound heterozygosity, haplotypes provide information as to whether two deleterious variants are located on the same allele, greatly affecting the prediction of whether inheritance of these variants is harmful. Third, haplotypes from groups of individuals have provided information on population structure and the evolutionary history of the human race. Lastly, recently described widespread allelic imbalances in gene expression suggest that genetic or epigenetic differences between alleles may contribute to quantitative differences in expression. An understanding of haplotype structure can help delineate the mechanisms of variants that contribute to allelic imbalances.

[0055] Short reads generated by high-throughput sequencing rarely allow one to directly observe which allelic variants are linked. The disclosure provides one or more methods that allow for determining which allelic variants are linked using allelic variants on read pairs.

[0056] Historically, only a few standard reference human genomes had been generated and were not haplotype resolved. As such, in order to determine a haplotype-resolved genome for a new individual, large scale sequencing needed to be performed, followed by mapping, and aligning the genome against the standard reference genome. In addition to mapping and aligning, haplotype phasing is performed by generating linkage data between sequences to determine sequences that reside on a same chromosome.

[0057] However, recent developments have allowed for the generation of large numbers of haplotype resolved genomes. With the increased amount of available reference genomes, individual haplotype genomes can be leveraged to identify variants in a haplotype context, and each individual haplotype can be reduced to a series of variants in a given order. Similarly, any given haplotype-resolved genome that needs to be assembled can be reframed as a series of variants that are linked in a given order. By identifying variants in a genome, and their linkage information to other variants, genomes can be efficiently assembled.

[0058] Target genome assembly may be performed by identifying a database of k-mers that represent variants (e.g., SNPs) in a reference genome panel. Each k-mer sequence in the database is unique such that the detection of any k-mer in a sequencing read specifically indicates the presence of a given variant. Any given genome (e.g., target genome or reference genome) can be represented as an order of these unique k-mers. Assembling a target genome can be performed by identifying which k-mers are present and determining the order of these k-mers.

[0059] Generation of k-mers from a reference genome may comprise identifying k-mer sequences that contain only one variant. By using k-mers with only one variant, the detection of a given k-mer sequence in a sequencing read may indicate the presence of a specific variant. This may improve the accuracy of the methods as each variant is separately analyzed, without preexisting linkages to adjacent variants.

[0060] Generation of k-mers from a reference genome may comprise identifying k-mer sequences that are present in a single copy in a reference genome. By using k-mers present in a single copy in a reference genome, the detection of a given k-mer sequence in a sequencing read may indicate the presence of a specific sequence in a given genome. This may improve accuracy of the method as each k-mer can correspond to a unique sequence and may eliminate confusion as to what sequence to use to assemble a genome.
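The two selection criteria above (one variant per k-mer, one copy per reference genome) can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the function name `variant_kmers`, the 0-based variant positions, and the choice to center each variant in its k-mer are assumptions made for illustration.

```python
from collections import Counter

def variant_kmers(reference, variant_positions, k):
    """Return {variant position: k-mer} for variants whose centered k-mer is usable."""
    half = k // 2
    # Count every k-mer in the reference so multi-copy sequences can be rejected.
    counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    variants = set(variant_positions)
    kept = {}
    for pos in sorted(variants):
        start = pos - half
        if start < 0 or start + k > len(reference):
            continue  # variant too close to a sequence end for a centered k-mer
        if sum(1 for p in range(start, start + k) if p in variants) != 1:
            continue  # the k-mer would span more than one variant
        kmer = reference[start:start + k]
        if counts[kmer] != 1:
            continue  # the k-mer occurs more than once in this reference
        kept[pos] = kmer
    return kept

# Toy data: variants at 4 and 5 are too close together; the variant at 11 is kept.
print(variant_kmers("AACCGGTTACGATCA", [4, 5, 11], 5))
```

For a real genome the full k-mer count would normally be held in a disk-backed or probabilistic structure rather than an in-memory `Counter`; the sketch only shows the filtering logic.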

[0061] For assembly of a target genome, sequencing data may be generated and analyzed for the presence of k-mers. For example, the presence of specific k-mer sequences can be identified in sequencing reads derived from a sample. A specific k-mer sequence may be associated with a specific variant (e.g., SNP) and specific reference genome sequence. As such, the presence of a specific k-mer may indicate that the target genome has the variant and may indicate the sequences that surround the variant. The presence of more than one specific k-mer sequence may allow for linkage information to be generated. The presence of a first k-mer and a second k-mer on a sequence read may be used to identify that the k-mers are linked.
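As a hedged illustration of this step, the Python sketch below scans read pairs for known variant k-mers by exact lookup (no mapping) and counts how often two variants are observed on the same read pair. The names `kmers_in_read`, `linked_pairs`, and `kmer_to_variant` are hypothetical, and reverse-complement matching and sequencing errors are ignored for brevity.

```python
from collections import Counter
from itertools import combinations

def kmers_in_read(read, kmer_to_variant, k):
    """Return the variant IDs whose k-mer occurs in the read, in read order."""
    hits = []
    for i in range(len(read) - k + 1):
        variant = kmer_to_variant.get(read[i:i + k])
        if variant is not None and variant not in hits:
            hits.append(variant)
    return hits

def linked_pairs(read_pairs, kmer_to_variant, k):
    """Count how often two variant k-mers are seen on the same read pair."""
    links = Counter()
    for mate1, mate2 in read_pairs:
        found = kmers_in_read(mate1, kmer_to_variant, k)
        for variant in kmers_in_read(mate2, kmer_to_variant, k):
            if variant not in found:
                found.append(variant)
        # Every pair of variants seen together is recorded as one linkage observation.
        for a, b in combinations(sorted(found), 2):
            links[(a, b)] += 1
    return links

# Toy k-mer database and read pairs (illustrative values only).
db = {"ACG": "v1", "TTA": "v2", "GGC": "v3"}
pairs = [("AACGT", "CTTAG"), ("TACGA", "GTTAC"), ("ACGGC", "AAAAA")]
print(linked_pairs(pairs, db, 3))
```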

[0062] Once the sequencing data has been analyzed, a list of k-mers that are present in the target genome can be generated. Using linkage data, an order of k-mers can be inferred such that the target genome is represented by a list of ordered k-mers. The list of ordered k-mers can then be aligned against the reference genomes. As described above, reference genomes can be represented as lists of ordered k-mers. The k-mers of the target genome can be aligned against a plurality of different reference genomes represented as ordered lists of k-mers. The alignment may be performed so as to identify segments of the list of k-mers of the target genome that align best to a given reference genome. The segments of the reference genome that align to the k-mers of the sequencing reads can then be directly denoted as a segment of the target genome. This process of alignment can be repeated for each segment of the target genome, with each new segment aligning to a portion of a reference genome. The portions of the reference genomes that align can then be assembled together to form the assembled target genome.

[0063] The alignment may be performed using a greedy algorithm. For example, the alignment may identify a best alignment for the target genome’s list of k-mers from all possible alignments to a given reference genome panel. The alignment algorithm may find an alignment that provides a best score and then continue to perform an alignment for another non-overlapping segment of the list of ordered k-mers for the target genome.
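One simple way to picture such a greedy alignment is sketched below in Python. It is a simplification under stated assumptions, not the disclosed algorithm: k-mers are represented by hypothetical IDs of the form `site:allele`, the score of a segment is just the number of consecutive matching k-mers, and segments are chosen left to right rather than by a global best-first search.

```python
def greedy_align(target, panel):
    """Split `target` into segments, each taken from the best-matching reference.

    `target` is an ordered list of k-mer IDs; `panel` maps a reference name to
    its ordered list of k-mer IDs (each ID assumed unique within a reference).
    Returns (reference name or None, start, end) tuples over target indices.
    """
    index = {name: {kmer: j for j, kmer in enumerate(kmers)}
             for name, kmers in panel.items()}
    segments = []
    i = 0
    while i < len(target):
        best_name, best_len = None, 0
        for name, kmers in panel.items():
            j = index[name].get(target[i])
            if j is None:
                continue
            n = 0  # length of the run that matches this reference in order
            while (i + n < len(target) and j + n < len(kmers)
                   and target[i + n] == kmers[j + n]):
                n += 1
            if n > best_len:
                best_name, best_len = name, n
        if best_name is None:
            segments.append((None, i, i + 1))  # k-mer absent from every reference
            i += 1
        else:
            segments.append((best_name, i, i + best_len))
            i += best_len  # continue with the next non-overlapping segment
    return segments

panel = {"refA": ["s1:A", "s2:C", "s3:G", "s4:T"],
         "refB": ["s1:G", "s2:C", "s3:T", "s4:A"]}
print(greedy_align(["s1:A", "s2:C", "s3:T", "s4:A"], panel))
```

In this toy case the target is reported as a mosaic of two reference haplotypes; the corresponding reference sequences for each segment would then be concatenated to form the assembled genome.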

[0064] Genome assembly methods of the present disclosure may be performed without mapping sequence reads to a reference genome. By using k-mer sequences derived from reference genomes, mapping may be avoided. Instead, the sequencing reads may be analyzed for the presence of sequences, as opposed to mapping to a specific location or sequence of a reference genome.

[0065] In various embodiments, reference genomes are used for generating assembled genomes. Reference genomes can be a part of a collection of genomes of different individuals. These reference genomes may be haplotype resolved. The reference genomes may be derived from or a component of the Human Pangenome. The Human Pangenome Reference Consortium relates to a project for sequencing and generating genomes for many different individuals. Each genome may be generated from and specific to one individual. In contrast, the GRCh38.19 reference genome is a mosaic assembled from more than 20 individuals, with 70% of the sequences contributed by one individual. A reference genome may be derived from a single individual and may be haplotype phased. Because the reference genome is derived from a single individual and is haplotype phased, the reference genome may more accurately reflect sequences found in other individuals. These improved accuracies over the GRCh38.19 reference genome may allow for more accurate sequences that may be used to generate a target genome. The reference genomes can be any previously assembled genome. For example, genomes assembled via the methods described herein may be used as reference genomes.

[0066] K-mers may be derived from a reference sequence, such as a reference genome. K-mer sequences corresponding to variants may comprise a sequence of the length k. A reference genome may be separated or segmented into k-mers, wherein at least a portion of the genome is represented by a plurality of sequences of length k. A set of k-mers may comprise sequences corresponding to a genome that are of length k. A set of k-mer sequences can comprise all subsequences of a given sequence that have a length k. K-mer sequences can be at least 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, or more nucleotides long. K-mer sequences can be about 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, or more nucleotides. K-mer sequences can be no more than 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, or fewer nucleotides. Generating k-mer sequences may comprise generating all possible k-mer sequences of length k for a given sequence. Generating k-mer sequences may comprise generating a portion of all possible k-mer sequences of length k for a sequencing read. Generating k-mer sequences may comprise generating a portion of all possible k-mer sequences of length k for a sample. Generating a k-mer may comprise identifying a variant (e.g., SNP) and generating a k-mer that comprises the variant. The variant may be centered in a k-mer sequence. For example, an SNP may be at a middle location in a k-mer sequence.
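A short Python sketch may make the k-mer definitions concrete. The first function lists every subsequence of length k (a sequence of length L yields L - k + 1 of them); the second builds the pair of k-mers for a bi-allelic SNP placed at the middle position. The function names, 0-based coordinates, and the restriction to odd k are illustrative assumptions, not details from the disclosure.

```python
def all_kmers(sequence, k):
    """Return every length-k substring of `sequence`, ordered by start position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def snp_kmers(reference, pos, alt, k):
    """Return (ref_kmer, alt_kmer) with the SNP at the middle of an odd-length k-mer."""
    if k % 2 == 0:
        raise ValueError("use an odd k so the SNP has a true middle position")
    half = k // 2
    if pos - half < 0 or pos + half >= len(reference):
        raise ValueError("SNP is too close to the end of the sequence")
    ref_kmer = reference[pos - half:pos + half + 1]
    # Substitute the alternate allele at the central base.
    alt_kmer = ref_kmer[:half] + alt + ref_kmer[half + 1:]
    return ref_kmer, alt_kmer

print(all_kmers("ACGTA", 3))
print(snp_kmers("ACGTACG", 3, "C", 5))
```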

[0067] In various embodiments, the methods and compositions of the disclosure enable the haplotype phasing of diploid or polyploid genomes with regard to a plurality of variants. The methods described herein can thus provide for the determination of linked variants based on variant information from read pairs and/or contigs assembled using the same. Examples of variants include, but are not limited to, those that are known from the 1000 Genomes, UK10K, HapMap and other projects for discovering genetic variation among humans. Disease association to a specific gene can be revealed more easily by having haplotype phasing data as demonstrated, for example, by the finding of unlinked, inactivating mutations in both copies of SH3TC2 leading to Charcot-Marie-Tooth neuropathy (Lupski JR, Reid JG, Gonzaga-Jauregui C, et al. N. Engl. J. Med. 362:1181-91, 2010) and unlinked, inactivating mutations in both copies of ABCG5 leading to hypercholesterolemia 9 (Rios J, Stein E, Shendure J, et al. Hum. Mol. Genet. 19:4313-18, 2010). The variant may be a single nucleotide polymorphism (SNP). The allelic variants may be bi-allelic variants. For example, the locus may comprise either a first nucleotide or a second nucleotide at a given SNP site. In some cases, large structural variants may be detected with high sensitivity. In some cases, single nucleotide variants, indels, or copy number variations may be detected with whole genome sequencing (WGS)-like sensitivity. These methods may leverage existing Illumina platforms and may not require additional platforms or equipment. The methods and systems herein may be able to capture structural variation missed by WGS and RNA-seq.

[0068] In some cases, the methods and systems described herein may be integrated with capture panels. For example, an IDT xGen Exome Panel, IDT xGen CNV Panel, or whole exome sequencing may be used (FIG. 7A). Samples such as cells, blood, or tissue may be prepared, fixed, fragmented, proximity ligated, purified, subjected to library conversion, and sequenced. The capture panels may then be integrated into the data, which may be aligned and have SNVs/INDELs, CNVs, and/or SVs detected (FIG. 7B).

[0069] In some embodiments, the methods and systems described herein may be used for enhanced breakpoint detection. Frequently, breakpoint detection methods for shotgun data may fail when applied to Hi-C data. Common methods for breakpoint refinement may exploit the relatively short template sizes (approximately 500 base pairs) of shotgun paired-end libraries. In such methods, it may be easy to identify potential “split” reads overlapping the breakpoint. Local assembly of split reads may be successful due to a lack of chimeric reads. However, these methods often fail when applied to Hi-C data, due to the somewhat “unpredictable” nature of proximity linkage and an abundance of chimeric reads.

[0070] FIG. 8 shows graphs of coverage and clip consensus assembly. Other methods frequently use typical short read shotgun sequencing (e.g., Illumina short reads) to find breakpoints. As shown on the left side of FIG. 8, along the genome position on the X-axis, if reads are piled up, a sudden drop-off in coverage level may be seen. As shown on the right side of FIG. 8, if reads are aligned for where they clip, a consensus may be found of where they start and stop aligning.

[0071] Structural variant detection using other tools may be limited by matrix bin size resolution. FIG. 9 displays a graph of genomic coverage around a breakpoint from the matrix visualization method of breakpoint detection. The graph shows coverage at one level on one side of a breakpoint (1X-3X-fold coverage in the negative direction from the breakpoint center) that then drops to almost zero in the positive direction from the breakpoint center. Binning of reads into a matrix with bins of a certain size may limit the ability to finely resolve the breakpoints.

[0072] In some cases, a cumulative sum (CUSUM) statistic may be used to detect a data series inflection point. This method is generally designed for time series data. This method can calculate cumulative sums of deviations from a reference value and can detect changes when the cumulative sum exceeds a certain threshold. In some cases, a CUSUM inflection point may provide accurate readout of a base-pair resolution breakpoint, providing an alternative method for structural variant detection. FIG. 10 shows example CUSUM plots used for breakpoint detection. The dotted lines indicate the inflection points, which correspond to base-pair resolution breakpoints.
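A minimal Python sketch of a CUSUM-style breakpoint call on per-base coverage follows. It assumes the reference value is the mean coverage of the window and that the inflection point is the position where the absolute cumulative sum peaks; the function name `cusum_breakpoint` and the `threshold` parameter are hypothetical, and the disclosure does not specify these choices.

```python
def cusum_breakpoint(coverage, threshold):
    """Return the index after which coverage shifts, or None if no shift is called.

    The cumulative sum of deviations from the mean peaks (in absolute value) at
    the last position before a sustained change in level.
    """
    if not coverage:
        return None
    reference = sum(coverage) / len(coverage)  # assumed reference value: the mean
    running, best_index, best_value = 0.0, None, 0.0
    for i, depth in enumerate(coverage):
        running += depth - reference
        if abs(running) > best_value:
            best_index, best_value = i, abs(running)
    # Only report a change when the cumulative sum exceeds the threshold.
    return best_index if best_value > threshold else None

# Toy coverage that drops from 3X to 0X between index 3 and index 4.
print(cusum_breakpoint([3, 3, 3, 3, 0, 0, 0, 0], threshold=2))
```

Because the statistic is computed per position rather than per bin, the call is not limited by a matrix bin size.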

[0073] Breakpoints may be detected directly from soft- and hard-clipped reads. They may occur at a higher frequency compared to the chimeric ligation background. FIG. 11 shows soft and hard clipped reads, and various breakpoints that have been detected. For example, the chr1:117366151 breakpoint was detected at a frequency of 10.
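The idea of calling breakpoints from clipped reads can be sketched in Python as below. The sketch assumes alignments are supplied as hypothetical `(chromosome, 1-based leftmost position, CIGAR string)` tuples in the style of SAM records, and that a site is reported when its clip count reaches an assumed `min_support` cutoff chosen to exceed the chimeric ligation background.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference position

def clip_breakpoints(alignments, min_support):
    """Count clip sites and return those supported by at least `min_support` reads."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
        if not ops:
            continue
        if ops[0][1] in "SH":
            counts[(chrom, pos)] += 1  # clip at the left edge of the alignment
        if len(ops) > 1 and ops[-1][1] in "SH":
            ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
            counts[(chrom, pos + ref_len - 1)] += 1  # last aligned reference base
    return {site: n for site, n in counts.items() if n >= min_support}

# Toy alignments (coordinates are illustrative only).
reads = [("chr1", 100, "20S30M"), ("chr1", 100, "15H35M"),
         ("chr1", 50, "30M20S"), ("chr1", 100, "50M")]
print(clip_breakpoints(reads, min_support=2))
```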

[0074] In some cases, the CUSUM method may be concordant with Sanger validation, and may provide exact base-pair resolution. FIG. 12 displays a comparison between the matrix visualization and the CUSUM/clipping methods and also compares these methods to Sanger validation. The CUSUM/clipping method gives results at higher base pair resolution and closer to the Sanger results than the matrix visualization method.

[0075] The CUSUM method may also provide methods for handling cases without an exact breakpoint and handling multiple hits and/or ties. The CUSUM method may also comprise an in silico BLAT validation/visualization step.

[0076] Methods of the disclosure may generate assembled genomes which may be subjected to additional processes of refinement to improve accuracy of the assembled genome. For example, an assembled genome may be generated, and sequencing data may be aligned or mapped to the assembled genome. The assembled genome may initially be constructed using sequencing reads from a subject that are processed to identify a plurality of k-mers, and subsequently constructed using reference genome sequences that correspond to the k-mers. The same sequencing reads may then be mapped to the newly assembled genome. The reference genome panels may lack specific variants present in a target genome, and as such, these variants may fail to be identified. Mapping the sequencing reads to the newly assembled genome may allow for discovery of these variants. Additionally, by mapping the sequence reads to the newly assembled genome, potential biases relating to variant detection may be reduced or eliminated as the sequence reads may be more similar to the sequence of the assembled genome as compared to a reference genome of the reference genome panel.

[0077] In certain embodiments, the methods disclosed herein comprise an in vitro technique to fix and capture associations among distant regions of a genome as needed for long-range linkage and phasing. In some cases, the method comprises constructing and sequencing an XLRP library to deliver very genomically distant read pairs. In some cases, the interactions primarily arise from the random associations within a single DNA fragment. In some examples, the genomic distance between segments can be inferred because segments that are near to each other in a DNA molecule interact more often and with higher probability, while interactions between distant portions of the molecule will be less frequent. Consequently, there is a systematic relationship between the number of pairs connecting two loci and their proximity on the input DNA. The disclosure can produce read pairs capable of spanning the largest DNA fragments in an extraction. The input DNA for this library had a maximum length of 150 kbp, which is the longest meaningful read pair observed from the sequencing data. This suggests that the present method can link still more genomically distant loci if provided larger input DNA fragments. By applying improved assembly software tools that are specifically adapted to handle the type of data produced by the present method, a complete genomic assembly may be possible.

[0078] Long-range linkage techniques can be combined with k-mer calling or identification. By analyzing long-range reads, relative distances of k-mers can be determined. Read pairs with two k-mers that are frequently sequenced together may be inferred to be closer in distance than k-mers that appear less frequently in read pairs together. In this way, an order of k-mers can be inferred from the long-range linkage data.
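One hedged way to turn co-occurrence counts into an order is a greedy chain, sketched below in Python: start from the most frequently linked pair, then repeatedly attach the unplaced k-mer that is most strongly linked to either end of the chain. This heuristic, the function name `order_from_links`, and the input format (a mapping from k-mer ID pairs to read-pair counts) are assumptions for illustration; the disclosure does not specify the ordering procedure, and the orientation of the resulting chain is arbitrary.

```python
def order_from_links(links):
    """Greedy ordering: grow a chain by attaching the k-mer most linked to an end."""
    if not links:
        return []

    def weight(a, b):
        return links.get((a, b), 0) + links.get((b, a), 0)

    nodes = sorted({node for pair in links for node in pair})
    # Seed the chain with the most frequently co-observed pair.
    chain = list(max(sorted(links), key=lambda pair: links[pair]))
    remaining = [node for node in nodes if node not in chain]
    while remaining:
        score, side, node = max(
            ((weight(end, n), side, n)
             for n in remaining
             for side, end in (("left", chain[0]), ("right", chain[-1]))),
            key=lambda candidate: candidate[0],
        )
        if score == 0:
            break  # the remaining k-mers share no read pairs with either chain end
        if side == "left":
            chain.insert(0, node)
        else:
            chain.append(node)
        remaining.remove(node)
    return chain

# Toy counts in which adjacent k-mers co-occur most often.
links = {("a", "b"): 10, ("b", "c"): 9, ("c", "d"): 8,
         ("a", "c"): 3, ("b", "d"): 2, ("a", "d"): 1}
print(order_from_links(links))
```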

[0079] In various embodiments, linkage information is generated, such as by analyzing sequencing reads generated via sample preparation using a proximity ligation method. For example, linkage information may be generated by analyzing sequence reads derived from Omni-C methods. The linkage information can be used to identify sequences that are on a same chromosome or haplotype. In the case of k-mers, the k-mer may correspond with a particular variant. The presence of two or more k-mers in sequence reads can indicate that the two or more associated variants are linked and may be part of a same chromosome or haplotype. By processing multiple sequencing reads, linkage information may be generated for a plurality of different k-mers (and a plurality of different variants). This linkage information may be used to generate a haplotype or chromosome sequence.

[0080] Linkage information may be represented as or compiled into a graph data structure. A graph data structure may comprise a set of nodes connected by edges (e.g., the edges represent relationships between the nodes). In an undirected graph, edges can be followed in any direction (e.g., B <-> C). In a directed graph, an edge may be followed in only one direction (e.g., C -> D). A generic example of a graph is given in FIG. 23.
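
As an illustration of the distinction between undirected and directed edges, the following is a minimal Python sketch of an adjacency-list graph. The class name `Graph` and its methods are hypothetical and chosen only for this example; the node labels B, C, and D follow the examples given above.

```python
from collections import defaultdict


class Graph:
    """Adjacency-list graph (illustrative only).

    `directed` controls whether an edge may be followed in one
    direction only or in both directions.
    """

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v]  # make sure v exists as a node even without out-edges
        if not self.directed:
            self.adj[v].add(u)

    def can_follow(self, u, v):
        """Return True if an edge can be followed from u to v."""
        return v in self.adj.get(u, set())


undirected = Graph(directed=False)
undirected.add_edge("B", "C")   # B <-> C

directed = Graph(directed=True)
directed.add_edge("C", "D")     # C -> D only
```

In this sketch, `undirected.can_follow("C", "B")` is true because the B/C edge can be followed either way, whereas `directed.can_follow("D", "C")` is false.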

[0081] In genomics, a commonly used graph structure is a string graph. In this type of graph, the different nodes may comprise different variable-length sequences of DNA. These sequences of DNA can be longer than the read length. The edges may comprise variable-sized overlaps of connected nodes, but with overlaps shorter than the read length. These graphs are frequently used for genome assembly. An example of this type of genomic string graph is given in FIG. 24.

[0082] A common type of graph structure used in genomics is a de Bruijn graph. In a de Bruijn graph, edges may comprise fixed-length k-mers of size k base pairs, where k is less than the read length. The nodes may comprise overlaps between these k-mers of length k-1 base pairs. A de Bruijn graph may be used for reference-free variant discovery, small genome assembly, and / or local de novo assembly. An example of how a de Bruijn graph is built is given in FIG. 25.
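
The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the construction shown in FIG. 25: the function name `build_de_bruijn`, the toy reads, and the choice to record each distinct k-mer once are all assumptions made for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph from reads.

    Each distinct k-mer is an edge; it connects its (k-1)-base prefix
    node to its (k-1)-base suffix node.
    Returns {prefix_node: [suffix_node, ...]}.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue  # record each distinct k-mer edge once
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads, k = 3 (shorter than the read length).
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
```

For these toy reads the distinct 3-mers are ACG, CGT, GTC, and TCA, so the graph is a single chain of 2-base nodes: AC, CG, GT, TC, CA.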

[0083] K-mers that are detected in the sequencing reads may be nodes in a graph data structure and may be connected to represent the linkage information. The graph data structure may be processed to identify an order of k-mers that represent a genome. For example, the presence of two k-mers on a sequence read that are denoted as linked may indicate that the k-mers are at adjacent locations on an ordered list. The graph data structure may indicate which k-mers are associated with one another as well as the frequency with which pairs of k-mers appear together in sequencing reads. The frequency with which a pair appears together may be used to determine adjacency of k-mers and order of k-mers in the target genome and may allow for an ordered list to be generated.
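
One way to picture how pair frequencies could yield an ordered list is the greedy heuristic sketched below in Python. It is an illustrative assumption, not the ordering procedure of the disclosure: the function names, the k-mer labels `k1` to `k4`, the counts, and the requirement that the caller supply a starting k-mer are all hypothetical.

```python
from collections import Counter


def count_links(linked_pairs):
    """Count how often each unordered pair of k-mers is observed together."""
    counts = Counter()
    for a, b in linked_pairs:
        if a != b:
            counts[frozenset((a, b))] += 1
    return counts


def greedy_order(counts, start):
    """Order k-mers by repeatedly stepping to the unvisited k-mer that
    co-occurs most often with the current one (a simple heuristic)."""
    nodes = {start}
    for pair in counts:
        nodes |= pair
    order, visited, current = [start], {start}, start
    while len(visited) < len(nodes):
        candidates = [(counts[frozenset((current, n))], n)
                      for n in nodes - visited]
        weight, nxt = max(candidates)
        if weight == 0:
            break  # no linkage evidence left to extend the order
        order.append(nxt)
        visited.add(nxt)
        current = nxt
    return order


# Toy linkage data: neighbouring k-mers are seen together more often.
pairs = ([("k1", "k2")] * 5 + [("k2", "k3")] * 4 + [("k3", "k4")] * 5
         + [("k1", "k3")] * 2 + [("k2", "k4")])
order = greedy_order(count_links(pairs), start="k1")
```

With these toy counts the heuristic recovers the order k1, k2, k3, k4. A practical implementation would also need to handle ties, repeats, and noisy long-range pairs, which this sketch ignores.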

[0084] An ordered list of k-mers can be used to assemble a genome. K-mer sequences may be associated or correspond with specific sequences in a reference genome. The presence of the k-mer sequence in a sequence read of a target genome can indicate that a sequence of the corresponding reference genome is present in the target genome. These sequences of the corresponding genomes can be added together and combined to generate an assembled genome.
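
The step from an ordered k-mer list to an assembled sequence can be sketched as a lookup followed by concatenation. In the Python sketch below, the mapping `kmer_to_segment` from each k-mer to a reference panel sequence is hypothetical toy data, and skipping consecutive k-mers that point to the same segment is an assumption made for the example rather than a requirement of the method.

```python
def assemble(ordered_kmers, kmer_to_segment):
    """Concatenate the reference segments that correspond to an ordered
    list of k-mers. K-mers with no known segment are skipped, and two
    consecutive k-mers that map to the same segment contribute it once."""
    pieces = []
    for kmer in ordered_kmers:
        segment = kmer_to_segment.get(kmer)
        if segment is None:
            continue  # k-mer is not represented in the (toy) panel
        if pieces and pieces[-1] == segment:
            continue  # same segment already placed by the previous k-mer
        pieces.append(segment)
    return "".join(pieces)


# Toy panel: each k-mer marks one short reference segment.
kmer_to_segment = {
    "ACG": "TTACGTT",
    "CGT": "TTACGTT",   # second k-mer marking the same segment
    "GGA": "CCGGAAC",
    "TCT": "GATCTAG",
}
assembled = assemble(["ACG", "CGT", "GGA", "TCT"], kmer_to_segment)
```

Here the assembled sequence is the three segments joined in the order given by the k-mer list.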

[0085] Another type of graph structure frequently used in genomics is a variation graph. In a variation graph, the nodes may comprise variable-length sequences of DNA, and the edges may comprise observed adjacencies between nodes. A variation graph may be used for alignment of related sequences or pangenome alignment. An example of a variation graph is given in FIG. 26.

[0086] Variation graphs may be stored in different data formats than those that are frequently used for next-generation sequencing data analysis. For example, instead of using FASTA for storing genome sequence(s), genome sequences may be stored in a Graphical Fragment Assembly (GFA) format for a variation graph. Instead of using a Burrows-Wheeler Transform (BWT) index genome for alignment, the index genome may be a Graph Burrows-Wheeler Transform (GBWT) for a variation graph. Instead of storing read alignments in a Binary Alignment Map (BAM) or Sequence Alignment / Map (SAM) format, read alignments may be stored in a Graph Alignment / Map (GAM) or Graph Alignment Format (GAF) format.
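
To make the GFA representation concrete, the following Python sketch reads the segment (`S`) and link (`L`) records of a small, made-up GFA 1 text and spells out the sequence of one path through the graph. The function names and the toy graph (a single SNV bubble with alleles C and T) are assumptions for illustration; the sketch handles only forward-orientation segments joined with zero-length (`0M`) overlaps.

```python
def parse_gfa(text):
    """Parse S (segment) and L (link) records from tab-delimited GFA 1 text.

    Returns (segments, links): segments maps segment name -> sequence;
    links is a list of (from, from_orient, to, to_orient) tuples.
    """
    segments, links = {}, []
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            links.append((fields[1], fields[2], fields[3], fields[4]))
    return segments, links


def spell_path(segments, path):
    """Concatenate forward-orientation segments along a path of names."""
    return "".join(segments[name] for name in path)


# Toy variation graph: ACGT, then C or T, then GGA.
example = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tC",
    "S\t3\tT",
    "S\t4\tGGA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
])
segments, links = parse_gfa(example)
hap_c = spell_path(segments, ["1", "2", "4"])
hap_t = spell_path(segments, ["1", "3", "4"])
```

The two paths through the bubble spell the two alleles of the same locus, which is the sense in which a variation graph stores related sequences together.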

[0087] In some cases, it may be useful to align data to multiple (pan) genomes in order to better represent genomic diversity. The Human Pangenome Reference Consortium (HPRC) is a project funded by the National Human Genome Research Institute (NHGRI) to sequence and assemble genomes from individuals from diverse populations in order to better represent the genomic landscape of diverse human populations. Analyzing sequencing data against a single reference genome (like hg38) may create reference biases, which can affect variant discovery, gene-disease association studies, and the accuracy of genetic analyses. In the HPRC, at least 47 phased, diploid assemblies, selected to represent global genetic diversity, have been completed. This data comprises both long and linked-read data. 29 have been produced by HPRC, including 1k Genomes lymphoblastoid cell lines, with parental WGS used for phasing. 18 assemblies have been produced by other groups. Pangenome graphs can be built using such packages as minigraph, Minigraph-Cactus, and PanGenome Graph Builder tools.

[0088] A high-level diagram of an HPRC pangenome graph is given in FIG. 27. Haplotypes and reference genomes may be broken apart into sequences of DNA (nodes) connected according to the order found within their respective genome. A suite of tools developed by UCSC Genome Institute and others (e.g., vg or pggb) can enable the construction and manipulation of pangenome graphs, graph-based sequencing alignment and variant calling, etc.

[0089] A toolset developed by the UCSC Genome Institute, the University of Tennessee, and others called “vg toolset” may be used in pangenome analysis. Some examples of commands are “vg construct” (builds pangenomes from FASTA / VCFs), “vg view” (converts graph to different formats), “vg giraffe / map / mpmap” (maps reads onto the graph), “vg augment” (adds variants to the graph), “vg pack / call” (calls variants in the graph for an original or augmented graph), “vg surject” (projects the graph onto a linear reference such as hg38 or T2T-CHM13), “vg rna” (constructs splicing graphs for pantranscriptomics), and “vg haplotypes” (samples the full graph for haplotypes consistent with k-mers extracted from input sample sequencing data).

[0090] In some embodiments, the methods and systems herein may phase pangenome graph nodes using data generated from the haplotype phasing methods described herein. A schematic of this method is shown in FIG. 28. Nodes in the graph may represent a nucleotide sequence of any length. A DNA sequence may be reconstructed by following a directed path through a set of nodes in the pangenome graph. Read sequences from the haplotype phasing methods described herein may be aligned onto the pangenome graph and represented as a directed path through a set of nodes. The first attempt of phasing the genome may use a MAX-CUT-like algorithm to phase bi-allelic nodes (or groups of nodes) into phase blocks. This may be similar to HapCUT2 but performed in graph space. As indicated by 3001, the haplotype phasing-linked nodes can move the “CC” node onto a specific haplotype path.
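
The idea of a MAX-CUT-like phasing step can be illustrated with a small local-search sketch in Python. This is not HapCUT2 and not the algorithm of the disclosure; it is a simplified stand-in under stated assumptions: each bi-allelic site carries allele 0 or 1, each linked fragment is a hypothetical dict of `{site: allele}`, and sites are flipped between the two haplotype sides whenever a flip makes more linked observations consistent.

```python
from collections import defaultdict
from itertools import combinations


def pair_weights(fragments):
    """+1 for each fragment showing the same allele at two sites (cis),
    -1 for each fragment showing different alleles (trans)."""
    w = defaultdict(int)
    for frag in fragments:
        for (i, a), (j, b) in combinations(sorted(frag.items()), 2):
            w[(i, j)] += 1 if a == b else -1
    return w


def phase(sites, fragments, max_rounds=100):
    """Greedy cut-style local search. Returns {site: allele on haplotype 1},
    normalised so that the first site carries allele 0."""
    neighbors = defaultdict(dict)
    for (i, j), weight in pair_weights(fragments).items():
        neighbors[i][j] = weight
        neighbors[j][i] = weight
    side = {s: 1 for s in sites}  # start with every site on the same side
    for _ in range(max_rounds):
        changed = False
        for s in sites:
            # Agreement of site s with its linked sites under the current
            # assignment; a negative value means flipping s helps.
            gain = sum(wt * side[s] * side[t]
                       for t, wt in neighbors[s].items())
            if gain < 0:
                side[s] = -side[s]
                changed = True
        if not changed:
            break
    first = side[sites[0]]
    return {s: 0 if side[s] == first else 1 for s in sites}


# Toy data: fragments support haplotypes 0-1-1 and 1-0-0 over sites 1..3.
fragments = [{1: 0, 2: 1}, {2: 1, 3: 1}, {1: 1, 2: 0}, {2: 0, 3: 0}]
haplotype = phase([1, 2, 3], fragments)
```

For the toy fragments, the search settles on haplotype 1 carrying alleles 0, 1, 1 at sites 1, 2, 3, with the complementary alleles on the other haplotype. In graph space, the sites would correspond to bi-allelic nodes or groups of nodes rather than positions on a linear reference.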

[0091] FIG. 29 shows more information about an unphased haplotype in the ovarian tumor diploid genome (“Hap1”) as compared to a second unphased haplotype (“Hap2”) and the reference genome hg38. The haplotype is on contig #10 / chr12. The contigs inherited telomeric ends from the T2T-CHM13 genome. 30,819 SNVs and 3,806 indels in chr12 of hg38 were not found in the chr12s of either Hap1 or Hap2.

[0092] One example application of pangenome graphs and haplotype phasing is building custom databases of phased personalized genomes. Sequencing data may be collected on a specific cohort of samples, and phased diploid genomes may be produced and stored in a database, enabling cohort-level analyses (e.g., responders vs. non-responders). This may be possible without using long read sequencing data. Phased diploid references may be generated for each sample and stored in custom databases. A graph database may improve the identification of haplotypes and genetic variants that contribute to phenotypic differences. This procedure has a large benefit in terms of cost and time to generate compared to HiFi and ONT-based assembly. It may also improve efforts to effectively target drugs to population(s) that will most benefit.

[0093] Another application is cancer genome analysis using a pangenome graph, which could be useful for cancer researchers and clinical testing companies. Single patient proximity ligation data may be collected, and a personalized, phased diploid genome capturing all somatic variations may be generated. A cloud-based solution for the construction and analysis of personalized genomes may be used that could return a report on proximity ligation data input. The cloud-based solution may also provide a cloud-hosted portal for interacting with personalized genome data. This application may allow for germline genome imputation, allowing the user to filter for somatic mutations without requiring a matched-normal dataset. It thus may be possible to perform cancer genome reconstruction on a personalized diploid genome. This may allow the creation of a platform for delivering truly personalized precision medicine.

[0094] Another application is haplotype-specific chromatin conformation, which could be useful to basic and cancer researchers as well as biopharmaceutical companies. An individual’s proximity ligation data could be used to produce a personalized, phased diploid genome with haplotype-specific chromatin conformation analysis. Epigenetics workflows could be performed on the personalized diploid genome. Additionally, differential analyses (e.g., locations of TADs / loops) comparing the two haplotypes could be performed. It may also be possible to use RNA-seq data to identify genes with imbalanced allelic expression and link those changes to allele-specific differences in 3D chromatin structure. This may allow for greater understanding of biological mechanisms for phenotype (e.g., disease, drug efficacy / biomarkers).

Long Range Haplotype Phasing

[0095] Disclosed herein are methods for generating read sets, including phased read-sets, for applications including genome assembly and haplotype phasing, using long-read or short-read sequencing technologies. Exemplary techniques include but are not limited to proximity ligation techniques such as Hi-C, Chicago, Micro-C, and Omni-C. The present disclosure provides methods and systems for genome assembly that may allow for a faster, easier workflow, high quality Hi-C data, and scaffolding and haplotype-resolved assemblies. Nucleic acid molecules can be bound (e.g., in a chromatin structure), cleaved to expose internal ends, re-attached at junctions to other exposed ends, freed from binding, and sequenced. This technique can produce nucleic acid molecules comprising multiple sequence segments. The multiple sequence segments within a nucleic acid molecule can have phase information preserved while being rearranged relative to their natural or starting position and orientation. Sequence segments on either side of a junction can be confidently considered to come from the same phase of a sample nucleic acid molecule. In an example, the method may use steps of crosslinking a chromatin structure, fragmenting the nucleic acids in the chromatin, ligating, or otherwise connecting exposed ends, reversing the crosslinking, and sequencing the resulting nucleic acids. Linkage of read information from two or more connected regions of a nucleic acid generated in this way can indicate that read information came from the same original nucleic acid molecule, allowing binning of maternal and paternal reads, and enabling phased blocks of sequence even spanning an entire chromosome. In some cases, the methods and systems described herein may capture linked information, enabling contig scaffolding and phasing. In some cases, the methods and systems described herein may allow for chromatin conformation (topology) analysis. In some cases, the methods and systems described herein may detect chromatin looping and other interactions.

[0096] Nucleic acid molecules, including high molecular weight DNA, can be bound or immobilized on at least one nucleic acid binding moiety. For example, DNA assembled into in vitro chromatin aggregates and fixed with formaldehyde treatment is consistent with methods herein. Nucleic acid binding or immobilizing approaches include, but are not limited to, in vitro or reconstituted chromatin assembly, native chromatin, DNA-binding protein aggregates, nanoparticles, DNA-binding beads, or beads coated using a DNA-binding substance, polymers, synthetic DNA-binding molecules or other solid or substantially solid affinity molecules. In some cases, the beads are solid phase reversible immobilization (SPRI) beads (e.g., beads with negatively charged carboxyl groups such as Beckman-Coulter Agencourt AMPure XP beads).

[0097] Nucleic acids bound to a nucleic acid binding moiety such as those described herein can be held such that a nucleic acid molecule having a first segment and a second segment separated on the nucleic acid molecule by a distance greater than a read distance on a sequencing device (10 kb, 50 kb, 100 kb or greater, for example) are bound together independent of their common phosphodiester bonds. Upon cleavage of such a bound nucleic acid molecule, exposed ends of the first segment and the second segment may ligate to one another. In some cases, the nucleic acid molecules are bound at a concentration such that there is little or no overlap between bound nucleic acid molecules on a solid surface, such that exposed internal ends of cleaved molecules are likely to re-ligate or become reattached only to exposed ends from other segments that were in phase on a common nucleic acid source prior to cleavage. Consequently, a DNA molecule can be cleaved, and cleaved exposed internal ends can be re-ligated, for example at random, without loss of phase information.

[0098] A bound nucleic acid molecule can be cleaved to expose internal ends through one of any number of enzymatic and non-enzymatic approaches. For example, a nucleic acid molecule can be digested using a restriction enzyme, such as a restriction endonuclease that leaves a single stranded overhang. MboI digest, for example, is suitable for this purpose, although other restriction endonucleases are contemplated. Lists of restriction endonucleases are available, for example, in most molecular biology product catalogues. Other non-limiting techniques for nucleic acid cleavage include using a transposase, tagmentation enzyme complex, topoisomerase, nonspecific endonuclease, DNA repair enzyme, RNA-guided nuclease, fragmentase, or alternate enzyme. Transposase, for example, can be used in combination with unlinked left and right borders to create a sequence-independent break in a nucleic acid that is marked by attachment of transposase-delivered oligonucleotide sequence. Physical means can also be used to generate cleavage, including mechanical means (e.g., sonication, shear), thermal means (e.g., temperature change), or electromagnetic means (e.g., irradiation, such as UV irradiation).

[0099] Immobilization of nucleic acids at this stage can keep the cleaved nucleic acid molecule fragments in close physical proximity, such that phase information for the initial molecule is preserved. A benefit of the fixation, e.g., to chromatin aggregates, is that separate regions of a common nucleic acid molecule can be held together independent of their phosphodiester backbone, such that their phase information is not lost upon cleavage of the phosphodiester backbone. This benefit is also conveyed through alternate scaffolds to which a nucleic acid molecule is attached prior to cleavage.

[0100] Optionally, single stranded “sticky” end overhangs are modified to prevent reannealing and re-ligation. For example, sticky ends are partially filled in, such as by adding one nucleotide and a polymerase. In this way, the entire single-stranded end cannot be filled in, but the end is modified to prevent re-ligation with a formerly complementary end. In the example of MboI digestion, which leaves a 5’ GATC overhang, only guanosine nucleotide triphosphate is added. This results in only a “G” fill-in of the first complementary base (“C”) and results in a 5’ GAT overhang. This step renders the free sticky ends incompatible for re-ligation to one another but preserves sticky ends for downstream applications. Alternately, blunt ends are generated through completely filling in the overhangs, restriction digest with blunt-end generating enzymes, treatment with a single-strand DNA exonuclease, or nonspecific cleavage. In some cases, a transposase is used to attach adapter ends having blunt or sticky ends to the exposed internal ends of the DNA molecule.
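
The effect of the partial fill-in on the GATC overhang can be worked through with a short Python sketch. It is a toy model under simplifying assumptions: the overhang is written 5’ to 3’, the polymerase extends the recessed strand one base at a time starting opposite the 3’-most overhang base, and extension stops at the first template base whose complement is not among the supplied nucleotides. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partial_fill_in(overhang_5to3, available_dntps):
    """Return the 5' overhang left after fill-in with a limited set of
    nucleotides. The base nearest the duplex (3'-most) is copied first."""
    remaining = list(overhang_5to3)
    while remaining and COMPLEMENT[remaining[-1]] in available_dntps:
        remaining.pop()  # this template base is now double-stranded
    return "".join(remaining)


def is_self_complementary(overhang):
    """True if the overhang equals its own reverse complement, i.e. two
    copies of the same overhang could anneal to each other."""
    revcomp = "".join(COMPLEMENT[b] for b in reversed(overhang))
    return overhang == revcomp


after_g_only = partial_fill_in("GATC", {"G"})                   # G only
after_all_four = partial_fill_in("GATC", {"A", "C", "G", "T"})  # full fill-in
```

With only G supplied, the C nearest the duplex is copied and extension stalls at T, leaving a GAT overhang that, unlike GATC, is not self-complementary. Supplying all four nucleotides leaves no overhang, corresponding to the blunt-end alternative.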

[0101] Optionally, a “punctuation oligonucleotide” is introduced. This punctuation oligonucleotide marks cleavage / re-ligation sites. Some punctuation oligonucleotides have single-stranded overhangs on both ends that are compatible with the partially filled-in overhangs generated on the exposed nucleic acid sample internal ends. An example of a punctuation oligonucleotide is shown below. In some cases, the double-stranded oligonucleotide having single-stranded overhangs is modified, such as by 5’ phosphate removal at its 5’ ends, so that it cannot form concatemers during ligation. Alternately, blunt punctuation oligonucleotides are used, or cleavage sites are not marked using a distinct punctuation oligonucleotide. In some systems, such as when a transposase is used, punctuation is accomplished through addition of transpososome border sequences, followed by ligation of border sequences to one another or to a punctuation oligo. An exemplary punctuation oligo is presented below. However, alternate punctuation oligos are consistent with the disclosure herein, varying in sequence, length, overhang presence or sequence, or modification such as 5’ de-phosphorylation.

[0102] In some cases, the double-stranded region of the punctuation oligonucleotide will vary. A relevant feature of the punctuation oligonucleotide is the sequence of its overhang, allowing ligation to the nucleic acid sample but optionally modified precluding auto-ligation or concatemer formation. It is often preferred that the punctuation oligonucleotide comprise sequence that does not occur or is less likely to occur in a target nucleic acid molecule, such that it is easily identified in a downstream sequence reaction. Punctuation oligos are optionally barcoded, for example with a known barcode sequence or with a randomly generated unique identifier sequence. Unique identifier sequences can be designed to make it highly unlikely for multiple junctions in a nucleic acid molecule or in a sample to be barcoded with the same unique identifier.

[0103] Cleaved ends can be attached to one another directly or through an oligo (e.g., a punctuation oligo), for example using a ligase or similar enzyme. Ligation can proceed such that the free single-stranded ends of an immobilized high-molecular weight nucleic acid molecule are ligated directly or to the punctuation oligonucleotide. Because the punctuation oligonucleotide, if utilized, can have two ligatable ends, this ligation can effectively chain regions of the high molecular weight nucleic acid molecule together. Alternative approaches resulting in affixing a punctuating sequence or molecule between two exposed ends can also be employed, as can approaches for directly connecting two exposed ends without punctuation.

[0104] Nucleic acids can then be liberated from the nucleic acid binding moiety. In the case of in vitro chromatin aggregates, this can be accomplished by reversing the cross-links, or digesting the protein components, or both reversing the crosslinking and digesting protein components. A suitable approach is treatment of complexes with proteinase K, though many alternatives are also contemplated. For other binding techniques, suitable methods can be employed, such as the severing of linker molecules or the degradation of a substrate.

[0105] Nucleic acid molecules resulting from such techniques can have a variety of relevant features. Sequence segments within a nucleic acid molecule can be rearranged relative to their natural or starting positions and orientations, but with phase information preserved. Consequently, sequence segments on either side of a junction can be confidently assigned to a common phase of a common sample molecule. Thus, segments far removed from one another on a molecule can be, by such techniques, brought together or in proximity such that portions or the entirety of each segment is sequenced in a single run of a single molecule sequencing device, allowing definitive phase assignment. Alternately, in some cases originally adjacent segments can become separated from one another in the resultant nucleic acid. In some cases, the nucleic acid molecules can be re-ligated such that at least about 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, 99%, 99.9%, 99.99%, 99.999%, or 100% of re-ligations are between segments that were in phase on a common nucleic acid source prior to cleavage.

[0106] Another relevant feature of the resultant molecules is that, in some cases, most or all the original molecular sequence is preserved, though perhaps rearranged, in the final punctuated or rearranged molecule. For example, in some cases no more than 1%, 2%, 3%, 4%, 5%, 10%, 15%, or 20% of the original molecule is lost in producing the resultant molecule or molecules. Consequently, in addition to being useful as a phase determinant, the resultant molecule retains a substantial proportion of the original molecule sequence, such that the resultant molecule is optionally used to concurrently generate sequence information such as contig information useful in de novo sequencing or as independent verification of previously generated contig information.

[0107] Another feature of libraries of some resultant molecules is that cleavage junctions are not common to multiple members of a population of resultant molecules. That is, different copies of the same starting nucleic acid molecule can end up with different patterns of junction and rearrangement. Random cleavage junctions can be generated with a non-specific cleavage molecule, or through variation in restriction endonuclease selection or digestion parameters.

[0108] A consequence of having molecule-specific cleavage sites is that in some cases punctuation oligonucleotides are optionally excluded from the process that results in the ‘punctuation molecule’ reshuffling and re-ligation to no ill effect. By aligning segments of three or more reshuffled molecules, one observes that cleavage sites are readily identified by their absence in the majority of other members of a library. That is, when three or more reshuffled molecules are locally aligned, a segment can be found to be common to all of the molecules, but the edges of the segment can vary among the molecules. By noting where segment local sequence similarity ends, one can map cleavage junctions in an ‘unpunctuated’ rearranged nucleic acid molecule.

[0109] The resulting nucleic acid molecules can be sequenced, for example on a long-read sequencer. The resulting sequence reads contain segments that alternate between nucleic acid sequence from the original input molecule and, if they are used, sequences of the punctuation oligo. These reads can be processed by a computer to split sequence data from each read using the punctuation oligonucleotide sequence, or are otherwise processed to identify junctions. The sequence segments within each read can be segments from a single input high molecular weight DNA molecule. The original nucleic acid molecule can comprise a genome sequence or fraction thereof, such as a chromosome. The sets of segment reads can be discontinuous in the original nucleic acid molecule but reveal long-range, haplotype-phased data. These data can be used for de novo genome assembly and phasing heterozygous positions in the input genome. Sequence between junctions indicates contiguous nucleic acid sequence in the source nucleic acid sample, while sequence across a junction is indicative of a nucleic acid segment that is in phase in the nucleic acid sample but that may be far removed in the arranged scaffold from the adjacent segment.
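
Splitting a read at punctuation sequences can be sketched as follows in Python. The oligo sequence `GGTTCCAAGG` is a made-up placeholder, not the punctuation oligonucleotide of the disclosure, and the sketch assumes exact matches on either strand; real long reads would likely need error-tolerant matching.

```python
import re


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_on_punctuation(read, oligo):
    """Split a read into in-phase segments at each exact occurrence of the
    punctuation oligo or its reverse complement; empty pieces are dropped."""
    variants = sorted({oligo, reverse_complement(oligo)})
    pattern = "|".join(re.escape(v) for v in variants)
    return [segment for segment in re.split(pattern, read) if segment]


PUNCTUATION = "GGTTCCAAGG"  # hypothetical placeholder sequence
read = ("ACGTACGT" + PUNCTUATION + "TTGACA"
        + reverse_complement(PUNCTUATION) + "CCATG")
segments = split_on_punctuation(read, PUNCTUATION)
```

The three segments recovered from this toy read would be treated as coming from the same input molecule, and therefore the same phase, even if they map to distant positions.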

[0110] Junctions can be identified by a variety of approaches. If punctuation oligos are used, junctions can be identified at reads containing the punctuation oligo sequence. Alternately, junctions can be identified by comparison to a second sequence source (and, preferably, a third sequence source) for a nucleic acid molecule, such as a previously generated contig sequence dataset or a second, independently generated DNA chain molecule having independently derived junctions. As the sequence is aligned, for example, the quality or confidence of alignment to a particular location can indicate where one segment ends and another begins. If restriction enzymes are used to generate cleavages, sequences containing the restriction enzyme recognition site can be evaluated for potentially containing a junction. Note that not every restriction enzyme recognition site may contain a junction, as some restriction enzyme recognition sites may not have been physically accessible by the enzyme while the nucleic acid was bound to the support, for example. Statistical information can also be employed in identifying junctions; for example, the length of segments between junctions may be predicted to be of a certain average value or to follow a certain distribution.
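
For the restriction-enzyme case, a first pass at junction finding could simply list every occurrence of the recognition site in a read as a candidate to be evaluated further. The Python sketch below does this for the MboI site GATC; the function name and the toy read are assumptions, and the sketch does not model how fill-in or ligation may alter the sequence at a true junction.

```python
import re


def candidate_junctions(read, site="GATC"):
    """Return 0-based start positions of every occurrence of a recognition
    site, including overlapping ones. These are only candidates: a site may
    be present without having been cleaved and re-ligated."""
    return [m.start() for m in re.finditer(f"(?={re.escape(site)})", read)]


positions = candidate_junctions("AAGATCTTGATCGATC")
```

Each candidate position would then be checked against other evidence, such as alignment confidence on either side of the site or the expected distribution of segment lengths.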

[0111] A benefit of the manipulations herein is that they can preserve molecular phase information while bringing nonadjacent regions of the molecule in proximity such that they are included in a single nucleic acid molecule at a distance suitable for sequencing in a single read, such as a long read. Thus, regions that are separated in the starting sample by greater than the distance of a single long read operation (for example 10 kb, 15 kb, 20 kb, 30 kb, 50 kb, 100 kb or greater) are brought into local proximity such that they are within the distance covered by a single read of a long-range sequencing reaction. Thus, regions that are separated by more than the range of the sequencing technology for a single read in the original sample are read in a single reaction in the phase-preserved, rearranged molecule.

[0112] Resultant rearranged molecules can be sequenced, and their sequence information mapped to independently or concurrently generated sequence reads or contig information, or to a known reference genome sequence (for example, the known sequence of the human genome). Segments adjacent on the resultant rearranged molecule reads are presumed to be in phase. Accordingly, when these segments are mapped to disparate contigs or long range sequence reads, the reads are assigned to a common phase of a common molecule in the sequence assembly.

[0113] Alternately, if multiple independently generated resultant rearranged molecules are sequenced concurrently, phased sample data is optionally generated from these molecules alone, such that segment sequences separated by junctions are inferred to be in phase, while sequences not separated by junctions are inferred to represent stretches of nucleic acids contiguous in the sample itself and useful for, for example, de novo sequence determination as well as being useful for phase determination. However, additionally or as an alternative, multiple independently generated resultant rearranged molecules sequenced concurrently can still be compared to independently generated scaffold or contig information.

[0114] Methods and compositions presented herein can preserve long-range phase information, particularly for molecule segments separated by greater than the length of a read in a sequencing technology (10 kb, 20 kb, 50 kb, 100 kb, 500 kb or greater, for example), while providing such nonadjacent segments in a rearranged or often ‘punctuated’ molecule where the segments are adjacent or close enough to be covered by a single read.

[0115] In some instances, resultant rearranged molecules are combined with native molecules for sequencing. The native molecules can be recognized and utilized informatically by the lack of punctuation sequences, if employed. Native molecules are sequenced using short or long read technology, and their assembly is guided by the phase information and segment sequence information generated through sequencing of the rearranged molecule or library.

Haplotype Phasing

[0116] Extremely high phasing accuracy can be achieved by the data produced using the methods and compositions of the disclosure. In comparison to previous methods, the methods described herein can phase a higher proportion of the variants. Phasing can be achieved while maintaining high levels of accuracy. The techniques herein can allow for phasing at an accuracy of greater than about 70%, 80%, 90%, 95%, 96%, 97%, 98%, 99%, 99.9%, 99.99%, or 99.999%. The techniques herein can allow for accurate phasing with less than about 500x sequencing depth, 450x sequencing depth, 400x sequencing depth, 350x sequencing depth, 300x sequencing depth, 250x sequencing depth, 200x sequencing depth, 150x sequencing depth, 100x sequencing depth, or 50x sequencing depth. This phase information can be extended to longer ranges, for example, greater than about 200 kbp, about 300 kbp, about 400 kbp, about 500 kbp, about 600 kbp, about 700 kbp, about 800 kbp, about 900 kbp, about 1 Mbp, about 2 Mbp, about 3 Mbp, about 4 Mbp, about 5 Mbp, or about 10 Mbp. In some embodiments, more than 90% of the heterozygous SNPs for a human sample can be phased at an accuracy greater than 99% using less than about 250 million reads or read pairs, e.g., by using only 1 lane of Illumina HiSeq data. In other cases, more than about 40%, 50%, 60%, 70%, 80%, 90%, 95%, or 99% of the heterozygous SNPs for a human sample can be phased at an accuracy greater than about 70%, 80%, 90%, 95%, 96%, 97%, 98%, 99%, 99.9%, 99.99%, or 99.999% using less than about 250 million or about 500 million reads or read pairs, e.g., by using only 1 or 2 lanes of Illumina HiSeq data. For example, more than 95% or 99% of the heterozygous SNPs for a human sample can be phased at an accuracy greater than about 95% or 99% using less than about 250 million or about 500 million reads. In further cases, additional variants can be captured by increasing the read length to about 200 bp, 250 bp, 300 bp, 350 bp, 400 bp, 450 bp, 500 bp, 600 bp, 800 bp, 1000 bp, 1500 bp, 2 kbp, 3 kbp, 4 kbp, 5 kbp, 10 kbp, 20 kbp, 50 kbp, or 100 kbp.

[0117] Methods herein can further comprise subjecting the plurality of segments to size selection to obtain a plurality of selected segments. The size selection herein can include any suitable range of segment sizes.

[0118] Cleaving in methods provided herein can be done using any suitable method, for example by using a nuclease or a deoxyribonuclease (DNase). In some cases, the DNase comprises DNase I, DNase II, micrococcal nuclease, a restriction endonuclease, or a combination thereof.

[0119] Stabilized biological samples in methods herein can be stabilized by being treated with a stabilizing agent or a crosslinking reagent. In some cases, the crosslinking agent is a chemical fixative, such as formaldehyde, psoralen, disuccinimidyl glutarate (DSG), ethylene glycol bis(succinimidyl succinate) (EGS), ultraviolet light, or a combination thereof. In some cases, the crosslinking agent comprises chlormethine, cyclophosphamide, chlorambucil, uramustine, melphalan, bendamustine, bis(2-chloroethyl)ethylamine, bis(2-chloroethyl)methylamine, tris(2-chloroethyl)amine, ifosfamide, carmustine, lomustine, streptozocin, busulfan, cisplatin, carboplatin, dicycloplatin, eptaplatin, lobaplatin, miriplatin, nedaplatin, oxaliplatin, picoplatin, satraplatin, triplatin tetranitrate, procarbazine, altretamine, dacarbazine, mitozolomide, temozolomide, mitomycin C, nitrous acid, formaldehyde, acetaldehyde, doxorubicin, daunorubicin, epirubicin, or idarubicin. In some cases, the crosslinking agent comprises an intercalating agent, an antibiotic, or a minor groove binding agent. The stabilized biological sample can be a crosslinked paraffin-embedded tissue sample. In some cases, the stabilized biological sample comprises a stabilized intact cell or a stabilized intact nucleus. In some cases, the method comprises lysing cells and/or nuclei in the stabilized biological sample. The cleaving step of methods herein can be conducted prior to lysis of the intact cell or the intact nucleus. In various aspects of methods herein, cleaving stabilized nucleic acids may be conducted using a transposase. In some cases, cleaving may be conducted in permeabilized cells. In some cases, cleaving may be conducted in permeabilized nuclei. In some cases, the transposase may be a Tn5, a Tn3, a Tn7, a sleeping beauty transposase, or a combination thereof. In some cases, the transposase may be a Tn5 transposase.

[0120] Methods herein can be conducted on stabilized biological samples comprising small numbers of cells. For example, in some cases, the stabilized biological sample comprises fewer than about 3,000,000 cells. The stabilized biological sample can comprise fewer than about 1,000,000 cells, fewer than about 500,000 cells, fewer than about 400,000 cells, fewer than about 300,000 cells, fewer than about 200,000 cells, fewer than about 100,000 cells, or fewer.

[0121] In aspects of methods herein, the method can further comprise obtaining at least some sequence on each side of the junction to generate a first read pair. In addition, the method can further comprise mapping the first read pair to a set of contigs; and determining a path through the set of contigs that represents an order and/or orientation to a genome. Alternatively, or in combination, the method can comprise mapping the first read pair to a set of contigs; and determining, from the set of contigs, a presence of a structural variant or loss of heterozygosity in the stabilized biological sample. Alternatively, or in combination, the method can further comprise mapping the first read pair to a set of contigs; and assigning a variant in the set of contigs to a phase. Alternatively, or in combination, the method can further comprise mapping the first read pair to a set of contigs; determining, from the set of contigs, a presence of a variant in the set of contigs; and conducting a step selected from one or more of: identifying a disease stage, a prognosis, or a course of treatment for the stabilized biological sample; selecting a drug based on the presence of the variant; or identifying a drug efficacy for the stabilized biological sample.
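
The idea of mapping read pairs to contigs and then determining a path through them can be sketched in a few lines. The example below is a simplified illustration, not the method of the disclosure: it assumes each read pair has already been mapped and reduced to a pair of contig names, counts the read pairs linking each pair of contigs, and orders contigs with a simple greedy walk. The function names `count_contig_links` and `greedy_order` are hypothetical, and contig orientation is ignored.

```python
from collections import Counter


def count_contig_links(read_pair_hits):
    """Count read pairs whose two reads map to different contigs.

    read_pair_hits: iterable of (contig_of_read1, contig_of_read2).
    """
    links = Counter()
    for a, b in read_pair_hits:
        if a != b:  # pairs within one contig carry no ordering information
            links[tuple(sorted((a, b)))] += 1
    return links


def greedy_order(links, start):
    """Walk from `start`, always stepping to the unvisited contig with most links."""
    order = [start]
    visited = {start}
    while True:
        current = order[-1]
        candidates = {}
        for (a, b), n in links.items():
            if a == current and b not in visited:
                candidates[b] = n
            elif b == current and a not in visited:
                candidates[a] = n
        if not candidates:
            return order
        # sorted() makes tie-breaking deterministic (alphabetical).
        nxt = max(sorted(candidates), key=candidates.get)
        order.append(nxt)
        visited.add(nxt)


# Toy input: c1-c2 and c2-c3 are strongly linked, c1-c3 only weakly.
hits = ([("c1", "c2")] * 5 + [("c3", "c2")] * 4
        + [("c1", "c3")] + [("c1", "c1")] * 10)
links = count_contig_links(hits)
path = greedy_order(links, "c1")
print(path)
```

A greedy walk is used here only because it is short; it can be misled by repeats or uneven coverage, which a practical scaffolder would need to handle.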

[0122] In aspects of methods herein, proximity ligation can be conducted with click chemistry, including copper-free click chemistry, such as with a DBCO-modified bridge oligonucleotide attached between each segment of a concatemer. Then concatemers can be joined, for example via dendrimers. To enrich for the ligated molecules, a feature of the bridge oligonucleotide can be targeted. In an example, a DBCO-containing oligonucleotide can be reacted with an azide-biotin moiety which can be isolated with a streptavidin substrate, such as beads. In another example, a DBCO-containing oligonucleotide can be reacted with an azide-modified NHS-S-S-dPEG4-biotin which comprises a disulfide bond; azide can be added to the NHS-S-S-dPEG4-biotin using an azido-PEG3-amine, and in order to isolate the nucleic acids for library preparation, this disulfide bond can be reduced, for example by using DTT and heating, for example heating at 70°C for about 10 minutes.

[0123] In aspects of methods herein, dendrimers with nucleic acid fragments contacted to them can be separated or isolated from the rest of the nucleic acids in the sample prior to proximity ligation of the nucleic acid fragments. This step can ensure that the concatemers formed by the proximity ligation comprise fragments that were contacted to the same dendrimer. This can mean that all the segments of a given concatemer were in proximity to each other in the original stabilized sample. Therefore, rather than just pairwise information about which nucleic acid regions were proximate to which other regions, such an approach can yield much more complex proximity information - e.g., that 3, 4, 5, 6, 7, 8, 9, 10, or more nucleic acid regions were all proximate to each other.

Sequencing

[0124] In various embodiments, sequencing reads are generated and processed. Sequencing reads can be paired end reads, single short reads, long reads (e.g., via nanopore, SMRT, HiFi) or any other suitable sequence read format. In some cases, sequencing reads are obtained using any suitable proximity ligation method such as Hi-C, Chicago, Micro-C, or Omni-C. In some cases, sequencing reads are obtained using Micro-C. The proximity ligation reads can be paired end reads, with each read of the pair providing sequence information from one side or the other of a proximity ligation site. The proximity ligation reads can be long reads, with one long read providing sequence information of both sides of a proximity ligation site, or even sequence information surrounding multiple proximity ligation sites in one concatemer.
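
When one long read spans several ligation sites in a concatemer, the segments it contains form a multi-way contact, which many downstream tools consume as pairwise contacts. The sketch below shows one way this expansion could be done, assuming each segment has already been mapped and reduced to a hypothetical `(chromosome, position)` tuple; the function name `pairwise_contacts` is illustrative and does not come from the disclosure.

```python
from itertools import combinations


def pairwise_contacts(concatemer_segments):
    """Expand one multi-segment concatemer into all unordered segment pairs.

    Duplicate segments are collapsed first so a segment is never paired
    with itself.
    """
    unique = sorted(set(concatemer_segments))
    return list(combinations(unique, 2))


# Toy concatemer with four mapped segments (assumed coordinates).
segments = [("chr1", 1000), ("chr1", 52000), ("chr1", 910000), ("chr2", 400)]
pairs = pairwise_contacts(segments)
print(len(pairs))  # a concatemer with n distinct segments yields n*(n-1)/2 pairs
```

Expanding to pairs discards the information that all segments were observed together, so the original multi-way grouping may be worth retaining alongside the pairs.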

[0125] In various embodiments, suitable sequencing methods described herein or otherwise known will be used to obtain sequence information from nucleic acid molecules within a sample. Sequencing can be accomplished through classic Sanger sequencing methods. Sequencing can also be accomplished using high-throughput systems, some of which allow detection of a sequenced nucleotide immediately after or upon its incorporation into a growing strand, i.e., detection of sequence in real time or substantially real time. In some cases, high-throughput sequencing generates at least 1,000, at least 5,000, at least 10,000, at least 20,000, at least 30,000, at least 40,000, at least 50,000, at least 100,000 or at least 500,000 sequence reads per hour; where the sequencing reads can be at least about 50, about 60, about 70, about 80, about 90, about 100, about 120, about 150, about 180, about 210, about 240, about 270, about 300, about 350, about 400, about 450, about 500, about 600, about 700, about 800, about 900, or about 1000 bases per read.

[0126] Sequencing can be whole-genome, with or without enrichment of particular regions of interest. Sequencing can be targeted to particular regions of the genome. Regions of the genome that can be enriched for or targeted include but are not limited to single genes (or regions thereof), gene panels, gene fusions, human leukocyte antigen (HLA) loci (e.g., Class I HLA-A, B, and C; Class II HLA-DRB1/3/4/5, HLA-DQA1, HLA-DQB1, HLA-DPA1, HLA-DPB1), exonic regions, exome, and other loci. Genomic regions can be relevant to immune response, immune repertoire, immune cell diversity, transcription (e.g., exome), cancers (e.g., BRCA1, BRCA2, panels of genes or regions thereof such as hotspot regions, somatic variants, SNVs, amplifications, fusions, tumor mutational burden (TMB), microsatellite instability (MSI)), cardiac diseases, inherited diseases, and other diseases or conditions. A variety of methods can be used to enrich for or target regions of interest, including but not limited to sequence capture. In some cases, Capture Hi-C (CHi-C) or CHi-C-like protocols are employed, employing a sequence capture step (e.g., by target enrichment array) before or after library preparation.

[0127] In some embodiments, high-throughput sequencing involves the use of technology available from Illumina, such as the Genome Analyzer IIX, MiSeq personal sequencer, or HiSeq systems, such as those using HiSeq 2500, HiSeq 1500, HiSeq 2000, or HiSeq 1000 machines. These machines use reversible terminator-based sequencing-by-synthesis chemistry. These machines can do 200 billion DNA reads or more in eight days. Smaller systems may be utilized for runs of 3, 2, or 1 days, or less.

[0128] In some embodiments, high-throughput sequencing involves the use of technology available from the ABI SOLiD System. This genetic analysis platform enables massively parallel sequencing of clonally amplified DNA fragments linked to beads. The sequencing methodology is based on sequential ligation with dye-labeled oligonucleotides.

[0129] The next generation sequencing can comprise ion semiconductor sequencing (e.g., using technology from Life Technologies (Ion Torrent)). Ion semiconductor sequencing can take advantage of the fact that when a nucleotide is incorporated into a strand of DNA, an ion can be released. To perform ion semiconductor sequencing, a high-density array of micromachined wells can be formed. Each well can hold a single DNA template. Beneath the well can be an ion sensitive layer, and beneath the ion sensitive layer can be an ion sensor. When a nucleotide is added to a DNA, H+ can be released, which can be measured as a change in pH. The H+ ion can be converted to voltage and recorded by the semiconductor sensor. An array chip can be sequentially flooded with one nucleotide after another. No scanning, light, or cameras can be required. In some cases, an IONPROTON™ Sequencer is used to sequence nucleic acid. In some cases, an IONPGM™ Sequencer is used. The Ion Torrent Personal Genome Machine (PGM) can do 10 million reads in two hours.
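
The flow-based readout described above, in which the chip is flooded with one nucleotide after another and the released H+ is measured, can be illustrated with an idealized, noise-free simulation. In this sketch the signal for each flow is simply the number of bases incorporated, so a homopolymer run gives a proportionally larger value. The flow order `"TACG"`, the `max_flows` cap, and the function names are assumptions made for illustration; real instruments record voltage changes and must call bases from noisy signals.

```python
def simulate_flows(read_seq, flow_order="TACG", max_flows=400):
    """Return (base, incorporation_count) per flow for the strand being synthesized.

    read_seq is the sequence of the growing strand (i.e., the read).
    max_flows bounds the loop if read_seq contains a symbol not in flow_order.
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read_seq) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        count = 0
        # All consecutive matching bases incorporate within a single flow.
        while pos < len(read_seq) and read_seq[pos] == base:
            count += 1
            pos += 1
        signals.append((base, count))
        flow += 1
    return signals


def decode_flows(signals):
    """Rebuild the read from idealized per-flow incorporation counts."""
    return "".join(base * count for base, count in signals)


read = "TTACGGA"
flowgram = simulate_flows(read)
print(flowgram)
print(decode_flows(flowgram) == read)
```

Flows with a count of zero correspond to nucleotides that did not match the next template position and therefore released no H+.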

[0130] In some embodiments, high-throughput sequencing involves the use of technology available by Helicos BioSciences Corporation (Cambridge, Massachusetts) such as the Single Molecule Sequencing by Synthesis (SMSS) method. SMSS is unique because it allows for sequencing the entire human genome in up to 24 hours. Finally, SMSS is described in part in US Publication Application Nos. 20060024711; 20060024678; 20060012793; 20060012784; and 20050100932.

[0131] In some embodiments, high-throughput sequencing involves the use of technology available by 454 Lifesciences, Inc. (Branford, Connecticut) such as the PicoTiterPlate device which includes a fiber optic plate that transmits chemiluminescent signal generated by the sequencing reaction to be recorded by a CCD camera in the instrument. This use of fiber optics allows for the detection of a minimum of 20 million base pairs in 4.5 hours.

[0132] Methods for using bead amplification followed by fiber optics detection are described in Margulies, M., et al. “Genome sequencing in microfabricated high-density picolitre reactors,” Nature, doi:10.1038/nature03959; as well as in US Publication Application Nos. 20020012930; 20030068629; 20030100102; 20030148344; 20040248161; 20050079510; 20050124022; and 20060078909.

[0133] In some embodiments, high-throughput sequencing is performed using Clonal Single Molecule Array (Solexa, Inc.) or sequencing-by-synthesis (SBS) utilizing reversible terminator chemistry. These technologies are described in part in US Patent Nos. 6,969,488; 6,897,023; 6,833,246; 6,787,308; and US Publication Application Nos. 20040106110; 20030064398; 20030022207; and Constans, A., The Scientist 2003, 17(13):36.

[0134] The next generation sequencing technique can comprise real-time (SMRT™) technology by Pacific Biosciences. In SMRT, each of four DNA bases can be attached to one of four different fluorescent dyes. These dyes can be phospholinked. A single DNA polymerase can be immobilized with a single molecule of template single-stranded DNA at the bottom of a zero-mode waveguide (ZMW). A ZMW can be a confinement structure which enables observation of incorporation of a single nucleotide by DNA polymerase against the background of fluorescent nucleotides that can rapidly diffuse in and out of the ZMW (in microseconds). It can take several milliseconds to incorporate a nucleotide into a growing strand. During this time, the fluorescent label can be excited and produce a fluorescent signal, and the fluorescent tag can be cleaved off. The ZMW can be illuminated from below. Attenuated light from an excitation beam can penetrate the lower 20-30 nm of each ZMW. A microscope with a detection limit of 20 zeptoliters (20 × 10⁻²¹ liters) can be created. The tiny detection volume can provide a 1000-fold improvement in the reduction of background noise. Detection of the corresponding fluorescence of the dye can indicate which base was incorporated. The process can be repeated.

[0135] In some cases, the next generation sequencing is nanopore sequencing (see, e.g., Soni GV and Meller A. (2007) Clin Chem 53: 1996-2001). A nanopore can be a small hole, of the order of about one nanometer in diameter. Immersion of a nanopore in a conducting fluid and application of a potential across it can result in a slight electrical current due to conduction of ions through the nanopore. The amount of current which flows can be sensitive to the size of the nanopore. As a DNA molecule passes through a nanopore, each nucleotide on the DNA molecule can obstruct the nanopore to a different degree. Thus, the change in the current passing through the nanopore as the DNA molecule passes through the nanopore can represent a reading of the DNA sequence. The nanopore sequencing technology can be from Oxford Nanopore Technologies; e.g., a GridION system. A single nanopore can be inserted in a polymer membrane across the top of a microwell. Each microwell can have an electrode for individual sensing. The microwells can be fabricated into an array chip, with 100,000 or more microwells (e.g., more than 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, or 1,000,000) per chip. An instrument (or node) can be used to analyze the chip. Data can be analyzed in real-time. One or more instruments can be operated at a time. The nanopore can be a protein nanopore, e.g., the protein alpha-hemolysin, a heptameric protein pore. The nanopore can be a solid-state nanopore, e.g., a nanometer-sized hole formed in a synthetic membrane (e.g., SiNx, or SiO2). The nanopore can be a hybrid pore (e.g., an integration of a protein pore into a solid-state membrane). The nanopore can be a nanopore with an integrated sensor (e.g., tunneling electrode detectors, capacitive detectors, or graphene-based nano-gap or edge state detectors (see e.g., Garaj et al. (2010) Nature vol. 467, doi:10.1038/nature09379)). A nanopore can be functionalized for analyzing a specific type of molecule (e.g., DNA, RNA, or protein). Nanopore sequencing can comprise “strand sequencing” in which intact DNA polymers can be passed through a protein nanopore with sequencing in real time as the DNA translocates the pore. An enzyme can separate strands of a double-stranded DNA and feed a strand through a nanopore. The DNA can have a hairpin at one end, and the system can read both strands. In some cases, nanopore sequencing is “exonuclease sequencing” in which individual nucleotides can be cleaved from a DNA strand by a processive exonuclease, and the nucleotides can be passed through a protein nanopore. The nucleotides can transiently bind to a molecule in the pore (e.g., cyclodextrin). A characteristic disruption in current can be used to identify bases.

[0136] Nanopore sequencing technology from GENIA can be used. An engineered protein pore can be embedded in a lipid bilayer membrane. “Active Control” technology can be used to enable efficient nanopore-membrane assembly and control of DNA movement through the channel. In some cases, the nanopore sequencing technology is from NABsys. Genomic DNA can be fragmented into strands of average length of about 100 kb. The 100 kb fragments can be made single stranded and subsequently hybridized with a 6-mer probe. The genomic fragments with probes can be driven through a nanopore, which can create a current-versus-time tracing. The current tracing can provide the positions of the probes on each genomic fragment. The genomic fragments can be lined up to create a probe map for the genome. The process can be done in parallel for a library of probes. A genome-length probe map for each probe can be generated. Errors can be fixed with a process termed “moving window Sequencing By Hybridization (mwSBH).” In some cases, the nanopore sequencing technology is from IBM/Roche. An electron beam can be used to make a nanopore-sized opening in a microchip. An electrical field can be used to pull or thread DNA through the nanopore. A DNA transistor device in the nanopore can comprise alternating nanometer-sized layers of metal and dielectric. Discrete charges in the DNA backbone can get trapped by electrical fields inside the DNA nanopore. Turning off and on gate voltages can allow the DNA sequence to be read.

[0137] The next generation sequencing can comprise DNA nanoball sequencing (as performed, e.g., by Complete Genomics; see e.g., Drmanac et al. (2010) Science 327: 78-81). DNA can be isolated, fragmented, and size selected. For example, DNA can be fragmented (e.g., by sonication) to a mean length of about 500 bp. Adaptors (Ad1) can be attached to the ends of the fragments. The adaptors can be used to hybridize to anchors for sequencing reactions. DNA with adaptors bound to each end can be PCR amplified. The adaptor sequences can be modified so that complementary single strand ends bind to each other forming circular DNA. The DNA can be methylated to protect it from cleavage by a type IIS restriction enzyme used in a subsequent step. An adaptor (e.g., the right adaptor) can have a restriction recognition site, and the restriction recognition site can remain non-methylated. The non-methylated restriction recognition site in the adaptor can be recognized by a restriction enzyme (e.g., AcuI), and the DNA can be cleaved by AcuI 13 bp to the right of the right adaptor to form linear double-stranded DNA. A second round of right and left adaptors (Ad2) can be ligated onto either end of the linear DNA, and all DNA with both adaptors bound can be amplified (e.g., by PCR). Ad2 sequences can be modified to allow them to bind each other and form circular DNA. The DNA can be methylated, but a restriction enzyme recognition site can remain non-methylated on the left Ad1 adaptor. A restriction enzyme (e.g., AcuI) can be applied, and the DNA can be cleaved 13 bp to the left of the Ad1 to form a linear DNA fragment. A third round of right and left adaptors (Ad3) can be ligated to the right and left flank of the linear DNA, and the resulting fragment can be PCR amplified. The adaptors can be modified so that they can bind to each other and form circular DNA. A type III restriction enzyme (e.g., EcoP15) can be added; EcoP15 can cleave the DNA 26 bp to the left of Ad3 and 26 bp to the right of Ad2. This cleavage can remove a large segment of DNA and linearize the DNA once again. A fourth round of right and left adaptors (Ad4) can be ligated to the DNA, and the DNA can be amplified (e.g., by PCR) and modified so that the adaptors bind each other and form the completed circular DNA template.

[0138] Rolling circle replication (e.g., using Phi 29 DNA polymerase) can be used to amplify small fragments of DNA. The four adaptor sequences can contain palindromic sequences that can hybridize, and a single strand can fold onto itself to form a DNA nanoball (DNB™) which can be approximately 200-300 nanometers in diameter on average. A DNA nanoball can be attached (e.g., by adsorption) to a microarray (sequencing flow cell). The flow cell can be a silicon wafer coated with silicon dioxide, titanium and hexamethyldisilazane (HMDS) and a photoresist material. Sequencing can be performed by unchained sequencing by ligating fluorescent probes to the DNA. The color of the fluorescence of an interrogated position can be visualized by a high-resolution camera. The identity of nucleotide sequences between adaptor sequences can be determined.

[0139] In some embodiments, high-throughput sequencing can take place using AnyDot.chips (Genovoxx, Germany). In particular, the AnyDot.chips allow for 10x-50x enhancement of nucleotide fluorescence signal detection. AnyDot.chips and methods for using them are described in part in International Publication Application Nos. WO 02088382, WO 03020968, WO 03031947, WO 2005044836, PCT/EP 05/05657, PCT/EP 05/05655; and German Patent Application Nos. DE 101 49 786, DE 102 14 395, DE 103 56 837, DE 10 2004 009 704, DE 10 2004 025 696, DE 10 2004 025 746, DE 10 2004 025 694, DE 10 2004 025 695, DE 10 2004 025 744, DE 10 2004 025 745, and DE 10 2005 012 301.

[0140] Other high-throughput sequencing systems include those disclosed in Venter, J., et al. Science 16 February 2001; Adams, M. et al. Science 24 March 2000; and M. J. Levene, et al. Science 299:682-686, January 2003; as well as US Publication Application Nos. 20030044781 and 2006/0078937. Overall, such systems involve sequencing a target nucleic acid molecule having a plurality of bases by the temporal addition of bases via a polymerization reaction that is measured on a molecule of nucleic acid, i.e., the activity of a nucleic acid polymerizing enzyme on the template nucleic acid molecule to be sequenced is followed in real time. Sequence can then be deduced by identifying which base is being incorporated into the growing complementary strand of the target nucleic acid by the catalytic activity of the nucleic acid polymerizing enzyme at each step in the sequence of base additions. A polymerase on the target nucleic acid molecule complex is provided in a position suitable to move along the target nucleic acid molecule and extend the oligonucleotide primer at an active site. A plurality of labeled types of nucleotide analogs are provided proximate to the active site, with each distinguishable type of nucleotide analog being complementary to a different nucleotide in the target nucleic acid sequence. The growing nucleic acid strand is extended by using the polymerase to add a nucleotide analog to the nucleic acid strand at the active site, where the nucleotide analog being added is complementary to the nucleotide of the target nucleic acid at the active site. The nucleotide analog added to the oligonucleotide primer as a result of the polymerizing step is identified. The steps of providing labeled nucleotide analogs, polymerizing the growing nucleic acid strand, and identifying the added nucleotide analog are repeated so that the nucleic acid strand is further extended, and the sequence of the target nucleic acid is determined.

Nucleic Acids

[0141] In eukaryotes, genomic DNA is packed into chromatin to exist as chromosomes within the nucleus. The basic structural unit of chromatin is the nucleosome, which consists of 146 base pairs (bp) of DNA wrapped around a histone octamer. The histone octamer consists of two copies each of the core histone H2A-H2B dimers and H3-H4 dimers. Nucleosomes are regularly spaced along the DNA in what is commonly referred to as “beads on a string.”

[0142] The assembly of core histones and DNA into nucleosomes is mediated by chaperone proteins and associated assembly factors. Nearly all of these factors are core histone-binding proteins. Some of the histone chaperones, such as nucleosome assembly protein-1 (NAP-1), exhibit a preference for binding to histones H3 and H4. It has also been observed that newly synthesized histones are acetylated and then subsequently deacetylated after assembly into chromatin. The factors that mediate histone acetylation or deacetylation therefore play an important role in the chromatin assembly process.

[0143] In general, two in vitro methods have been developed for reconstituting or assembling chromatin. One method is ATP-independent, while the second is ATP-dependent. The ATP-independent method for reconstituting chromatin involves the DNA and core histones plus either a protein like NAP-1 or salt to act as a histone chaperone. This method results in a random arrangement of histones on the DNA that does not accurately mimic the native core nucleosome particle in the cell. These particles are often referred to as mononucleosomes because they are not regularly ordered, extended nucleosome arrays and the DNA sequence used is usually not longer than 250 bp (Kundu, T. K. et al., Mol. Cell 6: 551-561, 2000). To generate an extended array of ordered nucleosomes on a greater length of DNA sequence, the chromatin can be assembled through an ATP-dependent process.

[0144] The ATP-dependent assembly of periodic nucleosome arrays, which are similar to those seen in native chromatin, requires the DNA sequence, core histone particles, a chaperone protein and ATP-utilizing chromatin assembly factors. ACF (ATP-utilizing chromatin assembly and remodeling factor) and RSF (remodeling and spacing factor) are two widely researched assembly factors that are used to generate extended ordered arrays of nucleosomes into chromatin in vitro (Fyodorov, D.V., and Kadonaga, J.T. Methods Enzymol. 371: 499-515, 2003; Kundu, T. K. et al. Mol. Cell 6: 551-561, 2000).

[0145] Nucleic acid obtained from biological samples can be fragmented to produce suitable fragments for analysis. Template nucleic acids may be fragmented to a desired length using a variety of enzymatic methods. DNA may be randomly sheared by brief exposure to a DNase. RNA may be fragmented by brief exposure to an RNase, heat plus magnesium, or by shearing. The RNA may be converted to cDNA. If fragmentation is employed, the RNA may be converted to cDNA before or after fragmentation. Nucleic acid molecules may be single-stranded, double-stranded, or double-stranded with single-stranded regions (for example, stem- and loop-structures).

[0146] In particular embodiments, the methods of the disclosure can be easily applied to any type of fragmented double-stranded DNA including, but not limited to, for example, free DNA isolated from plasma, serum, and/or urine; apoptotic DNA from cells and/or tissues; and/or DNA fragmented enzymatically in vitro (for example, by DNase I). Stabilized nucleic acids can be fragmented so as to expose internal breaks for later reconnection so as to obtain nucleic acid configuration information for a particular cell. A number of fragmentation approaches are known and are consistent with the disclosure herein. Nucleic acids can be fragmented using one or more populations of restriction endonucleases, programmable endonucleases such as CRISPR/Cas molecules coupled to guide RNA, non-specific endonucleases (e.g., DNase), tagmentation, shearing, sonication, heating, or other mechanism. In some cases, the DNase is non-sequence specific. In some cases, the DNase is active for both single-stranded DNA and double-stranded DNA. In some cases, the DNase is specific for double-stranded DNA. In some cases, the DNase is preferential to double-stranded DNA. In some cases, the DNase is specific for single-stranded DNA. In some cases, the DNase is preferential to single-stranded DNA. In some cases, the DNase is DNase I. In some cases, the DNase is DNase II. In some cases, the DNase is selected from one or more of DNase I and DNase II. In some cases, the DNase is micrococcal nuclease. In some cases, the DNase is selected from one or more of DNase I, DNase II, and micrococcal nuclease. Other suitable nucleases are also within the scope of this disclosure.

[0147] In some embodiments, cross-linked DNA molecules may be subjected to a size selection step. Size selection of the nucleic acids may be performed to select cross-linked DNA molecules below or above a certain size. Size selection may further be affected by the frequency of cross-links and/or by the fragmentation method. In some embodiments, a composition may be prepared comprising a cross-linked DNA molecule in the range of about 145 bp to about 600 bp, about 100 bp to about 2500 bp, about 600 to about 2500 bp, about 350 bp to about 1000 bp, or any range bounded by any of these values (e.g., about 100 bp to about 2500 bp).
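
Computationally, the effect of a size selection step amounts to keeping fragments whose length falls inside a chosen window. The sketch below is a minimal in silico analogue, with default bounds taken from the "about 145 bp to about 600 bp" range above; the function name `size_select` and the toy fragment lengths are hypothetical, and a physical size selection (e.g., gel or column based) will not have such sharp cutoffs.

```python
def size_select(fragment_lengths, min_bp=145, max_bp=600):
    """Keep fragment lengths within [min_bp, max_bp], inclusive."""
    if min_bp > max_bp:
        raise ValueError("min_bp must not exceed max_bp")
    return [n for n in fragment_lengths if min_bp <= n <= max_bp]


# Toy fragment lengths in bp (assumed values for illustration).
lengths = [90, 145, 300, 600, 2500]

narrow = size_select(lengths)                        # about 145 bp to about 600 bp
broad = size_select(lengths, min_bp=100, max_bp=2500)  # about 100 bp to about 2500 bp
print(narrow, broad)
```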

[0148] In some embodiments, sample polynucleotides are fragmented into a population of fragmented DNA molecules of one or more specific size range(s). In some embodiments, fragments can be generated from at least about 1, about 2, about 5, about 10, about 20, about 50, about 100, about 200, about 500, about 1000, about 2000, about 5000, about 10,000, about 20,000, about 50,000, about 100,000, about 200,000, about 500,000, about 1,000,000, about 2,000,000, about 5,000,000, about 10,000,000, or more genome-equivalents of starting DNA. Fragmentation may be accomplished by DNase treatment. In some embodiments, the fragments have an average length from about 10 to about 10,000, about 20,000, about 30,000, about 40,000, about 50,000, about 60,000, about 70,000, about 80,000, about 90,000, about 100,000, about 150,000, about 200,000, about 300,000, about 400,000, about 500,000, about 600,000, about 700,000, about 800,000, about 900,000, about 1,000,000, about 2,000,000, about 5,000,000, about 10,000,000, or more nucleotides. In some embodiments, the fragments have an average length from about 145 bp to about 600 bp, about 100 bp to about 2500 bp, about 600 to about 2500 bp, about 350 bp to about 1000 bp, or any range bounded by any of these values (e.g., about 100 bp to about 2500 bp). In some embodiments, the fragments have an average length less than about 2500 bp, less than about 1200 bp, less than about 1000 bp, less than about 800 bp, less than about 600 bp, less than about 350 bp, or less than about 200 bp. In other embodiments, the fragments have an average length more than about 100 bp, more than about 350 bp, more than about 600 bp, more than about 800 bp, more than about 1000 bp, more than about 1200 bp, or more than about 2000 bp. Non-limiting examples of DNases include DNase I, DNase II, micrococcal nuclease, variants thereof, and combinations thereof. 
For example, digestion with DNase I can induce random double-stranded breaks in DNA in the absence of Mg++ and in the presence of Mn++. Fragmentation can produce fragments having 5’ overhangs, 3’ overhangs, blunt ends, or a combination thereof. In some embodiments, the method includes the step of size selecting the fragments via standard methods such as column purification or isolation from an agarose gel.

Targeted Nuclease Enzymes

[0149] Fragmented DNA as provided herein may be created or generated by digestion, such as by in situ digestion with any number of nucleases (e.g., restriction endonucleases) or DNases (e.g., MNase). In some cases, enzymes may be used in combination to achieve the desired digestion or fragmentation. In various cases, nucleases (or domains or fragments thereof) may be targeted to certain genomic sites using one or more antibodies. For example, the crosslinked sample may be contacted to an antibody that binds to certain regions of the DNA, such as a histone binding site, a transcription factor binding site, or a methylated DNA site. A nuclease linked or fused to an immunoglobulin binding protein or fragment thereof, such as a Protein A, a Protein G, a Protein A/G, or a Protein L, can then be added to the sample and the nuclease may digest the DNA only in the region where the antibody bound. This may be done in combination, for example, where a first antibody is bound to the DNA sample, then the nuclease is targeted to the first antibody, then a second antibody is bound to the DNA sample and the nuclease is targeted to the second antibody, and so on to achieve the desired digestion pattern.

Ligation

[0150] In some embodiments, the 5’ and/or 3’ end nucleotide sequences of fragmented DNA are not modified prior to ligation. For example, cleavage by an enzyme that leaves a predictable blunt end can be followed by ligation of blunt-ended DNA fragments to nucleic acids, such as adaptors, oligonucleotides, or polynucleotides, comprising a blunt end. In some embodiments, the fragmented DNA molecules are blunt-end polished (or “end repaired”) to produce DNA fragments having blunt ends, prior to being joined to adaptors. The blunt-end polishing step may be accomplished by incubation with a suitable enzyme, such as a DNA polymerase that has both 3’ to 5’ exonuclease activity and 5’ to 3’ polymerase activity, for example, T4 polymerase. In some embodiments, end repair can be followed by an addition of 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 or more nucleotides, such as one or more adenine, one or more thymine, one or more guanine, or one or more cytosine, to produce an overhang. For example, the end repair can be followed by an addition of 1, 2, 3, 4, 5, or 6 nucleotides. DNA fragments having an overhang can be joined to one or more nucleic acids, such as oligonucleotides, adaptor oligonucleotides, or polynucleotides, having a complementary overhang, such as in a ligation reaction. For example, a single adenine can be added to the 3’ ends of end repaired DNA fragments using a template-independent polymerase, followed by ligation to one or more adaptors each having a thymine at a 3’ end. In some embodiments, nucleic acids, such as oligonucleotides or polynucleotides, can be joined to blunt end double-stranded DNA molecules which have been modified by extension of the 3’ end with one or more nucleotides followed by 5’ phosphorylation. 
In some cases, extension of the 3’ end may be performed with a polymerase such as Klenow polymerase or any of the suitable polymerases provided herein, or by use of a terminal deoxynucleotidyl transferase, in the presence of one or more dNTPs in a suitable buffer that can contain magnesium. In some embodiments, target polynucleotides having blunt ends are joined to one or more adaptors comprising a blunt end. Phosphorylation of 5’ ends of DNA fragment molecules may be performed, for example, with T4 polynucleotide kinase in a suitable buffer containing ATP and magnesium. The fragmented DNA molecules may optionally be treated to dephosphorylate 5’ ends or 3’ ends, for example, by using enzymes such as phosphatases.

[0151] The terms “connecting,” “joining,” and “ligation” as used herein, with respect to two polynucleotides, such as an adaptor oligonucleotide and a target polynucleotide, refer to the covalent attachment of two separate DNA segments to produce a single larger polynucleotide with a contiguous backbone. Methods for joining two DNA segments include, without limitation, enzymatic and non-enzymatic (e.g., chemical) methods. Examples of ligation reactions that are non-enzymatic include the non-enzymatic ligation techniques described in U.S. Pat. Nos. 5,780,613 and 5,476,930, which are herein incorporated by reference. In some embodiments, an adaptor oligonucleotide is joined to a target polynucleotide by a ligase, for example, a DNA ligase or RNA ligase. Multiple ligases, each having characterized reaction conditions, include, without limitation, NAD+-dependent ligases including tRNA ligase, Taq DNA ligase, Thermus filiformis DNA ligase, Escherichia coli DNA ligase, Tth DNA ligase, Thermus scotoductus DNA ligase (I and II), thermostable ligase, Ampligase thermostable DNA ligase, VanC-type ligase, 9°N DNA Ligase, Tsp DNA ligase, and novel ligases discovered by bioprospecting; ATP-dependent ligases including T4 RNA ligase, T4 DNA ligase, T3 DNA ligase, T7 DNA ligase, Pfu DNA ligase, DNA ligase I, DNA ligase III, DNA ligase IV, and novel ligases discovered by bioprospecting; and wild-type, mutant isoforms, and genetically engineered variants thereof.

[0152] Ligation can be between DNA segments having hybridizable sequences, such as complementary overhangs. Ligation can also be between two blunt ends. Generally, a 5’ phosphate is utilized in a ligation reaction. The 5’ phosphate can be provided by the target polynucleotide, the adaptor oligonucleotide, or both. 5’ phosphates can be added to or removed from DNA segments to be joined, as needed. Methods for the addition or removal of 5’ phosphates include, without limitation, enzymatic and chemical processes. Enzymes useful in the addition and/or removal of 5’ phosphates include kinases, phosphatases, and polymerases. In some embodiments, both of the two ends joined in a ligation reaction (e.g., an adaptor end and a target polynucleotide end) provide a 5’ phosphate, such that two covalent linkages are made in joining the two ends. In some embodiments, only one of the two ends joined in a ligation reaction (e.g., only one of an adaptor end and a target polynucleotide end) provides a 5’ phosphate, such that only one covalent linkage is made in joining the two ends.
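The relationship between 5’ phosphates and the number of covalent linkages made at a junction can be illustrated with a minimal Python sketch. The function name linkages_formed and its boolean parameters are hypothetical and are not part of the disclosure; the model deliberately ignores 3’ end chemistry and ligase efficiency.

```python
def linkages_formed(adaptor_has_5p, target_has_5p):
    """Count the covalent linkages made when two double-stranded ends are joined.

    Illustrative model only: one linkage is counted for each end that
    provides a 5' phosphate, as described for the ligation reaction.
    """
    return int(bool(adaptor_has_5p)) + int(bool(target_has_5p))


# Enumerate the four possible phosphorylation states of the two ends.
for adaptor_5p in (True, False):
    for target_5p in (True, False):
        print(adaptor_5p, target_5p, linkages_formed(adaptor_5p, target_5p))
```

When neither end provides a 5’ phosphate, the sketch returns zero, corresponding to a junction that cannot be covalently joined without first adding a phosphate (for example, with a kinase).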

[0153] In some embodiments, only one strand at one or both ends of a target polynucleotide is joined to an adaptor oligonucleotide. In some embodiments, both strands at one or both ends of a target polynucleotide are joined to an adaptor oligonucleotide. In some embodiments, 3’ phosphates are removed prior to ligation. In some embodiments, an adaptor oligonucleotide is added to both ends of a target polynucleotide, wherein one or both strands at each end are joined to one or more adaptor oligonucleotides. When both strands at both ends are joined to an adaptor oligonucleotide, joining can be followed by a cleavage reaction that leaves a 5’ overhang that can serve as a template for the extension of the corresponding 3’ end, which 3’ end may or may not include one or more nucleotides derived from the adaptor oligonucleotide. In some embodiments, a target polynucleotide is joined to a first adaptor oligonucleotide on one end and a second adaptor oligonucleotide on the other end. In some embodiments, two ends of a target polynucleotide are joined to the opposite ends of a single adaptor oligonucleotide. In some embodiments, the target polynucleotide and the adaptor oligonucleotide to which it is joined comprise blunt ends. In some embodiments, separate ligation reactions can be carried out for each sample, using a different first adaptor oligonucleotide comprising at least one barcode sequence for each sample, such that no barcode sequence is joined to the target polynucleotides of more than one sample. A DNA segment or a target polynucleotide that has an adaptor oligonucleotide joined to it is considered “tagged” by the joined adaptor.
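The per-sample barcoding constraint described above (no barcode sequence joined to the target polynucleotides of more than one sample) can be sketched in Python as follows. The function names assign_barcodes and tag, the sample names, and the barcode sequences are all hypothetical and chosen only for illustration; representing a tagged molecule as a simple string concatenation is likewise an assumption.

```python
def assign_barcodes(samples, barcodes):
    """Give each sample its own barcode so that no barcode is shared."""
    if len(set(samples)) != len(samples):
        raise ValueError("sample names must be unique")
    if len(set(barcodes)) != len(barcodes):
        raise ValueError("barcode sequences must be unique")
    if len(barcodes) < len(samples):
        raise ValueError("need at least one barcode per sample")
    # zip stops at the shorter list, so surplus barcodes stay unused.
    return dict(zip(samples, barcodes))


def tag(fragment, barcode):
    """Model a 'tagged' target polynucleotide as barcode + fragment (illustrative)."""
    return barcode + fragment


assignment = assign_barcodes(["sample_A", "sample_B"], ["ACGTAC", "TGCATG"])
tagged = {name: tag("GATTACA", bc) for name, bc in assignment.items()}
print(tagged)
```

Because each sample receives a distinct barcode, tagged molecules pooled from independent samples can later be attributed to their sample of origin by reading the barcode prefix.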

[0154] In some cases, the ligation reaction can be performed at a DNA segment or target polynucleotide concentration of about 0.1 ng/µL, about 0.2 ng/µL, about 0.3 ng/µL, about 0.4 ng/µL, about 0.5 ng/µL, about 0.6 ng/µL, about 0.7 ng/µL, about 0.8 ng/µL, about 0.9 ng/µL, about 1.0 ng/µL, about 1.2 ng/µL, about 1.4 ng/µL, about 1.6 ng/µL, about 1.8 ng/µL, about 2.0 ng/µL, about 2.5 ng/µL, about 3.0 ng/µL, about 3.5 ng/µL, about 4.0 ng/µL, about 4.5 ng/µL, about 5.0 ng/µL, about 6.0 ng/µL, about 7.0 ng/µL, about 8.0 ng/µL, about 9.0 ng/µL, about 10 ng/µL, about 15 ng/µL, about 20 ng/µL, about 30 ng/µL, about 40 ng/µL, about 50 ng/µL, about 60 ng/µL, about 70 ng/µL, about 80 ng/µL, about 90 ng/µL, about 100 ng/µL, about 150 ng/µL, about 200 ng/µL, about 300 ng/µL, about 400 ng/µL, about 500 ng/µL, about 600 ng/µL, about 800 ng/µL, or about 1000 ng/µL. For example, the ligation can be performed at a DNA segment or target polynucleotide concentration of about 100 ng/µL, about 150 ng/µL, about 200 ng/µL, about 300 ng/µL, about 400 ng/µL, or about 500 ng/µL.

[0155] In some cases, the ligation reaction can be performed at a DNA segment or target polynucleotide concentration of about 0.1 to 1000 ng/µL, about 1 to 1000 ng/µL, about 1 to 800 ng/µL, about 10 to 800 ng/µL, about 10 to 600 ng/µL, about 100 to 600 ng/µL, or about 100 to 500 ng/µL.

[0156] In some cases, the ligation reaction can be performed for more than about 5 minutes, about 10 minutes, about 20 minutes, about 30 minutes, about 40 minutes, about 50 minutes, about 60 minutes, about 90 minutes, about 2 hours, about 3 hours, about 4 hours, about 5 hours, about 6 hours, about 8 hours, about 10 hours, about 12 hours, about 18 hours, about 24 hours, about 36 hours, about 48 hours, or about 96 hours. In other cases, the ligation reaction can be performed for less than about 5 minutes, about 10 minutes, about 20 minutes, about 30 minutes, about 40 minutes, about 50 minutes, about 60 minutes, about 90 minutes, about 2 hours, about 3 hours, about 4 hours, about 5 hours, about 6 hours, about 8 hours, about 10 hours, about 12 hours, about 18 hours, about 24 hours, about 36 hours, about 48 hours, or about 96 hours. For example, the ligation reaction can be performed for about 30 minutes to about 90 minutes. In some embodiments, joining of an adaptor to a target polynucleotide produces a joined product polynucleotide having a 3’ overhang comprising a nucleotide sequence derived from the adaptor.

[0157] In some embodiments, after joining at least one adaptor oligonucleotide to a target polynucleotide, the 3’ end of one or more target polynucleotides is extended using the one or more joined adaptor oligonucleotides as template. For example, an adaptor comprising two hybridized oligonucleotides that is joined to only the 5’ end of a target polynucleotide allows for the extension of the unjoined 3’ end of the target using the joined strand of the adaptor as template, concurrently with or following displacement of the unjoined strand. Both strands of an adaptor comprising two hybridized oligonucleotides may be joined to a target polynucleotide such that the joined product has a 5’ overhang, and the complementary 3’ end can be extended using the 5’ overhang as template. As a further example, a hairpin adaptor oligonucleotide can be joined to the 5’ end of a target polynucleotide. In some embodiments, the 3’ end of the target polynucleotide that is extended comprises one or more nucleotides from an adaptor oligonucleotide. For target polynucleotides to which adaptors are joined on both ends, extension can be carried out for both 3’ ends of a double-stranded target polynucleotide having 5’ overhangs. This 3’ end extension, or “fill-in” reaction, generates a complementary sequence, or “complement,” to the adaptor oligonucleotide template that is hybridized to the template, thus filling in the 5’ overhang to produce a double-stranded sequence region. Where both ends of a double-stranded target polynucleotide have 5’ overhangs that are filled in by extension of the complementary strands’ 3’ ends, the product is completely double-stranded. Extension can be carried out by any suitable polymerase, such as a DNA polymerase, many of which are commercially available. DNA polymerases can comprise DNA-dependent DNA polymerase activity, RNA-dependent DNA polymerase activity, or DNA-dependent and RNA-dependent DNA polymerase activity. 
DNA polymerases can be thermostable or non-thermostable. Examples of DNA polymerases include, but are not limited to, Taq polymerase, Tth polymerase, Tli polymerase, Pfu polymerase, Pfuturbo polymerase, Pyrobest polymerase, Pwo polymerase, KOD polymerase, Bst polymerase, Sac polymerase, Sso polymerase, Poc polymerase, Pab polymerase, Mth polymerase, Pho polymerase, ES4 polymerase, VENT polymerase, DEEPVENT polymerase, EX-Taq polymerase, LA-Taq polymerase, Expand polymerases, Platinum Taq polymerases, Hi-Fi polymerase, Tbr polymerase, Tfl polymerase, Tru polymerase, Tac polymerase, Tne polymerase, Tma polymerase, Tih polymerase, Tfi polymerase, Klenow fragment, and variants, modified products and derivatives thereof. 3’ end extension can be performed before or after pooling of target polynucleotides from independent samples.

Target Enrichment

[0158] In certain embodiments, the disclosure provides methods for the enrichment of nucleic acids in a genome. In some cases, the methods for enrichment are in a solution-based format. In some cases, the target nucleic acid can be labeled with a labeling agent. In other cases, the target nucleic acid can be crosslinked to one or more association molecules that are labeled with a labeling agent. Examples of labeling agents include, but are not limited to, biotin, polyhistidine tags, and chemical tags (e.g., alkyne and azide derivatives used in Click Chemistry methods). Further, the labeled target nucleic acid can be captured and thereby enriched by using a capturing agent. The capturing agent can be streptavidin and/or avidin, an antibody, a chemical moiety (e.g., alkyne, azide), and any biological, chemical, physical, or enzymatic agents used for affinity purification.

[0159] In some cases, immobilized or non-immobilized nucleic acid probes can be used to capture the target nucleic acids. For example, the target nucleic acids can be enriched from a sample by hybridization to the probes on a solid support or in solution. In some examples, the sample can be a genomic sample. In some examples, the probes can be an amplicon. The amplicon can comprise a predetermined sequence. Further, the hybridized target nucleic acids can be washed and / or eluted from the probes. The target nucleic acid can be a DNA, RNA, cDNA, or mRNA molecule.

[0160] In some cases, the enrichment method can comprise contacting the sample comprising the target nucleic acid to the probes and binding the target nucleic acid to a solid support. In some cases, the sample can be fragmented using enzymatic methods to yield the target nucleic acids. In some cases, the probes can be specifically hybridized to the target nucleic acids. In some cases, the target nucleic acids can have an average size of about 145 bp to about 600 bp, about 100 bp to about 2500 bp, about 600 to about 2500 bp, or about 350 bp to about 1000 bp. The target nucleic acids can be further separated from the unbound nucleic acids in the sample. The solid support can be washed and / or eluted to provide the enriched target nucleic acids. In some examples, the enrichment steps can be repeated for about 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 times. For example, the enrichment steps can be repeated for about 1, 2, or 3 times.

[0161] In some cases, the enrichment method can comprise providing probe-derived amplicons wherein said probes for amplification are attached to a solid support. The solid support can comprise support-immobilized nucleic acid probes to capture specific target nucleic acid from a sample. The probe-derived amplicons can hybridize to the target nucleic acids. Following hybridization to the probe amplicons, the target nucleic acids in the sample can be enriched by capturing (e.g., via capturing agents such as biotin, antibodies, etc.) and washing and/or eluting the hybridized target nucleic acids from the captured probes. The target nucleic acid sequence(s) may be further amplified using, for example, PCR methods to produce an amplified pool of enriched PCR products.

[0162] In some cases, the solid support can be a microarray, a slide, a chip, a microwell, a column, a tube, a particle, or a bead. In some examples, the solid support can be coated with streptavidin and / or avidin. In other examples, the solid support can be coated with an antibody. Further, the solid support can comprise a glass, metal, ceramic or polymeric material. In some embodiments, the solid support can be a nucleic acid microarray (e.g., a DNA microarray). In other embodiments, the solid support can be a paramagnetic bead.

[0163] In particular embodiments, the disclosure provides methods for amplifying the enriched DNA. In some cases, the enriched DNA is a read-pair. The read-pair can be obtained by the methods of the present disclosure.

[0164] In some embodiments, the one or more amplification and/or replication steps are used for the preparation of a library to be sequenced. Any suitable amplification method may be used. Examples of amplification techniques that can be used include, but are not limited to, quantitative PCR, quantitative fluorescent PCR (QF-PCR), multiplex fluorescent PCR (MF-PCR), real time PCR (RT-PCR), single cell PCR, restriction fragment length polymorphism PCR (PCR-RFLP), PCR-RFLP/RT-PCR-RFLP, hot start PCR, nested PCR, in situ polony PCR, in situ rolling circle amplification (RCA), bridge PCR, ligation mediated PCR, Qβ replicase amplification, inverse PCR, picotiter PCR and emulsion PCR. Other suitable amplification methods include the ligase chain reaction (LCR), transcription amplification, self-sustained sequence replication, selective amplification of target polynucleotide sequences, consensus sequence primed polymerase chain reaction (CP-PCR), arbitrarily primed polymerase chain reaction (AP-PCR), degenerate oligonucleotide-primed PCR (DOP-PCR) and nucleic acid-based sequence amplification (NABSA). Other amplification methods that can be used herein include those described in U.S. Patent Nos. 5,242,794; 5,494,810; 4,988,617; and 6,582,938.

[0165] In particular embodiments, PCR is used to amplify DNA molecules after they are dispensed into individual partitions. In some cases, one or more specific priming sequences within amplification adaptors are utilized for PCR amplification. The amplification adaptors may be ligated to fragmented DNA molecules before or after dispensing into individual partitions. Polynucleotides comprising amplification adaptors with suitable priming sequences on both ends can be PCR amplified exponentially. Polynucleotides with only one suitable priming sequence due to, for example, imperfect ligation efficiency of amplification adaptors comprising priming sequences, may only undergo linear amplification. Further, polynucleotides can be eliminated from amplification, for example, PCR amplification, altogether, if no adaptors comprising suitable priming sequences are ligated. In some embodiments, the number of PCR cycles varies between 10 and 30, but can be as low as 9, 8, 7, 6, 5, 4, 3, 2 or less or as high as 40, 45, 50, 55, 60 or more. As a result, exponentially amplifiable fragments carrying amplification adaptors with a suitable priming sequence can be present in much higher (1000-fold or more) concentration compared to linearly amplifiable or un-amplifiable fragments, after a PCR amplification. 
Benefits of PCR, as compared to whole genome amplification techniques (such as amplification with randomized primers or Multiple Displacement Amplification (MDA) using phi29 polymerase), include, but are not limited to: a more uniform relative sequence coverage, as each fragment can be copied at most once per cycle and as the amplification is controlled by the thermocycling program; a substantially lower rate of forming chimeric molecules than, for example, MDA (Lasken et al., 2007, BMC Biotechnology), as chimeric molecules pose significant challenges for accurate sequence assembly by presenting nonbiological sequences in the assembly graph, which may result in a higher rate of misassemblies or a highly ambiguous and fragmented assembly; reduced sequence-specific biases that may result from binding of randomized primers commonly used in MDA versus using specific priming sites with a specific sequence; a higher reproducibility in the amount of final amplified DNA product, which can be controlled by selection of the number of PCR cycles; and a higher fidelity in replication with the polymerases that are commonly used in PCR as compared to common whole genome amplification techniques.
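The difference between exponential and linear amplification can be made concrete with a short Python sketch. It assumes an idealized model that is not stated in the disclosure: a fragment with priming sequences on both ends doubles every cycle, while a fragment with a single priming sequence gains one copy per cycle from the original template. The function names and the efficiency parameter are hypothetical.

```python
def exponential_copies(cycles, efficiency=1.0):
    """Copies per starting molecule when both ends carry a priming sequence."""
    return (1.0 + efficiency) ** cycles


def linear_copies(cycles, efficiency=1.0):
    """Copies per starting molecule when only one end carries a priming sequence."""
    return 1.0 + efficiency * cycles


def first_cycle_reaching(fold, max_cycles=60):
    """Smallest cycle count at which the exponential/linear ratio is at least `fold`."""
    for n in range(1, max_cycles + 1):
        if exponential_copies(n) / linear_copies(n) >= fold:
            return n
    return None


for n in (10, 20, 30):
    print(n, exponential_copies(n) / linear_copies(n))
print(first_cycle_reaching(1000))
```

Under these idealized assumptions the 1000-fold excess is reached at 14 cycles, which falls within the 10 to 30 cycle range mentioned above; real reactions with lower per-cycle efficiency would need more cycles.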

[0166] In some embodiments, the fill-in reaction is followed by or performed as part of amplification of one or more target polynucleotides using a first primer and a second primer, wherein the first primer comprises a sequence that is hybridizable to at least a portion of the complement of one or more of the first adaptor oligonucleotides, and further wherein the second primer comprises a sequence that is hybridizable to at least a portion of the complement of one or more of the second adaptor oligonucleotides. Each of the first and second primers may be of any suitable length, such as about, less than about, or more than about 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 90, 100, or more nucleotides, any portion or all of which may be complementary to the corresponding target sequence (e.g., about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, or more nucleotides). For example, about 10 to 50 nucleotides can be complementary to the corresponding target sequence.

[0167] “Amplification” refers to any process by which the copy number of a target sequence is increased. In some cases, a replication reaction may produce only a single complementary copy / replica of a polynucleotide. Methods for primer-directed amplification of target polynucleotides include, without limitation, methods based on the polymerase chain reaction (PCR). Conditions favorable to the amplification of target sequences by PCR can be optimized at a variety of steps in the process, and depend on characteristics of elements in the reaction, such as target type, target concentration, sequence length to be amplified, sequence of the target and / or one or more primers, primer length, primer concentration, polymerase used, reaction volume, ratio of one or more elements to one or more other elements, and others, some or all of which can be altered. In general, PCR involves the steps of denaturation of the target to be amplified (if double stranded), hybridization of one or more primers to the target, and extension of the primers by a DNA polymerase, with the steps repeated (or “cycled”) in order to amplify the target sequence. Steps in this process can be optimized for various outcomes, such as to enhance yield, decrease the formation of spurious products, and / or increase or decrease specificity of primer annealing. Methods of optimization include, without limitation, adjustments to the type or number of elements in the amplification reaction and / or to the conditions of a given step in the process, such as temperature at a particular step, duration of a particular step, and / or number of cycles.

[0168] In some embodiments, an amplification reaction can comprise at least about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200 or more cycles. In some examples, an amplification reaction can comprise at least about 20, 25, 30, 35 or 40 cycles. In some embodiments, an amplification reaction comprises no more than about 5, 10, 15, 20, 25, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200 or more cycles. Cycles can contain any number of steps, such as 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 or more steps. Steps can comprise any temperature or gradient of temperatures, suitable for achieving the purpose of the given step including, but not limited to, 3’ end extension (e.g., adaptor fill-in), primer annealing, primer extension, and strand denaturation. Steps can be of any duration including, but not limited to, about, less than about, or more than about 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80, 90, 100, 120, 180, 240, 300, 360, 420, 480, 540, 600, 1200, 1800, or more seconds, including indefinitely until manually interrupted. Cycles of any number comprising different steps can be combined in any order. In some embodiments, different cycles comprising different steps are combined such that the total number of cycles in the combination is about, less than about, or more than about 5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 70, 80, 90, 100, 150, 200 or more cycles. In some embodiments, amplification is performed following the fill-in reaction.

[0169] In some embodiments, the amplification reaction can be carried out on at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 40, 50, 100, 200, 300, 400, 500, 600, 800, 1000 ng of the target DNA molecule. In other embodiments, the amplification reaction can be carried out on less than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 25, 30, 40, 50, 100, 200, 300, 400, 500, 600, 800, 1000 ng of the target DNA molecule.

[0170] Amplification can be performed before or after pooling of target polynucleotides from independent samples.

[0171] Methods of the disclosure involve determining an amount of amplifiable nucleic acid present in a sample. Any known method may be used to quantify amplifiable nucleic acid, and an exemplary method is the polymerase chain reaction (PCR), specifically quantitative polymerase chain reaction (qPCR). qPCR is a technique based on the polymerase chain reaction and is used to amplify and simultaneously quantify a targeted nucleic acid molecule. qPCR allows for both detection and quantification (as absolute number of copies or relative amount when normalized to DNA input or additional normalizing genes) of a specific sequence in a DNA sample. The procedure follows the general principle of polymerase chain reaction, with the additional feature that the amplified DNA is quantified as it accumulates in the reaction in real time after each amplification cycle. qPCR is described, for example, in Kurnit et al. (U.S. patent number 6,033,854), Wang et al. (U.S. patent numbers 5,567,583 and 5,348,853), Ma et al. (The Journal of American Science, 2(3), 2006), Heid et al. (Genome Research 986-994, 1996), Sambrook and Russell (Quantitative PCR, Cold Spring Harbor Protocols, 2006), and Higuchi (U.S. patent numbers 6,171,785 and 5,994,056). The contents of these are incorporated by reference herein in their entirety.

[0172] Other methods of quantification include use of fluorescent dyes that intercalate with double stranded DNA, and modified DNA oligonucleotide probes that fluoresce when hybridized with a complementary DNA. These methods can be broadly used but are also specifically adapted to real-time PCR as described in further detail as an example. In the first method, a DNA-binding dye binds to all double-stranded (ds)DNA in PCR, resulting in fluorescence of the dye. An increase in DNA product during PCR therefore leads to an increase in fluorescence intensity and is measured at each cycle, thus allowing DNA concentrations to be quantified. The reaction is prepared similarly to a standard PCR reaction, with the addition of fluorescent (ds)DNA dye. The reaction is run in a thermocycler, and after each cycle, the levels of fluorescence are measured with a detector; the dye only fluoresces when bound to the (ds)DNA (i.e., the PCR product). With reference to a standard dilution, the (ds)DNA concentration in the PCR can be determined. Like other real-time PCR methods, the values obtained do not have absolute units associated with them. A comparison of a measured DNA/RNA sample to a standard dilution gives a fraction or ratio of the sample relative to the standard, allowing relative comparisons between different tissues or experimental conditions. To ensure accuracy, the quantification and/or expression of a target gene can be normalized with respect to a stably expressed gene. Copy numbers of unknown genes can similarly be normalized relative to genes of known copy number.

[0173] The second method uses a sequence-specific RNA or DNA-based probe to quantify only the DNA containing a probe sequence; therefore, use of the reporter probe significantly increases specificity, and allows quantification even in the presence of some non-specific DNA amplification. This allows for multiplexing, i.e., assaying for several genes in the same reaction by using specific probes with differently colored labels, provided that all genes are amplified with similar efficiency.

[0174] This method is commonly carried out with a DNA-based probe with a fluorescent reporter (e.g., 6-carboxyfluorescein) at one end and a quencher (e.g., 6-carboxy-tetramethylrhodamine) of fluorescence at the opposite end of the probe. The close proximity of the reporter to the quencher prevents detection of its fluorescence. Breakdown of the probe by the 5’ to 3’ exonuclease activity of a polymerase (e.g., Taq polymerase) breaks the reporter-quencher proximity and thus allows unquenched emission of fluorescence, which can be detected. An increase in the product targeted by the reporter probe at each PCR cycle results in a proportional increase in fluorescence due to breakdown of the probe and release of the reporter. The reaction is prepared similarly to a standard PCR reaction, and the reporter probe is added. As the reaction commences, during the annealing stage of the PCR both probe and primers anneal to the DNA target. Polymerization of a new DNA strand is initiated from the primers, and once the polymerase reaches the probe, its 5’-3’ exonuclease degrades the probe, physically separating the fluorescent reporter from the quencher, resulting in an increase in fluorescence. Fluorescence is detected and measured in a real-time PCR thermocycler, and geometric increase of fluorescence corresponding to exponential increase of the product is used to determine the threshold cycle in each reaction.
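Determining a threshold cycle from a fluorescence trace can be sketched in Python as follows. This is a minimal illustration, not the instrument's algorithm: the function name threshold_cycle, the log-scale interpolation between the two bracketing cycles, and the synthetic trace are all assumptions made for the example.

```python
import math


def threshold_cycle(fluorescence, threshold):
    """Return the fractional cycle at which fluorescence first crosses `threshold`.

    fluorescence[i] is the background-subtracted reading after cycle i + 1.
    The crossing point is interpolated on a log scale between the two
    bracketing cycles. Returns None if the threshold is never crossed, or
    if the first reading is already at or above the threshold.
    """
    for i in range(1, len(fluorescence)):
        low, high = fluorescence[i - 1], fluorescence[i]
        if 0 < low < threshold <= high:
            fraction = math.log(threshold / low) / math.log(high / low)
            return i + fraction  # `low` was measured after cycle i
    return None


# Synthetic trace: signal doubles each cycle from an arbitrary starting level.
trace = [0.01 * 2 ** cycle for cycle in range(1, 16)]
print(threshold_cycle(trace, threshold=1.0))
```

For a trace that doubles every cycle, the interpolated value equals log2(threshold / starting level), so earlier crossings correspond to more starting template.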

[0175] Relative concentrations of DNA present during the exponential phase of the reaction are determined by plotting fluorescence against cycle number on a logarithmic scale (so an exponentially increasing quantity will give a straight line). A threshold for detection of fluorescence above background is determined. The cycle at which the fluorescence from a sample crosses the threshold is called the cycle threshold, Ct. Since the quantity of DNA doubles every cycle during the exponential phase, relative amounts of DNA can be calculated, e.g., a sample whose Ct is 3 cycles earlier than another's has 2^3 = 8 times more template. Amounts of nucleic acid (e.g., RNA or DNA) are then determined by comparing the results to a standard curve produced by a real-time PCR of serial dilutions (e.g., undiluted, 1:4, 1:16, 1:64) of a known amount of nucleic acid.
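The Ct arithmetic and the standard-curve comparison described above can be illustrated with a short Python sketch. The function names, the Ct values assigned to the dilution series, and the choice of an ordinary least-squares fit of Ct against log10(quantity) are assumptions made for illustration; the disclosure does not prescribe a particular fitting procedure.

```python
import math


def fold_difference(ct_a, ct_b):
    """Template in sample A relative to sample B, assuming doubling every cycle."""
    return 2.0 ** (ct_b - ct_a)


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the quantity for a measured Ct."""
    return 10.0 ** ((ct - intercept) / slope)


# A sample crossing the threshold 3 cycles earlier has 2^3 = 8 times more template.
print(fold_difference(ct_a=20.0, ct_b=23.0))

# Hypothetical standards: undiluted, 1:4, 1:16, 1:64 (relative quantities),
# with made-up Ct values that rise by 2 cycles per 4-fold dilution.
standards = [1.0, 1 / 4, 1 / 16, 1 / 64]
standard_cts = [20.0, 22.0, 24.0, 26.0]
slope, intercept = fit_standard_curve(standards, standard_cts)
print(quantity_from_ct(23.0, slope, intercept))
```

With these idealized standards, an unknown with a Ct of 23 falls midway between the 1:4 and 1:16 dilutions on the log scale, which corresponds to one eighth of the undiluted amount.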

[0176] In certain embodiments, the qPCR reaction involves a dual fluorophore approach that takes advantage of fluorescence resonance energy transfer (FRET), e.g., LIGHTCYCLER hybridization probes, where two oligonucleotide probes anneal to the amplicon (see, e.g., U.S. patent number 6,174,670). The oligonucleotides are designed to hybridize in a head-to-tail orientation with the fluorophores separated at a distance that is compatible with efficient energy transfer. Other examples of labeled oligonucleotides that are structured to emit a signal when bound to a nucleic acid or incorporated into an extension product include: SCORPIONS probes (e.g., Whitcombe et al., Nature Biotechnology 17:804-807, 1999, and U.S. patent number 6,326,145), Sunrise (or AMPLIFLUOR) primers (e.g., Nazarenko et al., Nuc. Acids Res. 25:2516-2521, 1997, and U.S. patent number 6,117,635), and LUX primers and MOLECULAR BEACONS probes (e.g., Tyagi et al., Nature Biotechnology 14:303-308, 1996 and U.S. patent number 5,989,823).

[0177] In other embodiments, a qPCR reaction uses fluorescent Taqman methodology and an instrument capable of measuring fluorescence in real time (e.g., ABI Prism 7700 Sequence Detector). The Taqman reaction uses a hybridization probe labeled with two different fluorescent dyes. One dye is a reporter dye (6-carboxyfluorescein), the other is a quenching dye (6-carboxy-tetramethylrhodamine). When the probe is intact, fluorescent energy transfer occurs and the reporter dye fluorescent emission is absorbed by the quenching dye. During the extension phase of the PCR cycle, the fluorescent hybridization probe is cleaved by the 5’-3’ nucleolytic activity of the DNA polymerase. On cleavage of the probe, the reporter dye emission is no longer transferred efficiently to the quenching dye, resulting in an increase of the reporter dye fluorescent emission spectra. Any nucleic acid quantification method, including real-time methods or single-point detection methods, may be used to quantify the amount of nucleic acid in the sample. The detection can be performed by several different methodologies (e.g., staining; hybridization with a labeled probe; incorporation of biotinylated primers followed by avidin-enzyme conjugate detection; incorporation of 32P-labeled deoxynucleotide triphosphates, such as dCTP or dATP, into the amplified segment), as well as any other suitable detection method for nucleic acid quantification. The quantification may or may not include an amplification step.

[0178] In some embodiments, the disclosure provides labels for identifying or quantifying the linked DNA segments. In some cases, the linked DNA segments can be labeled in order to assist in downstream applications, such as array hybridization. For example, the linked DNA segments can be labeled using random priming or nick translation.

[0179] A wide variety of labels (e.g., reporters) may be used to label the nucleotide sequences described herein including, but not limited to, during the amplification step. Suitable labels include radionuclides, enzymes, fluorescent, chemiluminescent, or chromogenic agents as well as ligands, cofactors, inhibitors, magnetic particles, and the like. Examples of such labels are included in U.S. Pat. No. 3,817,837; U.S. Pat. No. 3,850,752; U.S. Pat. No. 3,939,350; U.S. Pat. No. 3,996,345; U.S. Pat. No. 4,277,437; U.S. Pat. No. 4,275,149 and U.S. Pat. No. 4,366,241, each of which is incorporated by reference in its entirety.

[0180] Additional labels include, but are not limited to, β-galactosidase, invertase, green fluorescent protein, luciferase, chloramphenicol acetyltransferase, β-glucuronidase, exo-glucanase and glucoamylase. Fluorescent labels may also be used, as well as fluorescent reagents specifically synthesized with particular chemical properties. A wide variety of ways to measure fluorescence are available. For example, some fluorescent labels exhibit a change in excitation or emission spectra, some exhibit resonance energy transfer where one fluorescent reporter loses fluorescence, while a second gains in fluorescence, some exhibit a loss (quenching) or appearance of fluorescence, while some report rotational movements.

[0181] Further, in order to obtain sufficient material for labeling, multiple amplifications may be pooled, instead of increasing the number of amplification cycles per reaction. Alternatively, labeled nucleotides can be incorporated into the last cycles of the amplification reaction, e.g., 30 cycles of PCR (no label) +10 cycles of PCR (plus label).

[0182] In particular embodiments, the disclosure provides probes that can attach to the linked DNA segments. As used herein, the term “probe” refers to a molecule (e.g., an oligonucleotide, whether occurring naturally as in a purified restriction digest or produced synthetically, recombinantly or by PCR amplification), that is capable of hybridizing to another molecule of interest (e.g., another oligonucleotide). When probes are oligonucleotides, they may be single-stranded or double-stranded. Probes are useful in the detection, identification, and isolation of particular targets (e.g., gene sequences). In some cases, the probes may be associated with a label so that the probe is detectable in any detection system including, but not limited to, enzyme (e.g., ELISA, as well as enzyme-based histochemical assays), fluorescent, radioactive, and luminescent systems.

[0183] With respect to arrays and microarrays, the term “probe” is used to refer to any hybridizable material that is affixed to the array for the purpose of detecting a nucleotide sequence that has hybridized to said probe. In some cases, the probes can be about 10 bp to 500 bp, about 10 bp to 250 bp, about 20 bp to 250 bp, about 20 bp to 200 bp, about 25 bp to 200 bp, about 25 bp to 100 bp, about 30 bp to 100 bp, or about 30 bp to 80 bp in length. In some cases, the probes can be greater than about 10 bp, about 20 bp, about 30 bp, about 40 bp, about 50 bp, about 60 bp, about 70 bp, about 80 bp, about 90 bp, about 100 bp, about 150 bp, about 200 bp, about 250 bp, about 300 bp, about 400 bp, or about 500 bp in length. For example, the probes can be about 20 to about 50 bp in length. Examples and rationale for probe design can be found in WO 95/11995, EP 717,113 and WO 97/29212.

[0184] The probes, array of probes or set of probes can be immobilized on a support. Supports (e.g., solid supports) can be made of a variety of materials — such as glass, silica, plastic, nylon, or nitrocellulose. Supports can be rigid and have a planar surface. Supports can have from about 1 to 10,000,000 resolved loci. For example, a support can have about 10 to 10,000,000, about 10 to 5,000,000, about 100 to 5,000,000, about 100 to 4,000,000, about 1000 to 4,000,000, about 1000 to 3,000,000, about 10,000 to 3,000,000, about 10,000 to 2,000,000, about 100,000 to 2,000,000, or about 100,000 to 1,000,000 resolved loci. The density of resolved loci can be at least about 10, about 100, about 1000, about 10,000, about 100,000 or about 1,000,000 resolved loci within a square centimeter. In some cases, each resolved locus can be occupied by >95% of a single type of oligonucleotide. In other cases, each resolved locus can be occupied by pooled mixtures of probes or a set of probes. In further cases, some resolved loci are occupied by pooled mixtures of probes or a set of probes, and other resolved loci are occupied by >95% of a single type of oligonucleotide.

[0185] In some cases, the number of probes for a given nucleotide sequence on the array can be in large excess to the DNA sample to be hybridized to such array. For example, the array can have about 10, about 100, about 1000, about 10,000, about 100,000, about 1,000,000, about 10,000,000, or about 100,000,000 times the number of probes relative to the amount of DNA in the input sample.

[0186] In some cases, an array can have about 10, about 100, about 1000, about 10,000, about 100,000, about 1,000,000, about 10,000,000, about 100,000,000, or about 1,000,000,000 probes.

[0187] Arrays of probes or sets of probes may be synthesized in a step-by-step manner on a support or can be attached in presynthesized form. One method of synthesis is VLSIPS™ (as described in U.S. Pat. No. 5,143,854 and EP 476,014), which entails the use of light to direct the synthesis of oligonucleotide probes in high-density, miniaturized arrays. Algorithms for design of masks to reduce the number of synthesis cycles are described in U.S. Pat. No. 5,571,639 and U.S. Pat. No. 5,593,839. Arrays can also be synthesized in a combinatorial fashion by delivering monomers to cells of a support by mechanically constrained flowpaths, as described in EP 624,059. Arrays can also be synthesized by spotting reagents on to a support using an inkjet printer (see, for example, EP 728,520).

[0188] In some embodiments, the present disclosure provides methods for hybridizing the linked DNA segments onto an array. A “substrate” or an “array” is an intentionally created collection of nucleic acids which can be prepared either synthetically or biosynthetically and screened for biological activity in a variety of different formats (e.g., libraries of soluble molecules; and libraries of oligonucleotides tethered to resin beads, silica chips, or other solid supports). Additionally, the term “array” includes those libraries of nucleic acids which can be prepared by spotting nucleic acids of essentially any length (e.g., from 1 to about 1000 nucleotide monomers in length) onto a substrate.

[0189] Array technology and the various associated techniques and applications are described generally in numerous textbooks and documents. For example, these include Lemieux et al., 1998, Molecular Breeding 4, 277-289; Schena and Davis, Parallel Analysis with Biological Chips, in PCR Methods Manual (eds. M. Innis, D. Gelfand, J. Sninsky); Schena and Davis, 1999, Genes, Genomes and Chips, in DNA Microarrays: A Practical Approach (ed. M. Schena), Oxford University Press, Oxford, UK, 1999; The Chipping Forecast (Nature Genetics special issue; January 1999 Supplement); Mark Schena (Ed.), Microarray Biochip Technology, (Eaton Publishing Company); Cortese, 2000, The Scientist 14[17]:25; Gwynn and Page, Microarray analysis: the next revolution in molecular biology, Science, 1999 Aug. 6; and Eakins and Chu, 1999, Trends in Biotechnology, 17, 217-218.

[0190] In general, any library may be arranged in an orderly manner into an array, by spatially separating the members of the library. Examples of suitable libraries for arraying include nucleic acid libraries (including DNA, cDNA, oligonucleotide, etc. libraries), peptide, polypeptide, and protein libraries, as well as libraries comprising any molecules, such as ligand libraries, among others.

[0191] The library can be fixed or immobilized onto a solid phase (e.g., a solid substrate), to limit diffusion and admixing of the members. In some cases, libraries of DNA binding ligands may be prepared. In particular, the libraries may be immobilized to a substantially planar solid phase, including membranes and non-porous substrates such as plastic and glass. Furthermore, the library can be arranged in such a way that indexing (i.e., reference or access to a particular member) is facilitated. In some examples, the members of the library can be applied as spots in a grid formation. Common assay systems may be adapted for this purpose. For example, an array may be immobilized on the surface of a microplate, either with multiple members in a well, or with a single member in each well. Furthermore, the solid substrate may be a membrane, such as a nitrocellulose or nylon membrane (for example, membranes used in blotting experiments). Alternative substrates include glass, or silica-based substrates. Thus, the library can be immobilized by any suitable method, for example, by charge interactions, or by chemical coupling to the walls or bottom of the wells, or the surface of the membrane. Other means of arranging and fixing may be used, for example, pipetting, drop-touch, piezoelectric means, inkjet and bubblejet technology, electrostatic application, etc. In the case of silicon-based chips, photolithography may be utilized to arrange and fix the libraries on the chip.

[0192] The library may be arranged by being “spotted” onto the solid substrate; this may be done by hand or by making use of robotics to deposit the members. In general, arrays may be described as macroarrays or microarrays, the difference being the size of the spots. Macroarrays can contain spot sizes of about 300 microns or larger and may be easily imaged by existing gel and blot scanners. The spot sizes in microarrays can be less than 200 microns in diameter and these arrays usually contain thousands of spots. Thus, microarrays may require specialized robotics and imaging equipment, which may need to be custom made. Instrumentation is described generally in a review by Cortese, 2000, The Scientist 14[11]:26.

[0193] Techniques for producing immobilized libraries of DNA molecules have been described. Generally, most such methods describe how to synthesize single-stranded nucleic acid molecule libraries, using, for example, masking techniques to build up various permutations of sequences at the various discrete positions on the solid substrate. U.S. Pat. No. 5,837,832 describes an improved method for producing DNA arrays immobilized to silicon substrates based on very large-scale integration technology. In particular, U.S. Pat. No. 5,837,832 describes a strategy called “tiling” to synthesize specific sets of probes at spatially defined locations on a substrate which may be used to produce the immobilized DNA libraries of the present disclosure. U.S. Pat. No. 5,837,832 also provides references for earlier techniques that may also be used. In other cases, arrays may also be built using photo deposition chemistry.

[0194] Arrays of peptides (or peptidomimetics) may also be synthesized on a surface in a manner that places each distinct library member (e.g., unique peptide sequence) at a discrete, predefined location in the array. The identity of each library member is determined by its spatial location in the array. The locations in the array where binding interactions between a predetermined molecule (e.g., a target or probe) and reactive library members occur are determined, thereby identifying the sequences of the reactive library members on the basis of spatial location. These methods are described in U.S. Pat. No. 5,143,854; WO 90/15070 and WO 92/10092; Fodor et al. (1991) Science, 251: 767; Dower and Fodor (1991) Ann. Rep. Med. Chem., 26: 271.

[0195] To aid detection, labels can be used (as discussed above) — such as any readily detectable reporter, for example, a fluorescent, bioluminescent, phosphorescent, radioactive, etc. reporter. Such reporters, their detection, coupling to targets / probes, etc. are discussed elsewhere in this document. Labelling of probes and targets is also disclosed in Shalon et al., 1996, Genome Res 6(7):639-45.

[0196] Examples of some commercially available microarray formats are set out in Marshall and Hodgson, 1998, Nature Biotechnology, 16(1), 27-31.

[0197] In order to generate data from array-based assays, a signal can be detected to signify the presence or absence of hybridization between a probe and a nucleotide sequence. Further, direct and indirect labeling techniques can also be utilized. For example, direct labeling incorporates fluorescent dyes directly into the nucleotide sequences that hybridize to the array-associated probes (e.g., dyes are incorporated into nucleotide sequence by enzymatic synthesis in the presence of labeled nucleotides or PCR primers). Direct labeling schemes can yield strong hybridization signals, for example, by using families of fluorescent dyes with similar chemical structures and characteristics and can be simple to implement. In cases comprising direct labeling of nucleic acids, cyanine or alexa analogs can be utilized in multiple-fluor comparative array analyses. In other embodiments, indirect labeling schemes can be utilized to incorporate epitopes into the nucleic acids either prior to or after hybridization to the microarray probes. One or more staining procedures and reagents can be used to label the hybridized complex (e.g., a fluorescent molecule that binds to the epitopes, thereby providing a fluorescent signal by virtue of the conjugation of dye molecule to the epitope of the hybridized species).

Performance

[0198] Analysis conducted with the techniques disclosed herein can be performed at high accuracy. Analysis can be conducted with an accuracy of at least about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.9%, 99.99%, 99.999% or more. Analysis can be conducted with an accuracy of at least 70%. Analysis can be conducted with an accuracy of at least 80%. Analysis can be conducted with an accuracy of at least 90%.

[0199] Analysis conducted with the techniques disclosed herein can be performed at high specificity. Analysis can be conducted with a specificity of at least about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.9%, 99.99%, 99.999% or more. Analysis can be conducted with a specificity of at least 70%. Analysis can be conducted with a specificity of at least 80%. Analysis can be conducted with a specificity of at least 90%.

[0200] Analysis conducted with the techniques disclosed herein can be performed at high sensitivity. Analysis can be conducted with a sensitivity of at least about 50%, 60%, 70%, 75%, 80%, 85%, 90%, 95%, 96%, 97%, 98%, 99%, 99.9%, 99.99%, 99.999% or more. Analysis can be conducted with a sensitivity of at least 70%. Analysis can be conducted with a sensitivity of at least 80%. Analysis can be conducted with a sensitivity of at least 90%.
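The disclosure does not state how accuracy, specificity, and sensitivity are computed. As an illustration only, the sketch below assumes the conventional confusion-matrix definitions, applied for example to variant calls compared against a truth set. The function name `classification_metrics` and the example counts are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Conventional definitions (an assumption; the disclosure gives none).

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # fraction of real positives recovered
        "specificity": tn / (tn + fp),   # fraction of real negatives rejected
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=90, fp=5, tn=95, fn=10)
```

With these made-up counts, the sensitivity is 90 / (90 + 10) and the specificity is 95 / (95 + 5), so such a result would satisfy the "at least 90%" thresholds recited above.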

[0201] Use of the techniques of the present disclosure can improve the functioning of the computer systems on which they are implemented. For example, the techniques can reduce the processing time for a given analysis by at least about 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more. The techniques can reduce the memory requirements for a given analysis by at least about 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85%, 90%, 95%, or more.

[0202] Use of the techniques of the present disclosure can enable conducting analyses that were previously not possible. For example, certain genetic features can be detected from sequence information that would not be detectable from such information without the methods of the present disclosure.

Computer systems

[0203] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 1 shows a computer system 101 that is programmed or otherwise configured to obtain a phased genome. The computer system 101 can regulate various aspects of genome assembly of the present disclosure, such as, for example, identifying the presence of a k-mer in a sequence read, generating a k-mer from a reference genome, and combining sequences of a reference genome to generate an assembled genome. The computer system 101 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.

[0204] The computer system 101 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 105, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 101 also includes memory or memory location 110 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 115 (e.g., hard disk), communication interface 120 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 125, such as cache, other memory, data storage and / or electronic display adapters. The memory 110, storage unit 115, interface 120 and peripheral devices 125 are in communication with the CPU 105 through a communication bus (solid lines), such as a motherboard. The storage unit 115 can be a data storage unit (or data repository) for storing data. The computer system 101 can be operatively coupled to a computer network (“network”) 130 with the aid of the communication interface 120. The network 130 can be the Internet, an internet and / or extranet, or an intranet and / or extranet that is in communication with the Internet. The network 130 in some cases is a telecommunication and / or data network. The network 130 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 130, in some cases with the aid of the computer system 101, can implement a peer-to-peer network, which may enable devices coupled to the computer system 101 to behave as a client or a server.

[0205] The CPU 105 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 110. The instructions can be directed to the CPU 105, which can subsequently program or otherwise configure the CPU 105 to implement methods of the present disclosure. Examples of operations performed by the CPU 105 can include fetch, decode, execute, and writeback.

[0206] The CPU 105 can be part of a circuit, such as an integrated circuit. One or more other components of the system 101 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

[0207] The storage unit 115 can store files, such as drivers, libraries, and saved programs. The storage unit 115 can store user data, e.g., user preferences and user programs. The computer system 101 in some cases can include one or more additional data storage units that are external to the computer system 101, such as located on a remote server that is in communication with the computer system 101 through an intranet or the Internet.

[0208] The computer system 101 can communicate with one or more remote computer systems through the network 130. For instance, the computer system 101 can communicate with a remote computer system of a user (e.g., a researcher elucidating a genome for an individual). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 101 via the network 130.

[0209] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 101, such as, for example, on the memory 110 or electronic storage unit 115. The machine-executable or machine-readable code can be provided in the form of software. During use, the code can be executed by the processor 105. In some cases, the code can be retrieved from the storage unit 115 and stored on the memory 110 for ready access by the processor 105. In some situations, the electronic storage unit 115 can be precluded, and machine-executable instructions are stored on memory 110.

[0210] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

[0211] Aspects of the systems and methods provided herein, such as the computer system 101, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and / or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical, and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links, or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

[0212] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and / or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

[0213] The computer system 101 can include or be in communication with an electronic display 135 that comprises a user interface (UI) 140 for providing, for example, sequencing data or reference genome panels. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.

[0214] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. One or more algorithms can be implemented by way of software upon execution by the central processing unit 105. The one or more algorithms can, for example, align a plurality of k-mers to a reference genome, or identify a plurality of k-mers in a plurality of sequencing reads.

[0215] FIG. 2 illustrates a schematic of the experimental methods described herein. Samples, such as cells, blood, tissue, or plants, may be prepared. These samples may then be processed by fixation, fragmentation, and proximity ligation. The DNA may then be purified. The sample may then undergo library conversion. Finally, the sample may be sequenced and the data analyzed. This process up through sequencing may be accomplished within a single day. High-quality Hi-C linkages may be sufficient for scaffolding. High coverage SNP information may enable phasing.

[0216] Table 1 illustrates metrics of different library types. Table 1 illustrates that the methods described herein achieve favorable metrics compared to other technologies.

Table 1: Metrics of different library types.

[0217] FIG. 3 shows a pie graph of sample types that may be supported by the methods and systems described herein. Animal tissue 301, insect tissue 302, plants 303, marine tissue 304, and / or other samples 305 may be supported. The methods and systems described herein may support many sample types using lower initial sample inputs. Automation of the workflow may increase adoption by core labs.

Definitions

[0218] Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although any methods and reagents similar or equivalent to those described herein can be used in the practice of the disclosed methods and compositions, the exemplary methods and materials are now described.

[0219] As used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “contig” includes a plurality of such contigs and reference to “probing the physical layout of chromosomes” includes reference to one or more methods for probing the physical layout of chromosomes and equivalents thereof known to those skilled in the art, and so forth.

[0221] Also, the use of “and” means “and / or” unless stated otherwise. Similarly, “comprise,” “comprises,” “comprising,” “include,” “includes,” and “including” are interchangeable and not intended to be limiting.

[0222] It is to be further understood that where descriptions of various embodiments use the term “comprising,” those skilled in the art would understand that in some specific instances, an embodiment can be alternatively described using language “consisting essentially of” or “consisting of.”

[0223] The term “sequencing read” as used herein, refers to a fragment of DNA in which the sequence has been determined.

[0224] The term “subject” as used herein can refer to any eukaryotic or prokaryotic organism.

[0225] The term “contigs” as used herein, refers to contiguous regions of DNA sequence. “Contigs” can be determined by any number of methods known in the art, such as by comparing sequencing reads for overlapping sequences, and / or by comparing sequencing reads against a database of known sequences in order to identify which sequencing reads have a high probability of being contiguous.

[0226] The term “read pair” or “read-pair” as used herein can refer to two or more elements that are linked to provide sequence information. In some cases, the number of read-pairs can refer to the number of mappable read-pairs. In other cases, the number of read-pairs can refer to the total number of generated read-pairs.

[0227] The term “stabilized” as used herein can describe a sample that has been preserved or otherwise protected from degradation. In some cases, a stabilized sample is crosslinked or treated with a fixative or crosslinking agent. In some cases, a stabilized sample is treated with formaldehyde, formalin, paraformaldehyde, glutaraldehyde, osmium tetroxide, or the like.

[0228] The term “target genome” as used herein refers to a genome that is to be assembled from data pertaining to a subject. The target genome can be generated using the methods of genome assembly described in this disclosure.

[0229] The term “about” as used herein can describe a number, unless otherwise specified, as a range of values including that number plus or minus 10% of that number.

EXAMPLES

[0230] The following examples are given for the purpose of illustrating various embodiments of the invention and are not meant to limit the present invention in any fashion. The present examples, along with the methods described herein, are presently representative of preferred embodiments, are exemplary, and are not intended as limitations on the scope of the invention. Changes therein and other uses which are encompassed within the spirit of the invention as defined by the scope of the claims will occur to those skilled in the art.

Example 1: Genome Assembly

[0231] A sample was obtained from a subject and subjected to an OmniC library preparation to generate a plurality of sequence reads. OmniC libraries have several useful properties that are relevant to genome assembly. The OmniC library can have more uniform coverage of the genome than other proximity-ligation libraries, such as restriction-enzyme-based Hi-C or microC, thereby allowing for a more comprehensive sampling of the genome to observe allelic variants and their linkage relationships.

[0232] A reference panel of genomes, such as that of the Human Pangenome Reference Consortium, was used as a database. From the database, a pre-compiled dataset was generated comprising a set of bi-allelic, mappable single-nucleotide variants (SNPs) and their positions on each assembled genome in the reference panel. The set of SNPs was chosen such that each SNP satisfied the following criteria: (1) bi-allelic; (2) exists in a k-mer context devoid of other variation; and (3) exists in a k-mer context that is single-copy in all reference panel human genomes. These SNPs were used to generate k-mers that were centered at the position of the SNP, such that every assembled human genome haplotype should contain one or the other (but not both) of the two possible k-mers, which differ only by the base at the center of the k-mer (i.e., the SNP position). Once these SNPs and their genome-context k-mers were selected, each haplotype chromosome was reduced to an ordered list of these k-mers.
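The selection of SNP-centered k-mers and the reduction of each haplotype to an ordered k-mer list can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the toy haplotype sequences, the SNP positions, the value k = 5, and all function names are hypothetical, and the toy haplotypes are assumed to share coordinates. Criterion (2), a k-mer context devoid of other variation, is assumed to hold for the toy data and is not checked separately.

```python
def count_occurrences(seq, kmer):
    """Count occurrences of kmer in seq, including overlapping ones."""
    count, start = 0, seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)
    return count


def snp_kmer_pair(seq, pos, k, alleles):
    """Return the two k-mers centered on `pos` that differ only at the center base."""
    half = k // 2
    left, right = seq[pos - half:pos], seq[pos + 1:pos + half + 1]
    return tuple(left + base + right for base in alleles)


def is_usable(pair, haplotypes):
    """Every haplotype must carry exactly one k-mer of the pair, exactly once."""
    return all(
        sorted(count_occurrences(h, km) for km in pair) == [0, 1]
        for h in haplotypes
    )


def ordered_kmer_list(haplotype, markers):
    """Reduce a haplotype to the marker k-mers it contains, in positional order."""
    hits = [(haplotype.find(km), km) for km in markers]
    return [km for pos, km in sorted(hits) if pos >= 0]


# Hypothetical toy panel: three haplotypes differing at two bi-allelic SNPs.
panel = {
    "hapA": "TTACGGATCCGTAAGCTTGA",
    "hapB": "TTACAGATCCGTAACCTTGA",
    "hapC": "TTACGGATCCGTAACCTTGA",
}
snps = [(4, "GA"), (14, "GC")]   # (0-based position, the two alleles)
K = 5

markers = []
for pos, alleles in snps:
    pair = snp_kmer_pair(panel["hapA"], pos, K, alleles)
    if is_usable(pair, panel.values()):
        markers.extend(pair)

reduced = {name: ordered_kmer_list(seq, markers) for name, seq in panel.items()}
```

In this toy panel each haplotype collapses to two tokens, one per SNP, and `hapC` appears as a mosaic that shares its first token with `hapA` and its second with `hapB`.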

[0233] The Omni-C library was analyzed against the set of reference k-mers. Omni-C read pairs were analyzed for those that contained two reference k-mers, which indicated linkage information between two k-mers. Based on the linkage of k-mers identified in the Omni-C read pairs, an ordered list of k-mers that represented the genome was generated. This analysis can be performed independent of mapping any data to any reference genome and does not require that any mapping be performed. This analysis can use a graph data structure to infer and generate the order of the reference SNPs from observed data.
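As a rough sketch of how read pairs containing two reference k-mers could be turned into linkage evidence, the Python below counts co-occurring k-mer labels per read pair. The counts can be read as edge weights of a graph whose nodes are reference k-mers. The function names, k-mer labels, and reads are hypothetical; inferring the final k-mer order from the graph is not shown.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

# Hypothetical reference k-mers (k = 5) and their labels.
KMER_IDS = {"ACGTA": "snp1:G", "ACTTA": "snp1:T", "GGCTT": "snp2:C", "GGATT": "snp2:A"}


def kmers_in(seq: str, kmer_ids: Dict[str, str], k: int) -> Set[str]:
    """Labels of all reference k-mers found in one read."""
    return {kmer_ids[seq[i:i + k]] for i in range(len(seq) - k + 1) if seq[i:i + k] in kmer_ids}


def link_counts(read_pairs: Iterable[Tuple[str, str]], kmer_ids: Dict[str, str], k: int) -> Counter:
    """Count, for every pair of reference k-mers, the read pairs that contain both."""
    links: Counter = Counter()
    for read1, read2 in read_pairs:
        found = sorted(kmers_in(read1, kmer_ids, k) | kmers_in(read2, kmer_ids, k))
        for a, b in combinations(found, 2):
            links[(a, b)] += 1  # edge weight between two k-mer nodes
    return links


pairs = [
    ("TTACGTACC", "CGGCTTAA"),  # snp1:G linked to snp2:C
    ("TTACGTACC", "CGGCTTAA"),  # same linkage observed again
    ("TTACTTACC", "CGGATTAA"),  # snp1:T linked to snp2:A
    ("TTACGTACC", "AAAAAAAA"),  # only one reference k-mer: no linkage information
]
links = link_counts(pairs, KMER_IDS, 5)
```

No read is mapped to a reference genome at any point; only exact k-mer lookups are used.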

[0234] Once the Omni-C data was analyzed and a linked set of reference k-mers was generated, the linked list and the reference genome panel were used to construct the first-pass haplotype genome assemblies. Each reference genome in the panel was represented as an ordered list of reference k-mers. Similarly, the genome to be assembled was also then represented as an ordered list of reference k-mers.

[0235] The ordered list was used to determine the reference genome from which to copy at each section of each haplotype of the to-be-assembled genome. An alignment algorithm (e.g., a dynamic programming local alignment algorithm) was used to find the highest-scoring aligned segment (long and with few or no mismatches). The genomic segment represented by that highest-scoring alignment was then included in the final assembly. The next-highest-scoring alignment that did not overlap with the segment already selected was then identified and included in the assembly. The algorithm proceeded until each region of the genome was represented. At the end of this process, the result was an assembly of the target genome that was a simple composite of the best corresponding representations of that target genome from the reference panel.
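The greedy selection of non-overlapping best segments can be sketched as follows. This is a simplified stand-in rather than the alignment described above: it assumes every haplotype is written as a string of alleles over the same ordered SNP set, and it scores candidates by the length of exact-match runs instead of using a dynamic programming local alignment that tolerates mismatches. All names and data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[int, int, str]  # (start, end_exclusive, reference name)


def best_run(target: str, panel: Dict[str, str], covered: List[bool]) -> Optional[Segment]:
    """Longest exact-match run between the target and any reference over uncovered positions."""
    best: Optional[Segment] = None
    for name, ref in panel.items():
        start = None
        for i in range(len(target) + 1):
            ok = i < len(target) and i < len(ref) and not covered[i] and target[i] == ref[i]
            if ok and start is None:
                start = i
            elif not ok and start is not None:
                if best is None or i - start > best[1] - best[0]:
                    best = (start, i, name)
                start = None
    return best


def greedy_mosaic(target: str, panel: Dict[str, str]) -> List[Segment]:
    """Repeatedly take the best remaining segment until no matching segment is left."""
    covered = [False] * len(target)
    chosen: List[Segment] = []
    while True:
        seg = best_run(target, panel, covered)
        if seg is None:
            break
        chosen.append(seg)
        for i in range(seg[0], seg[1]):
            covered[i] = True
    return sorted(chosen)


target = "0011010110"  # toy target haplotype: one allele (0 or 1) per reference SNP
panel = {"refA": "0011011001", "refB": "1100010110"}
mosaic = greedy_mosaic(target, panel)
```

Here the target is reconstructed as a composite of `refA` for SNPs 0 to 5 and `refB` for SNPs 6 to 9. Positions that match no reference would remain uncovered in this sketch.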

[0236] A “polishing” step can be performed on the assembled genome. The Omni-C data may be mapped to the newly assembled haplotype-resolved genome. From the mapping and aligning, variants can be discovered that are present in the target individual but were not represented in the best reference panel individual. This method can minimize potential reference bias in variant discovery. Since the reference genome may be the newly assembled genome and not an unrelated reference, there may be significantly less reference bias, because the “reference genome” may have a higher similarity to the final assembled genome.

Example 2: Comparison of Data Scaffolding Methods

[0237] The data scaffolds of the methods and systems described herein were compared to Omni-C data in HiRise (https://bio.tools/hirise, incorporated herein by reference in its entirety). For plant, insect, and fish assemblies, either the methods herein or Omni-C data were utilized for scaffolding. When tested, all produced essentially similar assembly statistics.

[0238] Data from the methods described herein were also validated in YaHS (Yet another Hi-C Scaffolding tool, Zhou et al. 2022 Bioinformatics Volume 39, Issue 1, herein incorporated by reference) for scaffolding. YaHS is an open source package that was written by the Sanger team and adopted by VGP. YaHS may run significantly faster than HiRise, with YaHS taking minutes to hours while HiRise can run for hours to days.

[0239] It was found that YaHS performed better with fragmented genomes than HiRise. Table 2 displays these results. There was equivalent performance on a HiFi input genome for the sand bass sample for HiRise and YaHS. The hops sample had improved statistics with YaHS. For the barley sample, YaHS significantly outperformed HiRise, with no need for manual correction. Approximately 40X coverage was achieved.

Table 2: Comparison between YaHS and HiRise for Omni-C and the methods described herein.

[0240] Additionally, haplotype-resolved assemblies for an insect and a fish were generated with approximately 40X coverage using the methods and systems described herein, with no manual correction required. FIG. 4 displays contact maps and quality scores for the fish sample.

Example 3: Assay for Structural Variation

[0241] In an example, the methods and systems herein were able to capture SVs at 10X genomic coverage with more sensitivity than whole genome sequencing (WGS) at 30-80X coverage. This assay leveraged existing Illumina platforms and did not require additional platforms or equipment. The methods and systems herein were able to capture structural variation missed by WGS and RNA-seq. FIG. 5 illustrates structural variation captured by the methods and systems described herein that was missed by WGS and RNA-seq. For example, the Tloc(3;12) variation was detected by the methods and systems described herein, but not by WGS or RNA-seq. FIG. 6 illustrates an output screen for an assay using the methods described herein. Assay information, a genomic summary, a summary statement, and information about the SNVs or indels are shown.

Example 4: Tumor / Normal Mixing Experiment for Investigation of Limit of Detection and Exome Capabilities

[0242] FIG. 13 displays results of an HCC1187 tumor / normal mixing experiment with exome sequencing and a CNV xGen™ panel. Different libraries were made with samples with different percentages of tumor tissue relative to normal tissue (0%, 1%, 5%, 10%, 20%, and 100%). In the table on the left, for each library, the percent of that sample that was tumor, the total number of reads, the total number of non-duplicate reads, and the percent of reads that were duplicates are shown. In the plot on the right, exome capture coverage plots for each library are shown.

[0243] FIG. 14 shows VAF distributions and limit of detection plots for the exome sequencing / CNV xGen™ experiment. It was found that the limit of detection was 20% tumor content at approximately 200x sequencing coverage.

[0244] By comparison, FIG. 15 shows that exome capture with the methods and systems described herein detected structural variations down to 20% tumor fraction and below. On the left, simulated whole genome sequencing tumor / normal mixing experiments are shown. On the right, real-life exome tumor / normal mixing experiments with the methods and systems described herein are shown. The data indicate that the exome capture method of the methods and systems described herein retained sensitivity down to smaller tumor percentages than the WGS method. Structural variant signal was still present below 10% tumor content.

[0245] Other tools may not be able to detect structural variants below 5% tumor content. However, the signal may still be observable. For example, in FIG. 16, the arrow indicates a small signal observable on the matrix plot of the 5% tumor sample.

[0246] Transfer learning neural networks were also used to validate structural variant calls. ResNet18 was trained on a 1.2 million image ImageNet dataset. ResNet18 had 12 million parameters. VGG16 was trained on a 1.2 million image ImageNet dataset. VGG16 had 138 million parameters. Thousands of structural variant images were input into these neural networks to validate the results.

Example 5: Testing Methods with Clinical Non-Actionable Samples

[0247] In this example, a metastatic ovarian cancer sample with no known oncogenic driver was tested. The sample was a high-grade papillary serous cystadenocarcinoma, T3cNx, with a tumor fraction of 30% (by pathology). Three different methods were tested: xGen Exome with CNV using the methods described herein; whole genome shotgun sequencing; and whole genome with the methods described herein.

[0248] It was found that the sample had multiple structural alterations of unknown significance (FIG. 17). The overlap between whole-genome shotgun sequencing and the methods described herein was 94% for SNVs and 93.7% for indels.

[0249] Additionally, copy number alterations were discernible using a combination of exome and CNV probes. It was also possible to detect oncogenic variants. FIG. 18 shows a copy number ratio plot for the different chromosomes. The arrows indicate different copy number amplifications and deletions.

[0250] FIG. 19 shows a copy number ratio plot for part of chromosome 20. The focal chromosome 20 amplification had known oncogenes as well as poor prognostic markers. The amplified region contained at least two known genes, PTPN1 and SALL4.

[0251] FIG. 20 shows that the ovarian sample had a series of connected interchromosomal insertions. Additionally, as shown by FIG. 21, the chromosome 19 insertion sequences contained poor prognostic markers, such as XAB2, CCL25, SPINT2, and YIF1B. The methods and systems described herein may be used to detect chromoplexy.

Example 6: Pangenome-based Personalized Genomes and Cancer Reconstruction

[0252] In this example, a single short-read library produced with the methods and systems described herein enabled reconstruction of the cancer genome specific to an individual, and also enabled curation of molecular mechanisms for cancer state. FIG. 22 shows a schematic of the method used: (1) a sample was processed using the methods and systems described herein; (2) data from the cross-linking method was processed to create a personalized diploid genome; and (3) this personalized diploid genome was used for personalized genome analyses, such as tumor-only somatic variant identification, cancer genome reconstruction, and haplotype-specific chromatin conformation analyses.

Example 7: Improved Haplotypes through Genome Imputation

[0253] In this example, haplotype-resolved complete genome assemblies from the Human Pangenome Reference Consortium (HPRC) were used to impute complete, genome-wide haplotypes for a sample. A schematic of this is shown in FIG. 30. The approach used sets of bi-allelic k-mers that linked sequences derived from reference HPRC haplotypes to stitch together an imputed haplotype for a given sample.

[0254] An example workflow for a proof-of-concept study of an ovarian tumor is shown in FIG. 31. An ovarian tumor sample was processed with the methods herein to generate proximity ligation data, which was converted into FASTQ data. This FASTQ data was processed with KMC to produce a k-mer database (KFF). This k-mer database was then processed with vg and an HPRC pangenome graph to produce a personalized diploid subgraph for the individual. This personalized diploid subgraph was then processed using vg into two unphased haplotypes for the individual (FASTA files). These haplotypes were then aligned and processed to determine variants. Coding exons and a position transmap were also generated from the unphased haplotypes and analyzed.

[0255] FIG. 32A shows differences between the two unphased haplotypes for the largest 23 contigs. The cumulative size delta of the Hap1 - Hap2 differences was 2,144,586 base pairs. FIG. 32B shows the total size, number of contigs, and total size of the largest 23 contigs for the two unphased haplotypes, the diploid genotype (Unphased Hap1 + Unphased Hap2), and the reference genome (hg38). 98.9% of the haplotypes and genome were found in the 23 largest contigs, with the remaining 1.1% found in smaller contigs (970 bp - 23 Mb) that needed to be connected to the primary contigs.
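A summary statistic of the kind reported above, the share of an assembly contained in its N largest contigs, can be computed as in the short Python sketch below. The function name and the contig lengths are hypothetical toy values, not data from the example.

```python
from typing import Iterable


def fraction_in_largest(contig_lengths: Iterable[int], n: int) -> float:
    """Fraction of total assembly length contained in the n largest contigs."""
    ordered = sorted(contig_lengths, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


# Toy assembly: five contigs totalling 1,000 bp.
toy_lengths = [300, 500, 10, 150, 40]
share = fraction_in_largest(toy_lengths, 3)  # (500 + 300 + 150) / 1000
```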

[0256] There can be complications when working with personalized genomes. For example, annotation databases may need to be mapped from the reference genome coordinates (hg38 and / or T2T-CHM13) to their locations within personalized genomes. These annotations may include RefSeq transcripts (e.g., genes or exons) or clinically-important variants (e.g., KRAS p.G12V). FIG. 33 shows a schematic of different locations that KRAS Exon 1 could be mapped to in the two different haplotypes. Within a diploid genome, analyzing data from related sequences on the two parental chromosomes (e.g., both copies of the KRAS gene) may also require coordinate translation. A positional transmap derived from chromosomal alignments may help translate coordinates between different haplotypes / references in their linearized form. The vg toolset may natively enable translation between reference genomes, haplotypes, and personalized genomes within the pangenome graph.
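One way to picture a positional transmap is as a sorted list of gapless alignment blocks between the reference and a haplotype. The Python sketch below translates a reference coordinate through such blocks. The block representation, the function name, and the coordinates are assumptions made for illustration; this is not the format used by vg or by the example.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# (reference start, haplotype start, length); 0-based, gapless, sorted by reference start
Block = Tuple[int, int, int]


def translate(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a reference position to a haplotype position; None if it is not aligned."""
    starts = [b[0] for b in blocks]
    idx = bisect_right(starts, pos) - 1
    if idx < 0:
        return None  # position lies before the first aligned block
    ref_start, hap_start, length = blocks[idx]
    if pos >= ref_start + length:
        return None  # position falls in a gap (e.g., deleted in this haplotype)
    return hap_start + (pos - ref_start)


# Toy haplotype 1 carries a 5 bp insertion after reference position 99.
hap1_blocks = [(0, 0, 100), (100, 105, 200)]
# Toy haplotype 2 lacks reference positions 150-159 (a 10 bp deletion).
hap2_blocks = [(0, 0, 150), (160, 150, 140)]

on_hap1 = translate(120, hap1_blocks)  # shifted right by the insertion
on_hap2 = translate(120, hap2_blocks)  # unchanged, upstream of the deletion
deleted = translate(155, hap2_blocks)  # inside the deletion
```

The same reference position can therefore land at different coordinates on the two haplotypes, or have no counterpart at all on one of them.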

[0257] FIG. 34 shows an example alignment for a personalized diploid genome for an ovarian tumor. The alignment shows a 5 kb window on Hap 2 - Contig #10 (chr12). However, this data appears to be missing many reads. FIG. 35 shows the same alignment but with reads of mapping quality = 0 (MQ0) also shown. BWA frequently assigns MQ0 to reads that map ambiguously to more than one location in the genome. These reads belonged to homologous regions that are identical in both chromosomes (FIG. 36). As shown in FIG. 37, the parts of the alignment where reads appear to be “missing” represent regions of the genome that differ between the two copies of a chromosomal region (e.g., heterozygous / hemizygous). The arrows point to confirmed heterozygous variants in the ovarian tumor sample.
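The distinction between uniquely mapped reads and MQ0 reads can be made directly from SAM records, where the second column is the FLAG and the fifth column is the mapping quality. The sketch below is a hypothetical helper operating on made-up SAM lines; it skips header lines and unmapped reads (FLAG bit 0x4) and separates the remaining reads by mapping quality.

```python
from typing import Iterable, List, Tuple


def split_by_mapq(sam_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (names of reads with MAPQ > 0, names of mapped reads with MAPQ == 0)."""
    unique: List[str] = []
    mq0: List[str] = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 4:
            continue  # unmapped read
        (mq0 if int(fields[4]) == 0 else unique).append(fields[0])
    return unique, mq0


# Toy records against a diploid reference with hypothetical contig names.
toy_sam = [
    "@SQ\tSN:hap1_contig10\tLN:1000",
    "read1\t0\thap1_contig10\t100\t60\t50M\t*\t0\t0\t*\t*",   # haplotype-specific
    "read2\t0\thap1_contig10\t150\t0\t50M\t*\t0\t0\t*\t*",    # identical on both haplotypes
    "read3\t16\thap2_contig10\t150\t0\t50M\t*\t0\t0\t*\t*",   # identical on both haplotypes
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                    # unmapped
]
unique_reads, mq0_reads = split_by_mapq(toy_sam)
```

When reads are aligned to a personalized diploid genome, the uniquely mapped set would be expected to concentrate in regions that differ between the two haplotypes, consistent with the description above.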

[0258] FIG. 38 shows how putative somatic variants were identified. For Haplotype #1, 387,073 variants were found in the homozygous region, 30,895 variants were found near the site of variation, and 4,645 variants were found in the Hap1-only region. No annotation database (e.g., dbSNP) was needed to classify approximately 90% of the variants as germline.

[0259] FIG. 39 shows phasing information within the diploid alignments. The darker-shaded regions indicate regions of Haplotype 1 with more overall linkage to Haplotype 2. 19% of read islands on Contig 10 had more overall linkage to the opposite haplotype.

[0260] While various embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments described herein may be employed. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

Attorney Docket No. 45269-750.601

WHAT IS CLAIMED IS:

1. A method of genome assembly comprising:
(a) obtaining
(i) a plurality of sequencing reads derived from a subject’s genome, and
(ii) a set of ordered lists of k-mers derived from a panel of reference genomes, wherein each k-mer comprises a nucleotide sequence comprising a variant;
(b) identifying sequence reads comprising a pair of k-mers of one of the set of ordered lists of k-mers;
(c) generating an assembled genome by aligning the pair of k-mers of the sequence reads identified in (b) to k-mers in the set of ordered lists of k-mers.

2. The method of claim 1, wherein (c) comprises identifying an ordered list of k-mers from the set of ordered lists of k-mers that comprises the most k-mers from the plurality of sequence reads.

3. The method of claim 2, wherein (c) further comprises identifying a sequence from the panel of reference genomes that corresponds with the ordered list of k-mers that aligns with k-mers of the sequence reads identified in (b), thereby generating an identified reference sequence.

4. The method of claim 3, wherein (c) further comprises repeating said aligning and said identifying sequences of said reference genome for additional subsets of said ordered list of k-mers to determine a plurality of identified reference sequences.

5. The method of claim 4, wherein (c) further comprises combining said plurality of identified reference sequences to generate said assembled genome.

6. The method of claim 1, wherein (c) comprises denoting the pair of k-mers as linked k-mers.

7. The method of claim 6, wherein the denoting comprises generating a graph data structure wherein said k-mers of the pair of k-mers are connected nodes.

8. The method of any one of claims 1 to 7, wherein each ordered list of k-mers comprises a plurality of k-mers of a chromosome.

9. The method of any one of claims 1 to 8, wherein the variants are bi-allelic variants.

10. The method of any one of claims 1 to 9, wherein each k-mer of said set of ordered list of k-mers comprises no more than one variant.

11. The method of any one of claims 1 to 10, wherein each k-mer sequence of said set of k-mer sequences is present at no more than one time in each reference genome.

12. The method of any one of claims 1 to 11, further comprising(d) assembling the sequences of the plurality of sequencing reads into contigs; and(e) aligning the contigs to the assembled genome to identify additional variants.

13. The method of any one of claims 1 to 12, wherein said assembled genome comprises sequences derived from said panel of reference genomes.

14. The method of any one of claims 1 to 13, further comprising, prior to (a), sequencing a subject’s genome to generate the plurality of sequencing reads.

15. The method of any one of claims 1 to 14, wherein said sequencing reads are derived from nucleic acids subjected to a proximity-ligation reaction.

16. The method of any one of claims 1 to 15, wherein said plurality of sequencing reads is generated by cross-linking a sample derived from said subject, fragmenting nucleic acids in said sample to produce nucleic acid fragments, ligating said nucleic acid fragments to produce ligated nucleic acid fragments, reversing crosslinks, and sequencing said ligated nucleic acid fragments.

17. The method of claim 16, wherein said fragmenting comprises contacting the nucleic acids with a restriction endonuclease, a micrococcal endonuclease, a transposase or DNase I.

18. The method of claim 16, wherein said fragmenting comprises contacting the nucleic acids with DNase I.

19. A system for genome assembly, the system comprising:
at least one system memory comprising (i) a plurality of sequencing reads derived from a subject’s genome, and (ii) a set of k-mer sequences derived from a reference genome panel, wherein each k-mer comprises a nucleotide sequence comprising a variant; and
a computer memory coupled to said at least one system memory and programmed to:
(a) identify sequence reads comprising a pair of k-mers of one of the ordered lists of k-mers;
(b) generating an assembled genome by aligning the k-mers of the sequence reads identified in (a) to k-mers of the ordered lists of k-mers.

20. A computer readable medium or mediums for detecting a presence or an absence of cancer in a subject, wherein said computer readable medium or mediums comprising a set of instructions recorded thereon, wherein, when said set of instructions are executed by a processor, the following steps are implemented:
(a) obtaining
(i) a plurality of sequencing reads derived from a subject’s genome, and
(ii) a set of ordered lists of k-mers derived from a panel of reference genomes, wherein each k-mer comprises a nucleotide sequence comprising a variant;
(b) identifying sequence reads comprising a pair of k-mers of one of the ordered lists of k-mers;
(c) generating an assembled genome by aligning the k-mers of the sequence reads identified in (b) to k-mers of the ordered lists of k-mers.